When copying between two remote registries skopeo copy

[feature request][performance] sort layers by size before remote-to-remote copy

mirekphd commented on August 16, 2024

[feature request][performance] sort layers by size before remote-to-remote copy

Comments (8)

mtrmac commented on August 16, 2024

If I understand it correctly, the idea is that a layer that is 50% of the total image size should be started first, so that the others can all be pulled in parallel with the big one, and the total time is about the same as the time to pull the big layer, instead of spending time copying 10 smaller layers, and only after most of that is done, starting the big layer.

That only makes a difference for moderately unbalanced images, where the largest layer is probably > 1/6 of the total image size, but not something like 99%.

I think it’s an interesting optimization worth exploring. We can’t/shouldn’t do that for c/storage, and due to compatibility we’d need an opt-in anyway, but that’s not too bad.

We might want to think about the UI impact — e.g. should we list the progress bars in the original image order, to show the user what’s going on? Currently we create the progress bars only when we start pulling, in order. That might end up being the most complex part of the feature.
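The scheduling argument in this comment can be checked with a toy model. This is only a sketch under stated assumptions, not how c/image schedules copies: it assumes 6 parallel transfers (chosen to match the 1/6 figure above), assumes each transfer takes time proportional to its size regardless of what else is running, and uses a hypothetical helper named `makespan`.

```python
import heapq

def makespan(sizes, workers=6):
    """Total time when transfers start in list order, each on the first free worker.

    Assumption: a transfer of size s takes s time units, independent of the others.
    """
    free_at = [0.0] * workers          # time at which each worker becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for size in sizes:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + size
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish

# Hypothetical layer sizes (arbitrary units), in manifest order.
moderate = [2] * 12 + [10]     # largest layer is about 29% of the total
dominant = [1] * 10 + [990]    # largest layer is 99% of the total

for name, sizes in [("moderate", moderate), ("dominant", dominant)]:
    in_order = makespan(sizes)
    largest_first = makespan(sorted(sizes, reverse=True))
    print(name, in_order, largest_first)
```

In this model the moderately unbalanced image finishes in 10 units instead of 14 when the largest layer starts first, while the 99% image goes from 991 to 990, which is the "not something like 99%" case.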

vrothberg commented on August 16, 2024

Thanks for reaching out.

Can you elaborate on why the order matters?

As for pulls: the order must be preserved as the layers must be applied to the local storage in the exact order.

mirekphd commented on August 16, 2024

Sufficient imbalance is practically guaranteed in many containerized Python applications, for example, which have to be based on Ubuntu, so the base image layer is much larger than the application layers (all the way up to the NVIDIA CUDA images with their astoundingly heavy 3.5... GB base images). The catch is that such unbalanced images are likely pre-sorted already, because the base layer comes first, so sorting by size might not make much of a difference in practice.

On the other hand, the forking has to be done anyway, and altering its sequence does not add any extra overhead. So unless there is some noticeable overhead in gathering layer sizes and sorting them, or in accessing server-side layers "out of order", this new method should always outperform the current one, however small or unnoticeable the gain (and the performance gains should be double, because they should also be achievable during the push phase). I suspect the main reason this has not been done already is the way the legacy system from which skopeo inherited operates. The docker pull, however, has a very different use case: to run the container after the pull is complete, rather than to immediately push it somewhere else.

mtrmac commented on August 16, 2024

The way c/storage is set up, pulls must create layers from base to the last child, in order (they have parent links).

Now, whether that’s a 100% hard requirement, where we just can’t create the child before the parent, or more of an implementation choice, depends on the graph driver (it’s 100% hard for device-mapper-snapshots, and it might be a choice for overlay, but I’m not quite sure). Even if it were 100% an implementation choice, that would be a pretty large implementation effort (we would need to have a concept of an extracted diff that is not yet a layer, a mechanism to turn that into a layer quickly, and a cleanup mechanism to delete that extracted diff on unexpected aborts).

For direct registry-to-registry copies, this should be quite easy to do; the progress UI is the hardest part, the rest is just mechanical work. (But note that such copies are not pulls+pushes with a disk intermediary; they are direct streaming copies, so there are no “double” gains.)

For pushes, I think it’s the same as registry-to-registry copies, but there’s a small chance I’m missing something.
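For the registry-to-registry case, the "mechanical" part could look roughly like the sketch below: compute a start order by size while keeping the manifest indices, so that a progress UI could still list layers in their original order. The function name `start_order` and the sample sizes are hypothetical, and nothing here reflects actual c/image code.

```python
def start_order(sizes):
    """Return layer indices in the order transfers would be started.

    Largest layer first; sorted() is stable, so equal sizes keep manifest order.
    """
    return sorted(range(len(sizes)), key=lambda i: -sizes[i])

# Hypothetical layer sizes in MB, in manifest (base-first) order.
sizes = [300, 20, 3500, 20]
order = start_order(sizes)

# The display could still iterate in manifest order and show each layer's start rank.
for index, size in enumerate(sizes):
    print(f"layer {index}: {size} MB, started as number {order.index(index) + 1}")
```

Because only the start sequence changes, the manifest written to the destination keeps its original layer order.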

github-actions commented on August 16, 2024

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented on August 16, 2024

You would also take up more temporary space, as the blobs would exist on disk for a longer period of time. Currently, once a blob is completely downloaded, that layer is applied to storage and the blob is removed.

But if this is a minor change, I think we should do it.
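The temporary-space concern for pulls into local storage can be illustrated with a simplified model. It assumes downloads finish one at a time in a given order, that a layer can only be applied once all of its parents are applied, and that a downloaded blob stays on disk until its layer is applied. The helper `peak_staged_bytes` and the sizes are hypothetical, and this is not a description of how c/storage behaves.

```python
def peak_staged_bytes(sizes, completion_order):
    """Peak size of downloaded-but-not-yet-applied blobs.

    sizes[i] is the size of layer i in manifest (base-first) order;
    completion_order lists layer indices in the order their downloads finish.
    """
    staged = {}          # layer index -> size, waiting on disk
    next_to_apply = 0    # layers must be applied strictly in manifest order
    peak = 0
    for index in completion_order:
        staged[index] = sizes[index]
        peak = max(peak, sum(staged.values()))
        # Apply (and delete) every blob whose parents are already applied.
        while next_to_apply in staged:
            del staged[next_to_apply]
            next_to_apply += 1
    return peak

sizes = [5, 1, 1, 10]                              # hypothetical sizes, base layer first
print(peak_staged_bytes(sizes, [0, 1, 2, 3]))      # downloads finish in manifest order
print(peak_staged_bytes(sizes, [3, 0, 1, 2]))      # largest layer finishes first
```

In this example the in-order case never holds more than 10 units, while finishing the largest (last) layer first holds it on disk until every parent is applied, for a peak of 15.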

github-actions commented on August 16, 2024

A friendly reminder that this issue had no activity for 30 days.

mtrmac commented on August 16, 2024

Moving to c/image; this would be transparent to Skopeo itself.
