
Comments (7)

tasket commented on September 26, 2024

To me, the selection of a backup tool running in dom0 comes down to three criteria:

  1. Does it need interactive access to the destination media/filesystem? If so, it cannot be used.

  2. Does it scan all data to find deltas? This is what most tools do, and it's not terrible in usual practice because they can skip many small files according to modification date (see the sketch after this list). But in dom0, nearly all our data is a handful of huge image files, so mod date becomes too coarse an indicator to be helpful in the majority of use cases. This is not a deal-breaker, but suffering it means the only efficiency gain we can anticipate is in storage space.

  3. Does the storage format allow old backups to be pruned arbitrarily by date, without compromising the integrity of the backup set? If not, the storage efficiency over non-incremental backups like qvm-backup will be marginal.
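To make criterion 2 concrete, here is a minimal sketch of the modification-date heuristic most file-based tools rely on, assuming a tool that re-reads a file only when its size or mtime changed (the function and index layout are mine, not from any particular tool):

```python
import os

def needs_rescan(path, index):
    """Return True if a file must be re-read to find deltas.

    `index` maps path -> (size, mtime_ns) from the previous run.
    This works well for many small files, but a single changed block
    inside a 100 GiB image file updates its mtime and forces the tool
    to re-read and re-chunk the entire image.
    """
    st = os.stat(path)
    return index.get(path) != (st.st_size, st.st_mtime_ns)
```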

Factors being traded-off:

  • Space on destination
  • Space on source
  • Network intensity
  • CPU usage
  • Disk intensity
  • Manageability of backup sets
  • Overall speed

Factors that cannot be traded:

  • Dom0 isolation
  • Encryption and verification layer

Duplicity doesn't excel at any of the trade-off factors except space on source. On laptops using SSDs it will incur high CPU and disk activity (which I think is a poor tradeoff for this type of equipment), and repeated full backups will mean elevated network intensity and delays.

A person keeping hundreds of GB of static reference data should not suffer a large penalty when they modify a small detail in a large disk image. But I came to the conclusion that an optimal tool cannot be selected for this use case, because it doesn't exist. What's missing is a runtime (dom0) environment that efficiently flags deltas, like an archive bit at the block level.
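To make the missing piece concrete, a toy sketch of what such a block-level archive bit could look like: a dirty bitmap that the storage layer sets on every write and the backup tool clears after reading. Everything here is hypothetical; no such dom0 facility exists today.

```python
BLOCK = 1 << 20  # 1 MiB tracking granularity (arbitrary for this sketch)

class DirtyBitmap:
    """Hypothetical block-level 'archive bit' for one disk image."""

    def __init__(self, volume_size):
        self.bits = bytearray((volume_size + BLOCK - 1) // BLOCK)

    def mark_write(self, offset, length):
        # The storage layer would call this on every guest write.
        for i in range(offset // BLOCK, (offset + length - 1) // BLOCK + 1):
            self.bits[i] = 1

    def dirty_blocks(self):
        return [i for i, b in enumerate(self.bits) if b]

    def clear(self):
        # The backup tool calls this after reading the dirty blocks.
        self.bits = bytearray(len(self.bits))

# A backup session would read only bitmap.dirty_blocks() from the image
# instead of scanning the whole file for deltas.
```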

Snapshot-capable storage systems (Btrfs, ZFS, thin-provisioned LVM) do track deltas in a fine-grained way (even for large image files and volumes), but they force the user to hang onto a significant amount of old data on the source PC. Their backup tools (e.g., btrfs send) also assume some interactive control of the destination media, so isolation is an issue.

Apple's Time Machine may represent the best trade-off for all factors, since it was designed for high-frequency backups over the Internet as well as the ability to manage sets without decrypting them. Time Machine uses sparsebundle bands to chunk data, which has a number of benefits. Although TM assumes interactive control of the destination, backup sets consisting of sparsebundles are flexible enough to be managed without full interactive access (e.g. a backupVM could easily perform necessary hardlinks or deletions without any back-and-forth between it and dom0, and increased trust isn't required for those tasks).
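For illustration, a simplified sketch of the band idea (the 8 MiB default band size is Time Machine's; everything else here is my own simplification, and real sparsebundles add a plist and directory layout): the image is split into fixed-size band files, and a sync rewrites only the bands whose content changed, so managing the set needs nothing beyond put/delete of opaque files.

```python
import hashlib

BAND = 8 * 1024 * 1024  # Time Machine's default sparsebundle band size

def changed_bands(image_path, old_hashes):
    """Yield (index, data, digest) for each band that differs from last sync.

    `old_hashes` maps band index -> hex digest from the previous sync.
    Unchanged bands produce no traffic at all; a changed band becomes
    one whole-file put on the destination.
    """
    with open(image_path, "rb") as f:
        i = 0
        while True:
            band = f.read(BAND)
            if not band:
                break
            digest = hashlib.sha256(band).hexdigest()
            if old_hashes.get(i) != digest:
                yield i, band, digest
            i += 1
```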

Something like Time Machine could be cobbled together using existing tools, but it would involve a FUSE filesystem which is not efficient for normal PC operations. It could be done better.


tasket commented on September 26, 2024

Here is a list of backup tools that may help you with your own list: https://github.com/restic/others

And another: http://changelog.complete.org/archives/9353-roundup-of-remote-encrypted-deduplicated-backups-in-linux

The ones that have caught my attention for possible use on Qubes are restic, Zbackup, ddar and bup. But they all have significant trade-offs.


v6ak commented on September 26, 2024


tasket commented on September 26, 2024

Hello,
Thank you for the list and for the comments.
To me, the selection of a backup tool running in dom0 comes down to three criteria:

  1. Does it need interactive access to the destination media/filesystem? If so, it cannot be used.
    It depends on what you mean by filesystem. Some backends (e.g., Duplicati and Duplicity) have a rather limited set of basic operations (list, get, put, delete). In my QubesInterVMBackend for Duplicity, I allow only a limited set of characters in order to mitigate attacks by malformed filenames (roughly the whitelist sketched below). If the files are properly authenticated and properly checked (which is something I would want anyway, for obvious reasons), I don't see any problem here. If they aren't properly checked, then we are in trouble when restoring.
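For illustration, roughly how such a character whitelist could look; this is my own sketch of the idea, not the actual code from QubesInterVMBackend:

```python
import re

# Tiny ASCII whitelist; everything else (path separators, control
# characters, UTF-8 tricks) is rejected outright.
SAFE_NAME = re.compile(r"\A[A-Za-z0-9._-]{1,100}\Z")

def check_remote_filename(name):
    # Also reject names the regex would allow but that could still be
    # misparsed: "." / ".." and leading dashes that look like options.
    if name in (".", "..") or name.startswith("-") or not SAFE_NAME.match(name):
        raise ValueError("refusing suspicious backend filename: %r" % name)
    return name
```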

Storage complexity is part of the risk, yes, but so is the type of transaction... which sounds interactive (not push) in this case.

I believe best practice here would be to follow qvm-backup's example and have dom0 push data and commands to a backupVM, one-way. The only exception would be the reception of short status codes (success/fail) and the non-parsed feedback one typically sees with qvm-run.

As I mentioned earlier, I believe the push model is possible with incremental backup tools, just not all of them.
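A minimal sketch of that push model, assuming dom0 and a backup VM named "backup" (the VM name and remote command are illustrative): dom0 streams data into qvm-run's stdin and reads back nothing but the exit status.

```python
import subprocess

def push_to_backup_vm(stream, vm="backup",
                      remote_cmd="/usr/local/bin/receive-backup"):
    """Stream backup data from dom0 into a VM, one-way.

    `vm` and `remote_cmd` are hypothetical. Data and commands flow only
    into the VM; nothing is read back except the exit status.
    """
    proc = subprocess.Popen(
        ["qvm-run", "--pass-io", "--no-gui", vm, remote_cmd],
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,  # deliberately discard VM output
    )
    for chunk in stream:
        proc.stdin.write(chunk)
    proc.stdin.close()
    return proc.wait() == 0  # the only feedback: success or failure
```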

Especially if we implement Merkle-tree-based authentication (which is something I want anyway), there is virtually no attack surface. Well, an attacker that controls the storage or the BackupStorageVM could still interrupt the restore process or remove backups, but nothing worse.

Sounds interesting.
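A sketch of the Merkle-tree authentication idea above, under my own simplified framing: dom0 keeps only the root hash, so a malicious BackupStorageVM can withhold or delete data but cannot alter any chunk without the mismatch being detected at restore time.

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).digest()

def merkle_root(chunks):
    level = [_h(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the odd leaf
        level = [_h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# dom0 computes and keeps the root at backup time...
trusted_root = merkle_root([b"chunk0", b"chunk1", b"chunk2"])

# ...and refuses anything the storage VM returns that doesn't match it.
def verify_restore(chunks):
    if merkle_root(chunks) != trusted_root:
        raise ValueError("backup storage returned tampered data")
    return chunks
```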

  2. Does it scan all data to find deltas? This is what most tools do, and it's not terrible in usual practice because they can skip many small files according to modification date. But in dom0, nearly all our data is a handful of huge image files, so mod date becomes too coarse an indicator to be helpful in the majority of use cases. This is not a deal-breaker, but suffering it means the only efficiency gain we can anticipate is in storage space.
    Good point. However, I am not planning to back up VMs from dom0 at this level.

Ah. The readme was not terribly clear on that point, so I assumed the backup was handling image files directly.

Note this is still an efficiency issue for users with large files: databases, video footage, etc.

  3. Does the storage format allow old backups to be pruned arbitrarily by date, without compromising the integrity of the backup set? If not, the storage efficiency over non-incremental backups like qvm-backup will be marginal.
    Also a good point. But I don't think the advantage is marginal:
  • Imagine you perform a full backup once per three months and an incremental backup once per week (or even more often). The weekly backup is going to be tiny compared to the full backup. (It depends on how fast the data change.)

This leads to nasty surprises, however, when space is not monitored carefully and one cannot quite fit an incremental session on the backup media... if you 'prune', you may be faced with erasing a full backup (and then performing a new one), or at least erasing days' worth of incremental data before the current state can be backed up via a larger/longer session. These are bad choices to give the user.

A true pruning capability means the size of the current backup session won't increase when backup disk space has to be recovered. And it means recent backups will likely be preserved when space becomes tight... in such a situation the user can choose any sessions for deletion with no additional impact (the user has to think only about which dates are no longer valuable, or can let the backup tool automatically remove the oldest, etc.).
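To make "true pruning" concrete, a toy sketch of a content-addressed store (all names are illustrative, not from any particular tool): sessions are manifests of chunk hashes, so deleting any session just drops its manifest and garbage-collects the chunks no other manifest references, and the size of the next session never grows because of it.

```python
import hashlib

store = {}      # chunk hash -> chunk bytes (the shared chunk pool)
manifests = {}  # session name -> ordered list of chunk hashes

def backup(session, chunks):
    refs = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # dedup: unchanged chunks are free
        refs.append(digest)
    manifests[session] = refs

def prune(session):
    del manifests[session]  # any session, in any order
    live = {h for refs in manifests.values() for h in refs}
    for digest in list(store):
        if digest not in live:
            del store[digest]  # GC chunks no session references

backup("2024-06-01", [b"A", b"B"])
backup("2024-06-08", [b"A", b"C"])  # only b"C" is uploaded
prune("2024-06-01")                  # frees b"B"; "2024-06-08" stays intact
```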

  • File-based backups don't back up free space and can exclude ~/.cache etc., so even a full backup is smaller.

Even qvm-backup skips unallocated blocks. But skipping .cache is a good point; I wish Qubes treated it as a separate volume.

Factors that cannot be traded:

  • Dom0 isolation
    I agree. Even now, I am trying to aggressively sanitize data, and not only for dom0. I want to use ASCII over UTF-8 where possible, avoid unlimited-size buffers, etc.

If there is complex structure/meaning in the ASCII, it can be a hazard anyway... a word of caution.

  Something like Time Machine could be cobbled together using existing tools, but it would involve a FUSE filesystem, which is not efficient for normal PC operations. It could be done better.
    Sure. But you can opt to use it just for some VMs. Once I implement support for multiple backends, you will be able to pick different tradeoffs for different VMs.

That would be nice. (Sorry for the flattened reply structure... I used GH quoting function.)


v6ak commented on September 26, 2024


v6ak commented on September 26, 2024

I have tried Borg. It looks like deduplicated and compressed full FS snapshots. Like a Merkle tree, but a DAG instead of a tree. It seems to support prune well. So far, cool.

Deduplication is performed on chunks smaller than one file. However, the number of files stored in repo/data (so-called "segments") seems to be significantly lower than the number of unique chunks (or even than the number of source files). It seems that every segment is created once and then never updated.

If prune deletes a reference to a directory node and performs GC, then it might need to re-upload and reorganize many segments, just because of removing one chunk from a segment. It definitely does not come for free.

Looking at a backup of my /usr, those files vary between 2.5 MiB and 13 MiB. (I hope it is not a viable side channel…) The official documentation mentions a 5 MiB file size. When backing up 100 GiB, this seems to result in roughly 20K files, which is not a small number, but it might be acceptable. If Borg used larger segments, then more reorganization would be needed.
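A toy illustration of the segment layout described above (my reading of Borg's design, heavily simplified): chunks are appended into segment files up to a target size, and a sealed segment is never modified, so freeing one chunk eventually means rewriting its surviving neighbors into a new segment.

```python
TARGET = 5 * 1024 * 1024  # ~5 MiB, the size the Borg docs mention

def pack_into_segments(chunks):
    """Append chunks into segment 'files' up to a target size."""
    segments, current, size = [], [], 0
    for chunk in chunks:
        if current and size + len(chunk) > TARGET:
            segments.append(current)  # sealed: never updated afterwards
            current, size = [], 0
        current.append(chunk)
        size += len(chunk)
    if current:
        segments.append(current)
    return segments

# 100 GiB of unique data at ~5 MiB per segment is about 20,480 segment
# files, matching the rough 20K figure above. Deleting one chunk later
# means compacting its whole segment into a new file.
```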

The challenge for Borg comes with using a custom storage backend – we want neither file-based storage nor SSH-based transmission. I see three ways there:

a. Create some server that proxies between the Borg RPC and the BackupStorageVM. This would require implementing the 18 methods of the Borg RPC and listening for SSH on loopback. Sounds insane.
b. Create a FUSE-based filesystem that does the same (sketched below). Maybe easier, maybe more hacky, but probably more universal, because any backup software that can back up to a filesystem would be able to use it.
c. Patch Borg. I don't see many advantages over option a.
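A rough, read-only sketch of option b, assuming the fusepy package and a BackupStorageVM reachable via qvm-run; the VM name, remote path, and the decision to shell out per operation are all illustrative, and real code would have to sanitize every path the way the filename whitelist above does.

```python
import stat
import subprocess
from errno import ENOENT
from fuse import FUSE, FuseOSError, Operations

REMOTE_ROOT = "/home/user/backups"  # assumed path inside the storage VM

class BackupVMFS(Operations):
    """Read-only passthrough to files stored in a BackupStorageVM."""

    def __init__(self, vm):
        self.vm = vm

    def _qvm(self, cmd):
        # Push one command into the VM; only raw stdout comes back and
        # is treated as untrusted bytes.
        return subprocess.check_output(["qvm-run", "--pass-io", self.vm, cmd])

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        try:
            out = self._qvm("stat -c %%s -- %s" % (REMOTE_ROOT + path))
            return {"st_mode": stat.S_IFREG | 0o444,
                    "st_nlink": 1,
                    "st_size": int(out)}
        except (subprocess.CalledProcessError, ValueError):
            raise FuseOSError(ENOENT)

    def readdir(self, path, fh):
        out = self._qvm("ls -1 -- %s" % (REMOTE_ROOT + path))
        return [".", ".."] + out.decode("ascii", "replace").split()

    def read(self, path, size, offset, fh):
        return self._qvm("dd status=none if=%s bs=1 skip=%d count=%d"
                         % (REMOTE_ROOT + path, offset, size))

if __name__ == "__main__":
    # Mount point and VM name are illustrative.
    FUSE(BackupVMFS("backup-storage"), "/mnt/backupfs",
         foreground=True, ro=True)
```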

A note on running file-based backup on img files in dom0: after all, maybe this is not a bad idea for some special cases, though it is something different from what I originally planned. A pitfall: the backup backend would have to be able to treat block devices as files, as Qubes 4 switches to LVM. Another pitfall: I would not recommend doing this on a running VM without cloning the volume first (see the snapshot sketch below).
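A hedged sketch of that "clone the volume first" advice, assuming a thin LVM pool as in Qubes 4 (the volume-group and volume names are illustrative): take a thin snapshot, activate it (thin snapshots carry the activation-skip flag, hence -K), hand its block device to the file-based tool, then remove it.

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def volume_snapshot(vg, lv):
    """Yield a stable block-device path for a possibly-running VM volume.

    Thin snapshots need no size argument, but they carry the
    'activation skip' flag, hence the -K on lvchange.
    """
    snap = "%s-backup-snap" % lv
    subprocess.check_call(["lvcreate", "-s", "-n", snap, "%s/%s" % (vg, lv)])
    try:
        subprocess.check_call(["lvchange", "-ay", "-K", "%s/%s" % (vg, snap)])
        yield "/dev/%s/%s" % (vg, snap)
    finally:
        subprocess.check_call(["lvremove", "-f", "%s/%s" % (vg, snap)])

# Usage (names hypothetical):
#   with volume_snapshot("qubes_dom0", "vm-work-private") as dev:
#       ...hand `dev` to a backup tool that can read block devices as files
```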


v6ak commented on September 26, 2024

We can hardly add additional backends before #37. Well, we theoretically could, but we would have to, for example, authenticate the backend name, which is not so easy.

I am not against discussing it now. I am just explaining what we are waiting for.

Currently, Borg looks good (but integration with a storage backend would be a bit painful), and Duplicati and Restic seem worth trying.

