EOF issue with zfsbackup-go (blazer) - open, 24 comments

kurin commented on July 16, 2024
EOF issue with zfsbackup-go


Comments (24)

kurin commented on July 16, 2024

Oooh okay, so what's happening is that to resume the large file, the checksums of the already-uploaded chunks have to match the checksums of the corresponding chunks of the current file; if they don't, you're really uploading a different file.

The reason your checksums don't match, I'm pretty sure, is because the snapshot you're uploading is encrypted, and the encryption is of course resulting in a different data stream each time.

This library is designed so that clients who receive errors will by default simply retry forever; the only permanent failure should be if the process itself dies. @someone1 do you have examples of the kinds of failures that led you to set w.Resume = true?

I need to think about whether there's a better way to handle non-matching resumes. I think it might be a good idea to add an option, when there's a conflict, to remove the partially uploaded file and start from scratch.
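
For illustration, here is a rough sketch of the kind of per-chunk comparison a resume has to pass; this is not blazer's actual code, and chunkSize and uploadedSHA1s merely stand in for the part size and the checksums returned by b2_list_parts:

package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// canResume reports whether the first len(uploadedSHA1s) chunks of r hash to
// the same values as the parts already stored in B2. With an encrypted
// stream the ciphertext differs on every run, so this check fails even
// though the underlying snapshot is identical.
func canResume(r io.Reader, chunkSize int64, uploadedSHA1s []string) (bool, error) {
	for _, want := range uploadedSHA1s {
		h := sha1.New()
		if _, err := io.CopyN(h, r, chunkSize); err != nil && err != io.EOF {
			return false, err
		}
		if hex.EncodeToString(h.Sum(nil)) != want {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	f, err := os.Open("snapshot.zstream") // hypothetical local copy of the stream
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// The SHA1 below is one of the part checksums from the log later in this
	// thread, used purely for illustration.
	ok, err := canResume(f, 100<<20, []string{"67a935b099a061116bed27ba1ea8403fc3fe50e4"})
	fmt.Println(ok, err)
}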

kurin commented on July 16, 2024

This feels like an issue in blazer, but let me see if I can get a failing test case. This code is possibly more complicated than it ought to be 🙄

kurin commented on July 16, 2024

I suspect you've surfaced a bug where if Resume is true but there's nothing to resume it just blows up.

kurin commented on July 16, 2024

Thanks for catching this! Try with the latest commit; you should be good.

someone1 commented on July 16, 2024

Apologies for the tardy chiming in - I added that bit since retries after a network failure would result in incomplete, still-open large file uploads. When I made the change I thought blazer would know to start a new file if there was nothing to resume (I may have just read the code wrong at the time!).

Anyway, thanks for all the debugging @lhoward and for the quick fix @kurin !

lhoward commented on July 16, 2024

This fix unfortunately gets it stuck in an endless loop with

2019/01/26 15:48:19 b2: Opening volume tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Entering tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Uploading tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
2019/01/26 15:48:19 b2: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled

kurin commented on July 16, 2024

Does this happen with any invocation? I have a zfs box around here somewhere I can try to test this on, but if you can reproduce it with the environment variable B2_LOG_LEVEL=2 you'll get a lot of debug lines that you could add to the bug (be sure to purge them of secret data).

lhoward commented on July 16, 2024

I wasn't exactly sure what data to consider secret, so for now I'll just paste in a summary - happy to send you the complete details privately if you mail me (lukeh at padl.com).

  • b2_list_unfinished_large_files (returns a fileId, nextFileId == null)
  • b2_list_parts (returns a bunch)
  • b2_get_upload_part_url (returns upload URLs for a bunch of fileIDs)

The key appears to be this ("chunks don't match"):

2019/01/26 18:21:44 error writing tank/Git|20190126.zstream.gz.pgp.vol1: resumable upload was requested, but chunks don't match
2019/01/26 18:21:44 << b2_get_upload_part_url (14) 200  {Cache-Control: max-age=0, no-cache, no-store; Content-Type: application/json;charset=utf-8; Content-Length: 376; Date: Sat, 26 Jan 2019 07:31:33 GMT} ({
  "authorizationToken": "[redacted]",
  "fileId": "some-fileid-redacted",
  "uploadUrl": "https://pod-000-1118-16.backblaze.com/b2api/v1/b2_upload_part/some-fileid-redacted/0055"
}
)
2019/01/26 18:21:44 >> b2_finish_large_file uri: https://api002.backblazeb2.com/b2api/v1/b2_finish_large_file headers: {X-Blazer-Request-Id: 17;User-Agent: blazer/0.5.1} args: ({"fileId":"some-fileid-redacted","partSha1Array":["67a935b099a061116bed27ba1ea8403fc3fe50e4","71dc3437d1dfa17cf36e1da1d9373fc875bd37ce","b5c1c32096584dce8837ee26c0d371d6a69b6bd5","826363c88715be8e4aa2190c91e0c0bfed4b7941","a36bdad6cc27bfa4e26a69a3501ad409a3e999f2","780c49397e0f2d64d7012eeeffe293308a2376c1","a6a79531a551b1ca2f0e9c0287da78d2fd456ed6","cb8ae6c70cd4f7b7e7593aa6c011c966a6224c69","fb3ddd9a9e2d32bb8c676025f43e3c8133a9fc5e","53ce7c6a8f6da7a9d6fdefdac4e95066b38a8cea"]})
2019/01/26 18:21:44 b2 backend: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
2019/01/26 18:21:44 b2: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled

lhoward commented on July 16, 2024

So setting resume to false should fix it? There is a resume command-line option to zfsbackup BTW but I'm not setting that. And yes, I am encrypting.

lhoward commented on July 16, 2024

Is this resuming within or across zfsbackup sessions, I wonder? In the latter case it seems like it should be disabled if encryption is on (and even if it's off, unless the volumes are cached locally, an unencrypted backup might still differ?). Just speculating.

kurin commented on July 16, 2024

I'm not familiar with zfsbackup-go but it sounds to me like it's starting the upload from the beginning with w.Resume = true, which works when the stream is the same, but fails when the stream is encrypted, because the same plaintext becomes different ciphertext.

I'm wondering what's happening to cause zfsbackup-go to restart from the beginning, however, because that shouldn't be necessary. If the backup is interrupted (e.g. if the network drops), the client should just hang until service is restored.

lhoward commented on July 16, 2024

If in zfsbackup-go's B2Backend.Upload(), I set w.Resume to false, the backup completes. @someone1 are there any risks in making this change?
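
For reference, here is a minimal sketch of what a non-resuming upload through blazer looks like with w.Resume left at its false default; the bucket and file names are placeholders, and this is not zfsbackup-go's actual backend code:

package main

import (
	"context"
	"io"
	"log"
	"os"

	"github.com/kurin/blazer/b2"
)

func upload(ctx context.Context, bucket *b2.Bucket, name string, src io.Reader) error {
	w := bucket.Object(name).NewWriter(ctx)
	// w.Resume stays false: never try to match chunks against a previous,
	// differently-encrypted stream.
	w.ConcurrentUploads = 2
	if _, err := io.Copy(w, src); err != nil {
		w.Close() // always close, even on error
		return err
	}
	return w.Close()
}

func main() {
	ctx := context.Background()
	client, err := b2.NewClient(ctx, os.Getenv("B2_ACCOUNT_ID"), os.Getenv("B2_ACCOUNT_KEY"))
	if err != nil {
		log.Fatal(err)
	}
	bucket, err := client.Bucket(ctx, "my-backups") // placeholder bucket name
	if err != nil {
		log.Fatal(err)
	}
	f, err := os.Open("volume.zstream.gz.pgp.vol1") // placeholder volume file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := upload(ctx, bucket, "tank/Git|20190126.zstream.gz.pgp.vol1", f); err != nil {
		log.Fatal(err)
	}
}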

someone1 commented on July 16, 2024

@kurin - I think it was a mistake to add Resume=true as the default option in zfsbackup-go precisely for the reason you described. I had a user complain that they saw multiple versions of files in their B2 buckets where some were not accessible and assumed this was due to incomplete large file uploads. I don't actually see a way in blazer to cancel a large file upload and my goal was to just resume or start a new upload.

I think it would make sense to provide the following in blazer:

  • If a large file upload fails for whatever reason and an option such as AbortOnFailure=true is set, then blazer should automatically cancel the large file upload.
  • Add an option as you described to clear out any partially uploaded files when starting an upload (this should obviously be mutually exclusive with the Resume option).

I'd love to have my code better tested against B2 - the bonfire package you have seems interesting in this regard - can it be used to set up a local B2 emulator to write tests against? If so, mind sharing a quick example of how?

kurin commented on July 16, 2024

I don't believe B2 has a way to automatically cancel large files.

The only permanent failure for the writer should be a cancelled context. Any other failure ought to be recoverable, and so wouldn't benefit from an abort-on-failure option, and when the context is cancelled we can't use it to clean up after ourselves.

It's possible to list unfinished files (with the b2.ListUnfinished() option) and then delete them. Thinking about it, I don't think I want to add an option that lets callers set Resume but also removes the existing large file on a chunk mismatch. When a caller requests Resume, I think it's better for it to fail permanently than to silently remove data.
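
A rough sketch of that manual cleanup, assuming the iterator-style List API; the helper and package names are made up:

package b2util // hypothetical helper package

import (
	"context"

	"github.com/kurin/blazer/b2"
)

// removeUnfinished lists the unfinished large files in the bucket and deletes
// them. This is the manual cleanup described above, not a built-in blazer option.
func removeUnfinished(ctx context.Context, bucket *b2.Bucket) error {
	it := bucket.List(ctx, b2.ListUnfinished())
	for it.Next() {
		if err := it.Object().Delete(ctx); err != nil {
			return err
		}
	}
	return it.Err()
}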

Bonfire is an experimental reimplementation of the B2 server API that I haven't had time to work on in a while.

someone1 commented on July 16, 2024

B2 does not automatically cancel large files - but it does have a b2_cancel_large_file call, which is not exposed via blazer. The Writer interface already handles chunking a file, initiating a large file upload, and completing it for you; can it be updated to also cancel the upload if any failures occur (or expose a call that does)? In addition to a canceled context, wouldn't a permanent network failure or internal server errors from Backblaze eventually exhaust whatever retry/backoff attempts blazer makes, or does it operate in such a way that it will NEVER return an error? I know the GCS/S3 Go SDKs will eventually fail outside of a canceled context. Automatic cleanup would be great for situations where the power is pulled and large files are left behind, but that's a Backblaze feature, not blazer (S3 uses lifecycle rules for this).

The AWS SDK for Go's s3manager will start a multipart upload and cancel it should it exhaust its retries (network failures, cancelled context, etc.). This is helpful as a high-level API where the details of running and managing a multipart upload are hidden from the developer. I thought it'd make sense as an option to the blazer Writer.

The options I was suggesting are (naming aside):

  • CancelOnError - this would simply call b2_cancel_large_file if the Writer errors out
  • ClearUnfinished - this would check whether an unfinished upload exists for this object before uploading (similar to Resume), and if so, cancel/delete it before uploading. This option is mutually exclusive with Resume (you can't have both)

If you think this is better left outside the Writer interface, I understand - I like a CancelOnError feature like the one the S3 SDK provides, but I can work around it. The second option was suggested since Backblaze doesn't have a way to automatically clear unfinished uploads after a time period. Is the proper way to cancel an unfinished large upload to B2, without knowledge of the large-file session ID, to list the unfinished object(s) and call Delete on them?

kurin commented on July 16, 2024

The B2 integration checklist specifies that some errors (including network errors) should be handled with exponentially backed-off retries. So in that sense blazer will retry forever even given network failures until a context is cancelled or the process exits.
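
The shape of that retry loop is roughly the following; this is a generic sketch of the checklist behavior, not blazer's actual code:

package retry // illustrative package name

import (
	"context"
	"time"
)

// do retries op with exponentially backed-off sleeps until it succeeds or the
// context is cancelled; the caller only ever sees an error in the latter case.
func do(ctx context.Context, op func() error) error {
	backoff := time.Second
	for {
		if err := op(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 10*time.Minute {
			backoff *= 2
		}
	}
}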

It turns out some blazer clients would actually prefer just to receive and handle errors themselves, and I've kicked around some ideas for how to implement that (#55) but nothing's been done. It probably makes sense to add the ability to say "if NO progress has been made in some amount of time, then kill the upload".

At the very least I should expose the b2_cancel_large_file request. I can't think of a way to gracefully handle the dead-context situation; it might be best to make an unexported cancel that doesn't rely on a context at all.

lhoward commented on July 16, 2024

In the interim, backups with w.Resume = false are working for me. Thank you for the detailed analysis above; I'm certainly learning something!

someone1 commented on July 16, 2024

I agree there should be some way to call b2_cancel_large_file - looking forward to an implementation!

As for error handling/retries - given the B2 integration checklist, I think the current behavior is to spec and therefore fine (assuming it errors out for any 400-level error code other than 408; it seems like everything else should be retried). I can cancel requests using the Context I pass in.

One thing might be worth adjusting, though: instead of using time.Sleep here, to be more context-sensitive it should probably be:

select {
case <-w.ctx.Done():
	err = w.ctx.Err()
case <-time.After(sleep):
	// remainder of the code from lines 169-182
}

kurin commented on July 16, 2024

Yeah that's probably a good idea.

kurin commented on July 16, 2024

So I've added a thing for this:

https://godoc.org/github.com/kurin/blazer/b2#WithCancelOnError

I'm not super happy about the interface though. The idea with ctxf is that you probably want a new context if the old one is cancelled, and probably with a short deadline on it so that your writer doesn't hang forever even after you cancel the writer's context. Play around with this and see how it fits your needs.
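
If it helps, a ctxf along the lines described above might look like this; this is only a sketch of the idea, and the exact WithCancelOnError signature should be taken from the godoc linked above:

package b2util // hypothetical helper package

import (
	"context"
	"time"
)

// cleanupCtx is a context factory of the kind described above: each call
// returns a fresh context with a short deadline, independent of the writer's
// (possibly already-cancelled) context, so the cancellation request itself
// cannot hang forever.
func cleanupCtx() context.Context {
	// The CancelFunc is deliberately discarded; the deadline alone bounds
	// the context's lifetime.
	ctx, _ := context.WithTimeout(context.Background(), time.Minute)
	return ctx
}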

someone1 commented on July 16, 2024

Sorry to come back to this issue so late - I'm finally updating zfsbackup-go, and while reworking my backend code I noticed how the Google Cloud Storage client handles all sorts of situations using a cancelled context, and I wonder if it would be a good fit here: https://pkg.go.dev/cloud.google.com/go/storage?tab=doc#Writer.CloseWithError (read the deprecation notice)

Basically, instead of using an option or anything of the sort, if the context.Context passed to NewWriter is cancelled, then the upload would cancel. In the case of B2 it would call b2_cancel_large_file if it was a large file upload; but even if it wasn't uploading a large file and was only partway through uploading a small file, it would cancel it so the object is never created.

I'm using the new option you added, cancelling the context manually on error, and just opting not to call Close if I encounter any issues. If this isn't how I should be using the new options/client, please let me know!

kurin commented on July 16, 2024

Automatically calling b2_cancel_large_file is fraught if, for example, the user doesn't want the large file cancelled, which is a reasonable thing to expect.

You should always call Close irrespective of whether you've cancelled the context or otherwise encountered an error.

someone1 commented on July 16, 2024

Hmmm - the problem I'm experiencing is that if I start an upload and there's an error while writing it to B2, calling Close will save the partially written file; my preference is for the file not to be written in this situation.

For the GCS client, I just cancel the context.Context I passed into NewWriter and the object is never written to the bucket. Internally, the library is listening for the context to be cancelled and aborts the upload, cleaning up as required.

My test is somewhat contrived, but I basically try to io.Copy into the Writer from a source that will always error on Read. Instead of calling w.Close() in the error-handling code as recommended, I just return, cancelling the context.Context, which seems to prevent the file from being written to B2. However, I can't tell whether blazer expects this type of interaction and will abort the write as required if Close is never called.

Maybe there's an edge case here: since io.Copy will never have started the upload, when I call Close, blazer calls init and creates an empty file in B2?
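
For concreteness, the contrived test looks roughly like this; the object name, helper names, and package name are made up:

package backendtest // hypothetical test helper

import (
	"context"
	"errors"
	"io"

	"github.com/kurin/blazer/b2"
)

// errReader always fails, standing in for a source that dies mid-backup.
type errReader struct{}

func (errReader) Read(p []byte) (int, error) { return 0, errors.New("read failed") }

func tryUpload(bucket *b2.Bucket) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // cancelling on return is what (hopefully) aborts the write
	w := bucket.Object("contrived-test-object").NewWriter(ctx)
	if _, err := io.Copy(w, errReader{}); err != nil {
		// The question above: return without calling w.Close() and rely on
		// the cancelled context to keep the object from being created in B2.
		return err
	}
	return w.Close()
}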

kurin commented on July 16, 2024

I'd need to think about whether we can infer from a cancelled context whether the user does or does not want to clean up the underlying file. On the one hand it would suck to get partway through a huge file just to have a ctrl-c or something undo all your work, but on the other hand as you say we don't want to leave partials lying around.

Today, if you cancel the context and call Close(), the file should be left in an unwritten state; it can't complete simpleWriteFile() or finishLargeFile() with a cancelled context. So you will have a partially written large file, if you had completed any chunks, but you shouldn't have a "completed" large or small file. Are you seeing this behavior instead?
