Comments (24)
Oooh okay, so what's happening is that to resume the large file, the checksums of the already-uploaded chunks have to match the checksums of the corresponding chunks of the current file; if they don't, you're really uploading a different file.
The reason your checksums don't match, I'm pretty sure, is that the snapshot you're uploading is encrypted, and the encryption of course results in a different data stream each time.
This library is designed so that clients who receive errors will, by default, simply retry forever; the only permanent failure should be if the process itself dies. @someone1, do you have examples of the kinds of failures that led you to set w.Resume = true?
I need to think about whether there's a better way to handle non-matching resumes. I think it might be a good idea to add an option, when there's a conflict, to remove the partially uploaded file and start from scratch.
from blazer.
This feels like an issue in blazer, but let me see if I can get a failing test case. This code is possibly more complicated than it ought to be.
from blazer.
I suspect you've surfaced a bug where if Resume is true but there's nothing to resume it just blows up.
from blazer.
Thanks for catching this! Try with the latest commit, you should be good.
from blazer.
Apologies for the tardy chiming in! I added that bit since retries after a network failure would leave incomplete large file uploads open. When I made the change, I thought blazer would know to start a new file if there was nothing to resume (I may have just misread the code at the time!)
Anyway, thanks for all the debugging @lhoward and for the quick fix @kurin !
from blazer.
This fix unfortunately gets it stuck in an endless loop with:
2019/01/26 15:48:19 b2: Opening volume tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Entering tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Uploading tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
2019/01/26 15:48:19 b2: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
from blazer.
Does this happen with any invocation? I have a zfs box around here somewhere I can try to test this on, but if you can reproduce it with the environment variable B2_LOG_LEVEL=2, you'll get a lot of debug lines that you could add to the bug (be sure to purge it of secret data).
from blazer.
I wasn't exactly sure what data to consider secret, so for now I'll just paste in a summary; happy to send you the complete details privately if you mail me (lukeh at padl.com).
- b2_list_unfinished_large_files (returns a fileId, nextFileId == null)
- b2_list_parts (returns a bunch)
- b2_get_upload_part_url (returns upload URLs for a bunch of fileIDs)
The key appears to be this ("chunks don't match"):
2019/01/26 18:21:44 error writing tank/Git|20190126.zstream.gz.pgp.vol1: resumable upload was requested, but chunks don't match
2019/01/26 18:21:44 << b2_get_upload_part_url (14) 200 {Cache-Control: max-age=0, no-cache, no-store; Content-Type: application/json;charset=utf-8; Content-Length: 376; Date: Sat, 26 Jan 2019 07:31:33 GMT} ({
"authorizationToken": "[redacted]",
"fileId": "some-fileid-redacted",
"uploadUrl": "https://pod-000-1118-16.backblaze.com/b2api/v1/b2_upload_part/some-fileid-redacted/0055"
}
)
2019/01/26 18:21:44 >> b2_finish_large_file uri: https://api002.backblazeb2.com/b2api/v1/b2_finish_large_file headers: {X-Blazer-Request-Id: 17;User-Agent: blazer/0.5.1} args: ({"fileId":"some-fileid-redacted","partSha1Array":["67a935b099a061116bed27ba1ea8403fc3fe50e4","71dc3437d1dfa17cf36e1da1d9373fc875bd37ce","b5c1c32096584dce8837ee26c0d371d6a69b6bd5","826363c88715be8e4aa2190c91e0c0bfed4b7941","a36bdad6cc27bfa4e26a69a3501ad409a3e999f2","780c49397e0f2d64d7012eeeffe293308a2376c1","a6a79531a551b1ca2f0e9c0287da78d2fd456ed6","cb8ae6c70cd4f7b7e7593aa6c011c966a6224c69","fb3ddd9a9e2d32bb8c676025f43e3c8133a9fc5e","53ce7c6a8f6da7a9d6fdefdac4e95066b38a8cea"]})
2019/01/26 18:21:44 b2 backend: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
2019/01/26 18:21:44 b2: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
from blazer.
So setting resume to false should fix it? There is a resume command-line option to zfsbackup, BTW, but I'm not setting that. And yes, I am encrypting.
from blazer.
Is this resuming within or across a zfsbackup session, I wonder? In the latter case it seems like it should be disabled if encryption is on (and even if it's off, unless the volumes are cached locally, an unencrypted backup may possibly differ?). Just speculating.
from blazer.
I'm not familiar with zfsbackup-go, but it sounds to me like it's starting the upload from the beginning with w.Resume = true, which works when the stream is the same, but fails when the stream is encrypted, because the same plaintext becomes different ciphertext.
I'm wondering what's happening to cause zfsbackup-go to restart from the beginning, however, because that shouldn't be necessary. If the backup is interrupted (e.g. if the network drops), the client should just hang until service is restored.
from blazer.
If, in zfsbackup-go's B2Backend.Upload(), I set w.Resume to false, the backup completes. @someone1, are there any risks in making this change?
from blazer.
@kurin - I think it was a mistake to add Resume=true as the default option in zfsbackup-go, precisely for the reason you described. I had a user complain that they saw multiple versions of files in their B2 buckets, where some were not accessible, and assumed this was due to incomplete large file uploads. I don't actually see a way in blazer to cancel a large file upload, and my goal was to just resume or start a new upload.
I think it would make sense to provide the following in blazer:
- If a large file upload fails for whatever reason and an option such as AbortOnFailure=true is set, then blazer should automatically cancel the large file upload.
- Add an option, as you described, to clear out any partially uploaded files when starting an upload (this should obviously be mutually exclusive to the Resume option).
I'd love to have my code better tested against B2. The bonfire package you have seems interesting in this regard: can it be used to set up a local B2 emulator to write tests against? If so, mind sharing a quick example of how?
from blazer.
I don't believe B2 has a way to automatically cancel large files.
The only permanent failure for the writer should be a cancelled context. Any other failure ought to be recoverable, and so wouldn't benefit from an abort-on-failure option, and when the context is cancelled we can't use it to clean up after ourselves.
It's possible to list unfinished files (with the b2.ListUnfinished() option) and then delete them. Thinking about it, I don't think I want to add an option that lets callers set Resume but also removes the existing large file on chunk mismatch. When a caller requests Resume, I think it's better for it to fail permanently than to silently remove data.
Bonfire is an experimental reimplementation of the B2 server API that I haven't had time to work on in a while.
from blazer.
B2 does not automatically cancel large files, but it does have a b2_cancel_large_file call, which is not exposed via blazer. The Writer interface already handles chunking a file, initiating a large file upload, and completing it for you; can it be updated to also cancel the upload if any failures occur (or expose a call that does)? In addition to a cancelled context, wouldn't a permanent network failure or internal server errors from Backblaze eventually fail any sort of retry/backoff attempts blazer makes, or does it operate in such a way that it will NEVER return an error? I know the GCS/S3 Go SDKs will eventually fail outside of a cancelled context. The automatic cleanup would be great for situations where the power is pulled and large files are left behind, but that's a Backblaze feature, not blazer's (S3 uses lifecycle rules for this).
The AWS SDK for Go's s3manager will start a multipart upload and cancel it should it exhaust retries (network failures, cancelled context, etc.). This is helpful as a high-level API where the details surrounding running and managing a multipart upload are hidden from the developer. I thought it'd make sense as an option to the blazer Writer.
The options I was suggesting are (naming aside):
- CancelOnError - this would simply call b2_cancel_large_file if the Writer errors out
- ClearUnfinished - this would check if an unfinished upload exists for this object before uploading (similar to Resume), and if so, cancel/delete it before uploading. This option is mutually exclusive to Resume (you can't have both).
If you think this is better left outside the Writer interface, I understand - I like the CancelOnError feature the S3 SDK provides, but I can work around it. The second option was suggested since Backblaze doesn't have a way to automatically clear unfinished uploads after a time period. Is the proper way to cancel an unfinished large upload to b2, without knowledge of the large file session id, to list the unfinished object(s) and call Delete on them?
from blazer.
The B2 integration checklist specifies that some errors (including network errors) should be handled with exponentially backed-off retries. So in that sense blazer will retry forever even given network failures until a context is cancelled or the process exits.
It turns out some blazer clients would actually prefer just to receive and handle errors themselves, and I've kicked around some ideas for how to implement that (#55), but nothing's been done. It probably makes sense to add the ability to say "if NO progress has been made in some amount of time, then kill the upload".
At the very least I should expose the b2_cancel_large_file request. I can't think of a way to gracefully handle the dead-context situation; it might be best to make an unexported cancel that doesn't rely on a context at all.
from blazer.
In the interim, backups with w.Resume = false are working for me. Thank you for the detailed analysis above; I'm certainly learning something!
from blazer.
I agree there should be some way to call b2_cancel_large_file - looking forward to an implementation!
As for error handling/retries: given the B2 integration checklist, I think the current behavior is to spec and therefore fine (assuming it errors out for 400-level error codes other than 408; it seems like everything else should be retried). I can cancel requests using the Context I pass in.
One thing might be worth adjusting, though: instead of using time.Sleep here, to be more context-sensitive it should probably be:

```go
select {
case <-w.ctx.Done():
	err = w.ctx.Err()
case <-time.After(sleep):
	// Remainder of code from 169-182
}
```
from blazer.
Yeah that's probably a good idea.
from blazer.
So I've added a thing for this:
https://godoc.org/github.com/kurin/blazer/b2#WithCancelOnError
I'm not super happy about the interface, though. The idea with ctxf is that you probably want a new context if the old one is cancelled, probably with a short deadline on it so that your writer doesn't hang forever even after you cancel the writer's context. Play around with this and see how it fits your needs.
from blazer.
Sorry to come back to this issue so late - I'm finally updating zfsbackup-go, and while updating my backend code all around, I noticed how the Google Cloud Storage library handles all sorts of situations using a cancelled context, and I wonder if it would be a good fit here: https://pkg.go.dev/cloud.google.com/go/storage?tab=doc#Writer.CloseWithError (read the deprecation notice)
Basically, instead of using an option or anything of the sort, if the context.Context passed to NewWriter is cancelled, then the upload would cancel. In the case of b2 it would call b2_cancel_large_file if it was a large file upload, but even if it wasn't uploading a large file and was partway through uploading a small file, it would cancel it so the object is never created.
I'm using the new option you added, cancelling the context manually on error, and just opting not to call Close if I encounter any issues. If this isn't how I should be using the new options/client, please let me know!
from blazer.
Automatically calling b2_cancel_large_file is fraught if, for example, the user doesn't want the large file cancelled, which is a reasonable thing to expect.
You should always call Close, irrespective of whether you've cancelled the context or otherwise encountered an error.
from blazer.
Hmmm - the problem I'm experiencing is that if I start an upload and there's an error while writing it to b2, calling Close will save the partially written file; my preference is for the file not to be written in this situation.
For the GCS client, I just cancel the context.Context I passed into NewWriter and the object is never written to the bucket. Internally, the library listens for the context to be cancelled and aborts the upload, cleaning up as required.
My test is somewhat contrived, but I basically try to io.Copy into the Writer from a source that will always error on Read. Instead of calling w.Close() in the error-handling code as recommended, I just return, cancelling the context.Context, which seems to prevent the file from being written to b2. However, I can't tell whether blazer is expecting this type of interaction and will abort the write as required if Close is never called.
Maybe there's an edge case here, since io.Copy will never have started the upload, so when I call Close, blazer calls init and creates an empty file in b2?
from blazer.
I'd need to think about whether we can infer from a cancelled context whether the user does or does not want to clean up the underlying file. On the one hand it would suck to get partway through a huge file just to have a ctrl-c or something undo all your work, but on the other hand as you say we don't want to leave partials lying around.
Today, if you cancel the context and call Close(), the file should be left in an unwritten state; it can't complete simpleWriteFile() or finishLargeFile() with a cancelled context. So you will have a partially written large file, if you had completed any chunks, but you shouldn't have a "completed" large or small file. Are you seeing this behavior instead?
from blazer.