
Comments (21)

aggieNick02 commented on June 2, 2024

Reproducing seems somewhat dependent on size; in limited testing, bigger files are more likely to hit it. I added LFS_DEBUG_HTTP=1 and GIT_CURL_VERBOSE=1 and can now see details about the failure from the client side. File attached. There are lots of 500 Internal Server Errors, so the more I dig, the more this looks like an lfs-test-server issue.
lfs-test-server_failure.txt
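
(For reference, that kind of client-side trace can be captured by setting the variables on a single push, roughly like the line below; the remote, branch, and log file names are just placeholders.)

LFS_DEBUG_HTTP=1 GIT_CURL_VERBOSE=1 git push origin master 2> push-trace.txt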

bk2204 commented on June 2, 2024

Yeah, as you've noticed, the "Fatal error: Server error:" response indicates a 500 error. I'm not sure why the test server is producing that in this case, but it does seem to be an issue there. I'm afraid the test server doesn't get a huge amount of attention; whether that's because it generally works or because nobody's using it, I'm not sure.

I'm going to transfer this issue over to that repository.

aggieNick02 commented on June 2, 2024

Great, thanks for moving it for me and for the info. I'll be digging into lfs-test-server. I think there is probably significant, underestimated demand for better versioning of large files, and lfs-test-server running locally is an attractive and obvious answer. Granted, if it doesn't actually work, I guess it's not such a great answer ;-)

bk2204 commented on June 2, 2024

I think the most important thing to look into is what error message you're getting back from the server. That will probably tell us a lot about what problem is occurring server side.

It may be helpful to try GIT_TRANSFER_TRACE=1 if you haven't already. That's turned on automatically by GIT_CURL_VERBOSE in newer versions (I think 2.6.0 and newer), but not in older versions, which don't know about GIT_CURL_VERBOSE.

aggieNick02 commented on June 2, 2024

So in lfs-test-server_failure.txt, the gist is that the PUT for the giant file is processed: progress is shown and the full file size is reached, but then the server responds with a 500 instead of the usual 200. The client then repeatedly tries again, and each retry fails immediately until the client gives up.
Time to learn the minimum Go required to debug this and dig into lfs-test-server. Any tips for getting debug output from the test server?

bk2204 commented on June 2, 2024

I believe you'd probably want to instrument https://github.com/git-lfs/lfs-test-server/blob/master/server.go#L357 and have it print the error message it's producing to standard error. That would look like so (you'll need to import os at the top):

fmt.Fprintf(os.Stderr, `{"message":"%s"}`, err)

If you can determine what that error message is, that would tell you where the problem is occurring.
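
If it helps, here's a rough, self-contained sketch of the shape of that instrumentation in a handler's error path. The putHandler and storeObject below are illustrative stand-ins, not code copied from server.go:

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	http.HandleFunc("/", putHandler)
	http.ListenAndServe(":8080", nil)
}

// putHandler is a stand-in for the real PUT handler; the interesting part is
// the error path, which logs to stderr before sending the 500 response.
func putHandler(w http.ResponseWriter, r *http.Request) {
	if err := storeObject(r); err != nil {
		fmt.Fprintf(os.Stderr, `{"message":"%s"}`+"\n", err) // server-side trace
		w.WriteHeader(http.StatusInternalServerError)
		fmt.Fprintf(w, `{"message":"%s"}`, err) // what the client sees
		return
	}
	w.WriteHeader(http.StatusOK)
}

// storeObject stands in for the content-store write that can fail.
func storeObject(r *http.Request) error { return nil }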

aggieNick02 commented on June 2, 2024

Great, thanks. An interesting tidbit: the server prints out each request and its status, and it thinks all of them are succeeding. There are a lot of /objects/batch requests bunched up after the PUT, but the server-side logs show status=200 for all of them. I'll see what I can get with the instrumentation you suggested.
Also, on a couple of wild hunches, I tried both lfs.concurrenttransfers=1 and lfs.basictransfersonly=1, but there was no change in results.
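
For the record, those hunches were set on the client with plain git config, i.e. something like:

git config lfs.concurrenttransfers 1
git config lfs.basictransfersonly true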

aggieNick02 commented on June 2, 2024

So the PutHandler is exiting normally, which means the reason for the failed connection is somewhere else. I'll keep digging. I'm also on a super old version of Go (18.04 LTS by default gets you 1.6.2), so I'll probably update that to the latest for good measure first.

bk2204 commented on June 2, 2024

I don't know if you intended to write "18.04" or "16.04", but 18.04 does have the golang-1.10 package, which should be sufficient to build and test the LFS server. It should also be present in xenial-backports, if you're using 16.04.
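
If it helps, pulling that from backports should just be the following (assuming backports are already enabled in your apt sources):

sudo apt-get install -t xenial-backports golang-1.10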

aggieNick02 commented on June 2, 2024

Oops, sorry about that. 16.04. I'll grab the backport.

aggieNick02 commented on June 2, 2024

Getting closer. When an LFS push is normal and succeeds on the first try, the sequence of events is:

Client: Batch Operation Request to get URL for uploading file
Server: Batch Operation Response with upload URL
Client: PUT Request to upload URL
Server: PUT Response

When the LFS push is for a big file that fails, the sequence is:

Client: Batch Operation Request to get URL for uploading file
Server: Batch Operation Response with upload URL
Client: PUT Request to upload URL
Client: Batch Operation Request to get URL for uploading file
Server: Batch Operation Response with upload URL
Client: PUT Request to upload URL (which fails)

So for some reason, git-lfs issues multiple BatchOperation requests to upload the file if the upload of the file takes "too long", and repeatedly tries to upload it again and fails. When the second push succeeds, the client/server messaging is the same until the second batch operation request to get URL for uploading file; at that point, the response actually contains a download URL and no upload URL, so a second PUT does not occur.
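
For anyone following along, the batch traffic above follows the Git LFS Batch API. Trimmed to the relevant fields, and with a made-up oid, size, and host, a request/response pair looks roughly like this:

POST http://lfs.example.com/objects/batch
{
  "operation": "upload",
  "transfers": ["basic"],
  "objects": [
    {"oid": "e3b0c44298fc...", "size": 9000000000}
  ]
}

Response when the server does not yet have the object (so the client should PUT it):
{
  "transfer": "basic",
  "objects": [
    {
      "oid": "e3b0c44298fc...",
      "size": 9000000000,
      "actions": {
        "upload": {"href": "http://lfs.example.com/objects/e3b0c44298fc..."}
      }
    }
  ]
}

When the server already has the object, that upload action is absent (here the test server returned a download URL instead), which is why the second push skips the PUT.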

I'll keep digging, but even if this is an lfs-test-server issue, it seems bizarre that the git-lfs client issues all of these BatchOperation requests to try to upload the file again BEFORE the response for the initial PUT is even received.

bk2204 commented on June 2, 2024

When an upload fails, it can be for a number of reasons, one of which is that the authentication credentials expired. When we retry a request, we issue a new batch request, which will provide us with a new set of credentials if required. (Even if the reason wasn't that the credentials expired, they might expire soon, and we'd want fresh ones to maximize our success potential.)

We do this asynchronously (in a goroutine), so there's probably some reason that we're finding the initial PUT request is failing, perhaps a timeout of some sort.

aggieNick02 commented on June 2, 2024

Got it, thanks. Turning the retries down to 1 with lfs.transfer.maxretries=1 eliminated a lot of the noise and makes things a bit clearer. Just like you said, the PUT request is failing for some reason. The lfs-test-server's logging says it succeeded on its side, but the client shows no response to the initial PUT request (the 500 errors are for the subsequent PUT attempts).
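
(For reference, that retry cap is a plain client-side setting: git config lfs.transfer.maxretries 1.)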

Whatever the failure is, it happens at the very end. When the second push succeeds, it is quite wonky: there is no response to the PUT listed on the client side, but the subsequent batch operation doesn't result in another PUT attempt, so there is no error message. I think I'm going to put Wireshark between the two and try to get a clearer picture of who is in the wrong.

aggieNick02 commented on June 2, 2024

Got wireshark going and understand the problem now.

At about 90 seconds, the last bytes of the file are sent from the client. At 120 seconds, the BatchRequest is sent, and at 150 seconds, the response to the PUT is sent, but the client is no longer listening for it.

There is a delay between when the last file bytes are sent and when the lfs-test-server has fully processed them, placed the file on disk, and sent a response. For our lfs-test-server, the LFS storage is on a spinning-disk NAS. When all is said and done, this delay is about 60 seconds for our 9 GB file.

That is longer than the default lfs.activitytimeout of 30 seconds. Bumping it to 120 resolves the problem for the 9 GB file; setting it to 0 (unlimited) would resolve it for any size.
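
Concretely, the workaround on the client is just (repo-local here; --global also works):

git config lfs.activitytimeout 120

or, to disable the activity timeout entirely:

git config lfs.activitytimeout 0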

On Monday I'll look into whether there is anything simple that could be done in lfs-test-server to avoid the long stretch where no network traffic is present.

bk2204 commented on June 2, 2024

Yeah, that makes a lot of sense. We're probably timing out due to the close, which we need to be sure to do so that it's on disk properly. I don't know of anything that we could do differently, but if you think of anything, I'm definitely open to suggestions.

aggieNick02 commented on June 2, 2024

Was typing up my findings when you responded with what I would find. :-)

Spent a bit of time investigating this, and I'm not sure there is a better solution than turning the timeout up or setting it to 0. The big hang with no network activity between the LFS client and server is indeed on the file close. The close has to happen, and there's no easy way around it. I could artificially throttle the file write (knowing that it is going to a NAS), which would shorten the close, but there's no generic way to do that that would make sense for lfs-test-server users. Maybe there are things I could tweak on my NAS to slow uploads down to a rate it can keep up with.
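
For what it's worth, the throttling idea would amount to wrapping the content-store file in a rate-limited writer, something like the sketch below (generic Go, not lfs-test-server code; the cap would have to be tuned to the NAS, which is exactly why it doesn't generalize):

package main

import (
	"io"
	"os"
	"strings"
	"time"
)

// throttledWriter caps the write rate so a slow backing store (a NAS, say)
// isn't left with a large flush to finish when the file is closed.
type throttledWriter struct {
	w           io.Writer
	bytesPerSec int
}

func (t *throttledWriter) Write(p []byte) (int, error) {
	n, err := t.w.Write(p)
	if n > 0 && t.bytesPerSec > 0 {
		// sleep long enough that the average rate stays at or below the cap
		time.Sleep(time.Duration(n) * time.Second / time.Duration(t.bytesPerSec))
	}
	return n, err
}

func main() {
	// toy usage: copy a small string through the throttle at 16 bytes/second
	tw := &throttledWriter{w: os.Stdout, bytesPerSec: 16}
	io.Copy(tw, strings.NewReader("throttled write example\n"))
}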

I think there's enough here to close this as not a bug. For our use case, I'll think about whether a NAS for the content store makes sense.

Thanks @bk2204 for all the help troubleshooting and debugging this. Really appreciated. :-)

bk2204 commented on June 2, 2024

You're very welcome. I'm glad we got to the bottom of your problem, even if it wasn't as easily solved as I'd hoped.

aggieNick02 commented on June 2, 2024

I'm sure a lot of that had to do with my newness to both go and the lfs-test-server more than anything else. Now I'm better prepared. :-)
