Comments (6)

neilalexander avatar neilalexander commented on May 20, 2024

I am curious whether the backpressure queue changes would have potentially resolved this in part - once a connection starts to block, I would expect that the queue would grow rather quickly, triggering a fairly fast route-around onto another link?

(Once the peer connection has timed out I assume that the queue is discarded rather than drained into another queue? I don't know if we might want to examine that.)

from yggdrasil-go.

Arceliar avatar Arceliar commented on May 20, 2024

Regarding your second point: as of the last major set of backpressure changes, there's no per-peer queue to drain, other than whatever the OS manages as part of the TCP buffer. Per-destination queues are kept in the switch, and are fed to peer connections as those connections become available.

And you're right about the intended behavior. What should happen is that a net.Conn.Write() call blocks, so that connection never informs the switch that it's idle and ready for more traffic, and the switch should then route packets to another interface. What I suspect really happens is this:

  1. A stream of (TCP) traffic on some path has a link die. Let's say the path is A->B->C->D and link B->C dies.
  2. The stream keeps sending, and fills up the buffer (or window? I'm not sure which) for B->C, which causes the last send to block.
  3. Because B->C is blocked, the switch directs traffic to follow B->E instead, and we end up with something like A->B->E->D.
  4. D tries to send (TCP) acknowledgements back, which tries to take D->C->B->A
  5. The C->B side of the link isn't blocked yet, and acks are small, so they don't fill up the buffer quickly.
  6. Since the acks are stuck, A assumes D hasn't gotten anything, and throttles down hard.
  7. Eventually, B->C times out and gets disconnected, so TCP buffers can no longer eat packets.
  8. At some point after that, when the traffic stream's exponential backoff has worn off, A finally tries to send to D again and the normal flow resumes for the path A->B->E->D, with acks successfully coming back via D->E->B->A.

More generally, any stream of traffic which fails to fill TCP buffers quickly enough, and backs off in response to packet loss, will stall. Something like VoIP may freeze until enough traffic has been sent to fill TCP buffers in both directions.

This needs testing to confirm if it's really the problem, but I'm traveling right now, so I don't have equipment on hand to try that myself. If that really is the problem, then I'm not sure how I want to work around it. The easy fix would be to add some kind of application-level ack to ygg, which we'd need anyway if we ever redo the UDP layer, and then do something if we don't get acks back from a peer. Maybe the connection doesn't tell the switch that it's idle if it hasn't received any traffic at all recently. We unfortunately can't just check if the TCP stream has gotten an ack, as far as I can tell.
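
For reference, the intended route-around behavior from steps 2 and 3 (a blocked Write never reports idle, so the switch hands traffic to another peer) can be sketched in Go roughly as follows. This is a minimal sketch under stated assumptions, not the actual yggdrasil-go switch: the `peer`, `newPeer`, `route`, and `stalledWriter` names are hypothetical, and it only models the forward direction, not the stuck acks from steps 4 to 6.

```go
package main

import (
	"fmt"
	"io"
)

// stalledWriter stands in for a net.Conn whose Write blocks because the
// TCP send buffer is full (link B->C in the example above).
type stalledWriter struct{ release chan struct{} }

func (w stalledWriter) Write(p []byte) (int, error) {
	<-w.release // blocks until the demo tears the link down
	return 0, io.ErrClosedPipe
}

// peer is a hypothetical stand-in for one peer connection.
type peer struct {
	name string
	idle chan struct{} // holds a token while the peer is ready for a packet
	in   chan []byte
}

func newPeer(name string, w io.Writer) *peer {
	p := &peer{name: name, idle: make(chan struct{}, 1), in: make(chan []byte)}
	p.idle <- struct{}{} // a fresh connection starts out idle
	go func() {
		for pkt := range p.in {
			if _, err := w.Write(pkt); err != nil {
				return // a failed link never reports idle again
			}
			p.idle <- struct{}{} // only reached once Write has returned
		}
	}()
	return p
}

// route hands pkt to preferred if it is idle right now, and otherwise to
// whichever peer reports idle first. It returns the chosen peer's name.
func route(pkt []byte, preferred, fallback *peer) string {
	select {
	case <-preferred.idle:
		preferred.in <- pkt
		return preferred.name
	default:
	}
	select {
	case <-preferred.idle:
		preferred.in <- pkt
		return preferred.name
	case <-fallback.idle:
		fallback.in <- pkt
		return fallback.name
	}
}

// demo routes five packets at node B and records which link each one took.
func demo() []string {
	release := make(chan struct{})
	c := newPeer("B->C", stalledWriter{release})
	e := newPeer("B->E", io.Discard)
	var path []string
	for i := 0; i < 5; i++ {
		path = append(path, route([]byte("data"), c, e))
	}
	close(release) // let the stalled writer goroutine exit
	close(c.in)
	close(e.in)
	return path
}

func main() {
	fmt.Println(demo())
}
```

In this sketch the first packet is swallowed by the stalled link, which never reports idle again, and every later packet goes out via B->E.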

Arceliar avatar Arceliar commented on May 20, 2024

Marking for v0.3. I don't know if we can fix it (without requiring a lot of acknowledgments and extra book-keeping), but we should try to make sure we understand what exactly causes this bug, and take care of any low-hanging fruit as far as fixes, workarounds, or partial mitigation go.

Arceliar avatar Arceliar commented on May 20, 2024

I think I know of a way to work around this.

  1. Just before writing a packet, set a read timeout for a few seconds later.
  2. When reading a packet, start a timer (if it's currently stopped), which should be set to fire after less time than the above read timeout.
  3. When writing a packet, cancel that timer (when doing step 1, just before sending the packet).
  4. If the timer fires, send back a 0-length keep-alive packet to acknowledge that we're still here.

Currently, we send 0-length keep-alive packets about once every 4 seconds (if we have no other traffic to send), and time out after a user-configurable amount of time that defaults to (and can't be set to less than) 6 seconds. A side effect of this change is that we could drop that otherwise-pointless idle keep-alive traffic.

EDIT: It doesn't really fix the problem that we have to wait for a timeout, but we may be able to tune the timeout to less than it is now. Currently, the 6 second timeout is there to account for the 4 second keep-alive interval, plus a little wiggle room. We could maybe switch to something like a 1 second timer before firing keep-alive traffic, and a 2 or 3 second minimum (and default) read timeout.
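
The four steps above can be sketched in Go roughly as follows. This is a minimal sketch under stated assumptions, not the yggdrasil-go implementation: the `link` type, its methods, and the 2-byte length-prefix framing are all hypothetical. Two further assumptions are not spelled out in the steps: any successful read clears the read deadline, and a 0-length keep-alive is not itself acknowledged (otherwise two idle peers would ping-pong forever). The demo uses short durations so it finishes quickly.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"sync"
	"time"
)

// link is a hypothetical wrapper around one peer connection.
type link struct {
	conn        net.Conn
	readTimeout time.Duration // step 1: how long to wait for anything after a write
	ackDelay    time.Duration // step 2: must be shorter than the peer's readTimeout
	mu          sync.Mutex
	timer       *time.Timer // nil while the keep-alive timer is stopped
}

// writeFrame sends a 2-byte length prefix plus payload (assumed framing).
func (l *link) writeFrame(payload []byte) error {
	frame := make([]byte, 2+len(payload))
	binary.BigEndian.PutUint16(frame, uint16(len(payload)))
	copy(frame[2:], payload)
	_, err := l.conn.Write(frame)
	return err
}

// send writes a data packet.
func (l *link) send(payload []byte) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.timer != nil { // step 3: real traffic doubles as the acknowledgement
		l.timer.Stop()
		l.timer = nil
	}
	// Step 1: expect to hear something back within readTimeout.
	if err := l.conn.SetReadDeadline(time.Now().Add(l.readTimeout)); err != nil {
		return err
	}
	return l.writeFrame(payload)
}

// noteRead is called by the read loop after each frame arrives.
func (l *link) noteRead(payloadLen int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	_ = l.conn.SetReadDeadline(time.Time{}) // assumption: any read clears the deadline
	if payloadLen == 0 || l.timer != nil {
		return // assumption: keep-alives are not acknowledged; timer may already run
	}
	var t *time.Timer
	t = time.AfterFunc(l.ackDelay, func() { // step 2: start the timer
		l.mu.Lock()
		defer l.mu.Unlock()
		if l.timer != t {
			return // cancelled by send in the meantime
		}
		l.timer = nil
		_ = l.writeFrame(nil) // step 4: 0-length keep-alive
	})
	l.timer = t
}

// readFrame reads one length-prefixed frame and returns its payload.
func readFrame(c net.Conn) ([]byte, error) {
	hdr := make([]byte, 2)
	if _, err := io.ReadFull(c, hdr); err != nil {
		return nil, err
	}
	payload := make([]byte, binary.BigEndian.Uint16(hdr))
	_, err := io.ReadFull(c, payload)
	return payload, err
}

// demo pretends a 5-byte packet just arrived, optionally replies with real
// data, and returns the payload lengths the peer sees in the next 400ms.
func demo(reply bool) []int {
	a, b := net.Pipe()
	defer a.Close()
	defer b.Close()
	l := &link{conn: a, readTimeout: time.Second, ackDelay: 100 * time.Millisecond}
	l.noteRead(5)
	if reply {
		go l.send([]byte("hi"))
	}
	_ = b.SetReadDeadline(time.Now().Add(400 * time.Millisecond))
	var lens []int
	for p, err := readFrame(b); err == nil; p, err = readFrame(b) {
		lens = append(lens, len(p))
	}
	return lens
}

func main() {
	fmt.Println(demo(false), demo(true))
}
```

With the values suggested in the EDIT, `ackDelay` would be around 1 second and `readTimeout` around 2 or 3 seconds. The timer callback holds the mutex while writing, which keeps the sketch short; a real implementation would likely hand the keep-alive to the existing writer instead.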

neilalexander avatar neilalexander commented on May 20, 2024

Is this still a problem with the changes in v0.3.3, or the proposed changes in v0.3.4?

Arceliar avatar Arceliar commented on May 20, 2024

If it's working the way it's supposed to, then by v0.3.4 (v0.3.3+fixes) it should be routing around these things after a few seconds, instead of waiting for a timeout. Except probably in cases where the timed out link leads to a parent. But I need to test this.
