Comments (6)
I am curious whether the backpressure queue changes would have potentially resolved this in part - once a connection starts to block, I would expect that the queue would grow rather quickly, triggering a fairly fast route-around onto another link?
(Once the peer connection has timed out I assume that the queue is discarded rather than drained into another queue? I don't know if we might want to examine that.)
from yggdrasil-go.
Regarding your second point: as of the last major set of backpressure changes, there's no per-peer queue to drain other than whatever the OS manages as part of the TCP buffer. Per-destination queues are kept in the switch, and are fed to peer connections as those connections become available.
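To make the "fed to peer connections as they become available" part concrete, here is a minimal sketch of that pattern. It is not the actual yggdrasil-go code: peerWriter, dispatch, blockedConn and demo are hypothetical names made up for illustration, and the real switch keeps per-destination queues rather than a single slice.

```go
package main

import "io"

// peerWriter is a hypothetical stand-in for one peer connection's send loop.
type peerWriter struct {
	name string
	conn io.Writer   // a net.Conn in practice
	in   chan []byte // packets handed over by the switch, one at a time
}

// run announces the peer as idle, waits for a packet, writes it, and repeats.
// While Write is blocked the loop never gets back to the idle announcement.
func (p *peerWriter) run(idle chan<- *peerWriter) {
	for {
		idle <- p
		pkt := <-p.in
		if _, err := p.conn.Write(pkt); err != nil {
			return
		}
	}
}

// dispatch plays the role of the switch: each queued packet goes to
// whichever peer next reports itself idle. It returns the peer names used.
func dispatch(queue [][]byte, idle <-chan *peerWriter) []string {
	used := make([]string, 0, len(queue))
	for _, pkt := range queue {
		p := <-idle
		p.in <- pkt
		used = append(used, p.name)
	}
	return used
}

// blockedConn simulates a link whose TCP buffer is full: Write never returns.
type blockedConn struct{}

func (blockedConn) Write([]byte) (int, error) { select {} }

// demo routes five packets across one blocked peer (C) and one healthy peer (E).
func demo() []string {
	idle := make(chan *peerWriter)
	c := &peerWriter{name: "C", conn: blockedConn{}, in: make(chan []byte)}
	e := &peerWriter{name: "E", conn: io.Discard, in: make(chan []byte)}
	go c.run(idle)
	go e.run(idle)
	queue := [][]byte{{1}, {2}, {3}, {4}, {5}}
	return dispatch(queue, idle)
}
```

Calling demo() hands at most one packet to C (the one that gets stuck in Write) and everything else to E, since C never reports idle again.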
And you're right about the intended behavior: what should happen is that a net.Conn.Write() call will block, causing the connection to never inform the switch that it's idle and ready for more traffic. So the switch should route packets to another interface. What I suspect really happens is this:
- A stream of (TCP) traffic on some path has a link die, let's say it's A->B->C->D and link B->C dies.
- The stream keeps sending, and fills up the buffer (or window? I'm not sure which) for B->C, which causes the last send to block.
- Because B->C is blocked, the switch directs traffic to follow B->E instead, and we end up with something like A->B->E->D.
- D tries to send (TCP) acknowledgements back, which try to take D->C->B->A.
- The C->B side of the link isn't blocked yet, and acks are small, so they don't fill up the buffer quickly.
- Since the acks are stuck, A assumes D hasn't gotten anything, and throttles down hard.
- Eventually, B->C times out and gets disconnected, so TCP buffers can no longer eat packets.
- At some point after that, when the traffic stream's exponential backoff has worn out, A finally tries to send to D again and the normal flow resumes for the path A->B->E->D with acks successfully coming back via D->E->B->A.
More generally, any stream of traffic which fails to fill TCP buffers quickly enough, and backs off in response to packet loss, will stall. Something like VoIP may freeze until enough traffic has been sent to fill TCP buffers in both directions.
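A rough back-of-the-envelope illustration of why the two directions behave so differently. All three numbers below are assumptions picked for the example, not measurements: a 64 KiB send budget per direction (treating buffer and window as one byte budget), 1400-byte data packets, and 100-byte encapsulated acks.

```go
package main

// Hypothetical figures, chosen only to illustrate the asymmetry.
const (
	bufferBytes = 64 * 1024 // assumed send budget of the dead link
	dataBytes   = 1400      // assumed size of a full data packet
	ackBytes    = 100       // assumed size of an encapsulated ack
)

// packetsAbsorbed returns how many packets of packetBytes fit into a send
// budget of budgetBytes before the next Write would block.
func packetsAbsorbed(budgetBytes, packetBytes int) int {
	return budgetBytes / packetBytes
}
```

With these numbers packetsAbsorbed(bufferBytes, dataBytes) is 46 while packetsAbsorbed(bufferBytes, ackBytes) is 655, so the data direction blocks (and gets routed around) more than ten times sooner than the ack direction, which keeps swallowing acks into the dead link.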
This needs testing to confirm if it's really the problem, but I'm traveling right now, so I don't have equipment on hand to try that myself. If that really is the problem, then I'm not sure how I want to work around it. The easy fix would be to add some kind of application-level ack to ygg, which we'd need anyway if we ever redo the UDP layer, and then do something if we don't get acks back from a peer--maybe the connection doesn't tell the switch that it's idle if it hasn't received any traffic at all recently. We unfortunately can't just check if the TCP stream has gotten an ack, as far as I can tell.
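A minimal sketch of the "don't report idle if nothing has been received recently" idea, purely as an illustration. livenessGate and its methods are hypothetical names, and the 5 second window is an arbitrary placeholder rather than a value anyone has proposed.

```go
package main

import "time"

// recvWindow is a placeholder: how recently we must have heard from the
// peer before we are willing to tell the switch this link is idle.
const recvWindow = 5 * time.Second

// livenessGate is a hypothetical helper that remembers when the peer last
// sent us anything at all (data or an application-level ack).
type livenessGate struct {
	lastRead time.Time
}

// newLivenessGate treats the moment the connection is set up as the first read.
func newLivenessGate(now time.Time) *livenessGate {
	return &livenessGate{lastRead: now}
}

// noteRead records that traffic arrived from the peer at time now.
func (g *livenessGate) noteRead(now time.Time) {
	g.lastRead = now
}

// mayReportIdle reports whether the link has heard from its peer within
// recvWindow. A link that fails this check stays silent toward the switch,
// so traffic is routed elsewhere even though Write has not blocked yet.
func (g *livenessGate) mayReportIdle(now time.Time) bool {
	return now.Sub(g.lastRead) <= recvWindow
}
```

The send loop would call mayReportIdle(time.Now()) before announcing itself to the switch, and the read loop would call noteRead on every received packet. This only works if something keeps traffic flowing on a healthy but quiet link, which is where the acks (or keep-alives) come in.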
Marking for v0.3. I don't know if we can fix it (without requiring a lot of acknowledgments and extra book-keeping), but we should try to make sure we understand what exactly causes this bug, and take care of any low-hanging-fruit as far as fixes/workarounds/partial-mitigation goes.
I think I know of a way to work around this.
- Just before writing a packet, set a read timeout for a few seconds later.
- When reading a packet, start a timer (if it's currently stopped), which should be set to fire after less time than the above read timeout.
- When writing a packet, cancel that timer (when doing step 1, just before sending the packet).
- If the timer fires, send back a 0-length keep-alive packet to acknowledge that we're still here.
Currently, we send 0-length keep-alive packets about once every 4 seconds (if we have no other traffic to send), and time out after a user-configurable amount of time that defaults to (and can't be set to less than) 6 seconds. A side effect of this change is that we could drop that otherwise-pointless idle keep-alive traffic.
EDIT: It doesn't really fix the problem that we have to wait for a timeout, but we may be able to tune the timeout to less than it is now. Currently, the 6 second timeout is to account for the 4 second keep-alive interval, + a little wiggle room. We could maybe switch to something like a 1 second timer before firing keep-alive traffic, and a 2 or 3 second minimum (and default) read timeout.
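One possible reading of the four steps, written as a small state machine with explicit timestamps so it can be reasoned about without real sockets. This is a sketch, not the yggdrasil-go implementation. The names (linkTimers, beforeWrite, onRead, poll) are hypothetical, the 1 and 3 second values are the tentative ones from the EDIT, and three details are my own assumptions because the steps leave them open: reading anything clears the pending read timeout, the read timeout is only armed if it isn't already pending, and 0-length keep-alives neither arm the read timeout nor start the keep-alive timer (otherwise two idle peers would ping-pong forever).

```go
package main

import "time"

const (
	keepAliveDelay = 1 * time.Second // wait this long before acking with a keep-alive
	readTimeout    = 3 * time.Second // give up if nothing comes back in this time
)

type action int

const (
	none action = iota
	sendKeepAlive
	disconnect
)

// linkTimers tracks the two timers for one peer link.
type linkTimers struct {
	readDeadline time.Time // zero: not currently waiting on the peer
	keepAliveAt  time.Time // zero: keep-alive timer stopped
}

// beforeWrite is called just before a packet of payloadLen bytes is sent.
func (t *linkTimers) beforeWrite(now time.Time, payloadLen int) {
	t.keepAliveAt = time.Time{} // step 3: outgoing traffic doubles as our ack
	// Step 1, with the assumptions above: only real traffic expects a reply,
	// and a deadline that is already pending is left alone.
	if payloadLen > 0 && t.readDeadline.IsZero() {
		t.readDeadline = now.Add(readTimeout)
	}
}

// onRead is called for every packet received from the peer.
func (t *linkTimers) onRead(now time.Time, payloadLen int) {
	t.readDeadline = time.Time{} // assumption: any read proves the peer is alive
	// Step 2: start the keep-alive timer if it's currently stopped.
	if payloadLen > 0 && t.keepAliveAt.IsZero() {
		t.keepAliveAt = now.Add(keepAliveDelay)
	}
}

// poll reports what, if anything, is due at time now. On sendKeepAlive the
// caller sends a 0-length packet via beforeWrite(now, 0) (step 4).
func (t *linkTimers) poll(now time.Time) action {
	switch {
	case !t.readDeadline.IsZero() && !now.Before(t.readDeadline):
		return disconnect
	case !t.keepAliveAt.IsZero() && !now.Before(t.keepAliveAt):
		return sendKeepAlive
	}
	return none
}
```

In a real link the two fields would map onto conn.SetReadDeadline and a time.Timer; keeping them as plain timestamps just makes the ordering of events easy to check. Under these assumptions an idle link exchanges no keep-alives at all, and a link that sends traffic without hearing anything back is dropped after readTimeout.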
Is this still a problem with the changes in v0.3.3, or the proposed changes in v0.3.4?
If it's working the way it's supposed to, then by v0.3.4 (v0.3.3+fixes) it should be routing around these things after a few seconds instead of waiting for a timeout, except probably in cases where the timed-out link leads to a parent. But I need to test this.