utp's Issues

UTP Interop

More of a comment than an actual issue, but I think this could be of value to those of us who are debugging this library at the moment.

I wanted to confirm my suspicion as to what was causing the difference in performance when running glados against fluffy / trin. For context, running glados against fluffy resulted in drastically improved performance when compared to running glados against trin. It appeared as though fluffy was having no trouble with utp transfers, while trin was struggling. Since trin-bridge is the only bridge running, the conclusion was that utp txs for trin -> fluffy were working just fine, whereas trin -> trin was broken (for bodies & receipts ... aka any significantly long utp tx).

Results from running trin-bridge in backfill mode for 15 minutes. trin-bridge is run in isolation against a single client. Each utp tx is logged and marked as a success or failure.

trin-bridge -> trin

[Screenshot: success/failure results, 2023-05-11]

trin-bridge -> fluffy

[Screenshot: success/failure results, 2023-05-24]

For whatever reason, it appears as though trin -> fluffy utp txs work flawlessly, whereas trin -> trin utp txs are problematic. Hopefully this can help shed some light on where the bugs might reside.

Fix timestamp_difference_microseconds: it is always sending 1000000 microseconds even on long transfers

log: 0x5558be365860 127.0.0.1:9077 057733 Got ST_DATA. seq_nr:17517 ack_nr:17766 state:CONNECTED timestamp:4289278524 reply_micro:1000000
log: 0x5558be365860 127.0.0.1:9077 057733 acks:0 acked_bytes:0 seq_nr:17767 cur_window:0 cur_window_packets:0 relative_seqnr:0 max_window:1382 min_rtt:2783138807 rtt:0
log: 0x5558be365860 127.0.0.1:9077 057733 fast_resend_seq_nr:17767
log: 0x5558be365860 127.0.0.1:9077 057733 acks:0 acked_bytes:0 seq_nr:17767 cur_window:0 cur_window_packets:0 
log: 0x5558be365860 127.0.0.1:9077 057733 Got Data len:960 (rb:0)

timestamp_difference_microseconds (reply_micro:1000000 above) is always one second when sending from ethereum/utp to bittorrent/libutp.

log: 0x55e16d23f860 127.0.0.1:9077 051863 Got ST_DATA. seq_nr:20615 ack_nr:17766 state:CONNECTED timestamp:1513208380 reply_micro:4221
log: 0x55e16d23f860 127.0.0.1:9077 051863 acks:0 acked_bytes:0 seq_nr:17767 cur_window:0 cur_window_packets:0 relative_seqnr:0 max_window:1382 min_rtt:2783138807 rtt:0
log: 0x55e16d23f860 127.0.0.1:9077 051863 fast_resend_seq_nr:17767
log: 0x55e16d23f860 127.0.0.1:9077 051863 acks:0 acked_bytes:0 seq_nr:17767 cur_window:0 cur_window_packets:0 
log: 0x55e16d23f860 127.0.0.1:9077 051863 Got Data len:449 (rb:0)

The log above is fluffy's uTP implementation sending to bittorrent/libutp; fluffy sends an accurate timestamp_difference_microseconds.

We noted before that this happened on small transfers; this issue is to confirm that it affects every data packet sent by ethereum/utp, which would cause our congestion control to not work properly.

To my knowledge it doesn't matter if the numbers are weird, as long as they reflect the relative difference measured on the local system, which we are currently not doing.
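
For reference, a minimal sketch of how the field is defined in BEP 29 (the uTP spec): when a packet arrives, record the wrapping difference between the local clock, truncated to 32 bits of microseconds, and the remote timestamp, then echo that value back as timestamp_difference_microseconds (reply_micro) on the next outgoing packet. The function names below are assumptions for illustration, not the crate's actual identifiers.

use std::time::{SystemTime, UNIX_EPOCH};

// Local clock in microseconds, truncated to 32 bits as the uTP header requires.
fn now_micros() -> u32 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the UNIX epoch")
        .as_micros() as u32
}

// Called whenever a packet arrives: remember the (wrapping) one-way delay sample.
fn record_reply_delay(remote_timestamp_micros: u32) -> u32 {
    // Wrapping subtraction: only the relative difference matters, so wrap-around is fine.
    now_micros().wrapping_sub(remote_timestamp_micros)
}

// Called when building an outgoing packet: echo the most recent delay sample
// back to the peer as timestamp_difference_microseconds (reply_micro).
fn outgoing_timestamp_difference(last_reply_delay: u32) -> u32 {
    last_reply_delay
}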

Document public API and working examples

Complete documentation of the public API. All functions with the visibility identifier pub should be documented with rustdoc. The example in the README should be replaced by a working example. Having a folder with examples that can actually be executed is nice, though even better are async integration tests that are then configured to execute in CI, in addition to the existing sync state transition tests. This issue replaces #47.
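
As a minimal illustration of the requested rustdoc coverage (using a made-up function, not anything from this crate), every pub item would carry a doc comment along these lines:

/// Doubles the given value.
///
/// Every `pub` item should carry a comment like this one; where practical,
/// a short usage example that doubles as a doctest is even better.
pub fn double(x: u32) -> u32 {
    x * 2
}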

Congestion control timeout grows unbounded

@jacobkaufmann Just leaving this here to see if it sparks any inspiration...

There seems to be a case where the congestion control timeout grows without bounds...

In this case, I'm printing out sqnum & timeout at https://github.com/ethereum/utp/blob/master/src/conn.rs#L1088, which results in the following panic:

sqnum: 9674, timeout: 67108864s
sqnum: 9675, timeout: 134217728s
thread 'tokio-runtime-worker' panicked at 'invalid deadline; err=Invalid',
/Users/njg/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-util-0.6.10/src/time/delay_queue.rs:

In the tokio-util docs for insert_at(), it says: "This function panics if 'when' is too far in the future."

My sense is that there's some edge case which causes the timeout to grow unbounded, which I'll set about hunting down. But if you have any sense as to where the problem might be, I'd be interested to hear it.
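
One plausible mitigation, sketched under the assumption that the timeout doubles on each consecutive expiry (the constant and function names below are made up): clamp the backoff before handing a deadline to the DelayQueue, so it can never exceed what insert_at() accepts.

use std::time::Duration;

// An upper bound comfortably below DelayQueue's limit on how far in the future a deadline may be.
const MAX_TIMEOUT: Duration = Duration::from_secs(60);

// Double the congestion-control timeout on each consecutive expiry, but never past the cap.
fn next_timeout(current: Duration) -> Duration {
    current.saturating_mul(2).min(MAX_TIMEOUT)
}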

Naked Unwraps

I don't have a reproducible issue that I can show, but I can conclusively say that as I go about debugging this library I do run into panicked threads due to naked unwraps. In trin we took a firm stance against the use of naked unwraps. According to clippy, there are 31 naked unwraps in utp, some of which have documentation explaining their use and some of which don't. There are cases where naked unwraps are used in methods that do not return a Result, which complicates resolving them.

I guess I'm just curious about the decision to allow naked unwraps in this library. Why? What were the criteria for using an unwrap vs. implementing error handling for individual cases? It seems to me that resolving these unwraps would improve the robustness of this library, but I would like some confirmation before undertaking the task.
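
For concreteness, a hedged illustration of the kind of change being asked about, not code from this crate, where a panicking unwrap becomes a Result the caller can handle:

use std::net::SocketAddr;

// Before: a malformed address panics the whole task.
fn parse_peer_unchecked(s: &str) -> SocketAddr {
    s.parse().unwrap()
}

// After: the signature changes to Result, and the caller decides what a failure means.
fn parse_peer(s: &str) -> Result<SocketAddr, std::net::AddrParseError> {
    s.parse()
}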

Under construction banner

This crate is not yet safe to recommend for production use, and this should be made known to users in the README. A banner stating 'under construction, experimental use only' is common practice for this case.

More tests in tests/socket, with explicit dropped packets

One of the important effects of running the high-concurrency tests is that (in some environments) they trigger packet loss. Seeing how utp responds in this situation is important.

We might be able to get more consistent results, faster tests, and/or new discoveries by having more explicit packet-dropping patterns. The basic idea would be to use a socket layer where we control packet loss according to some repeatable pattern (i.e. write a wrapper UdpLossy around UdpSocket and use UtpSocket::with_socket(lossy_udp) instead of UtpSocket::bind()); a rough sketch follows the list below.

Some possible options for packet-loss patterns:

  • drop every Nth packet, for example, N=3: SEND, SEND, DROP, SEND, SEND, DROP
  • flip-flop between drop N & send N, for example N=2: SEND, SEND, DROP, DROP, SEND, SEND, DROP, DROP

Naturally, test a bunch of different N. Also, try various combinations of these on only inbound, only outbound, and both.
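
A rough sketch of such a wrapper, assuming a plain tokio UdpSocket underneath; the exact transport interface that UtpSocket::with_socket expects may differ, so treat this as illustrative only.

use std::io;
use std::net::SocketAddr;
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::net::UdpSocket;

// Drops every Nth outbound datagram according to a repeatable pattern.
struct UdpLossy {
    inner: UdpSocket,
    sent: AtomicU64,
    drop_every: u64, // N: every Nth send_to is silently dropped
}

impl UdpLossy {
    fn new(inner: UdpSocket, drop_every: u64) -> Self {
        Self { inner, sent: AtomicU64::new(0), drop_every }
    }

    async fn send_to(&self, buf: &[u8], target: SocketAddr) -> io::Result<usize> {
        let n = self.sent.fetch_add(1, Ordering::Relaxed) + 1;
        if n % self.drop_every == 0 {
            // Pretend the datagram went out; UDP offers no delivery guarantee anyway.
            return Ok(buf.len());
        }
        self.inner.send_to(buf, target).await
    }

    async fn recv_from(&self, buf: &mut [u8]) -> io::Result<(usize, SocketAddr)> {
        self.inner.recv_from(buf).await
    }
}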

Full send_buf prevents sending of FIN

I've spent some time playing around with this library. Basically, I've been running a bare-bones app locally that creates a sender & receiver utp socket and tries to transfer 100000 bytes. This transfer has failed consistently. While it might be user error, I'm fairly confident I'm using the library the way it's intended.

The repeated behavior I'm seeing is that the sender/initiator processes the transfer successfully and closes the stream; however, the recipient never fully processes the stream (via read_to_eof) and hangs until it times out.

It appears to me as though a FIN is never sent from the sender to the recipient. This is because the sender never clears its send_buf. The evaluations here and here always fail since send_buf is never emptied, so the FIN is never sent. Something appears to be happening here where the send_buf is not properly cleared.

I still have some understanding to improve upon, so I'm not sure if these thoughts make any sense.

  • I haven't grokked the intended behavior between send_buf and pending_writes
    • while the pending_writes queue seems to empty correctly, the send_buf never fully clears... why?
  • why is send_buf a VecDeque<Vec<u8>> and not just VecDeque<u8>?

But from what I can tell from experimenting with this library, this behavior is consistent and it's what prevents the FIN packet from being sent. Mostly just leaving this here so that I can try to organize my thoughts, and I'll pick it back up after the weekend, but if anybody has any clues, ideas, or pointers on what I'm misunderstanding, that's definitely appreciated.

Bug Tracking

This is just an issue for keeping track of bugs found; whoever has access permissions can edit this.

  • 2^16 - 1 packets bug: we are currently unable to send more than that. A PR replicating the problem is #60, and an issue comment addressing why it happens is in #44 (comment)
  • deadlock/halt bug on data bigger than the buffer limit; information about it can be found in #61
    - logical add/sub overflow crash bug in the function documented as "Returns the number of bytes available in the buffer." This is quick to fix: compute T + offset - pending (see the sketch below)
  • sent ACKs are 1-to-1 for acknowledging data packets instead of 1-to-many, which would use less bandwidth

This list may be updated; it is just a reference for bugs found so far that aren't considered a priority but should be fixed at one point or another. There are a few other issues that people are investigating, but those are still being researched.
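
For the add/sub overflow item above, a minimal illustration using placeholder names taken from the "T + offset - pending" note (they are not necessarily the crate's actual fields). The point is to order the operations, or use saturating arithmetic, so that no intermediate value underflows and panics in a debug build.

// Add before subtracting so the intermediate value never underflows,
// as long as pending <= t + offset.
fn available(t: usize, offset: usize, pending: usize) -> usize {
    t + offset - pending
}

// Or make the edge case explicit and clamp at zero instead of panicking.
fn available_saturating(t: usize, offset: usize, pending: usize) -> usize {
    (t + offset).saturating_sub(pending)
}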

ethereum/utp shutdown behavior

What do we want to do with our shutdown behavior?

Currently, if all of the outgoing buffer has been sent, we just instantly send a FIN.

Compare that to libutp, which doesn't try to send a FIN packet on its own; it sends keep-alive ACK packets and lets the user decide when we are finished.

Change shutdown() to block until the written data is sent & acked

Callers (like trin) are interested in whether the written data has been sent (and ACK'd by the receiver). One option would be to expose a new flush() method that waits until the buffer is empty. It adds new complexity in a write/flush/write scenario that is not of interest (to trin at least), so let's just fold it into 'shutdown()' and not add a new API.

Instead, the new behavior would be that shutdown() will wait to return until all previously-written data has been sent and acknowledged (or, of course, will return early if the connection exits due to a failure). Currently, shutdown returns optimistically, then the connection tries to wrap up in the background.

One disadvantage of this approach is that the shutdown might report an error while handling the "wrap-up" steps, or if the peer disconnected between flushing all data and our attempt to shutdown. But that ought to be called a successful flush, from trin's point of view. So we might have to carefully parse the different kinds of errors coming out of a shutdown (or add new types of error). Using a new flush() method instead would mean we could return successfully and immediately if our peer had previously ACK'd all data, even if they had already hung up. That would be convenient and reliable for trin.

add socket-level config

add socket-level configuration to constructor(s)

the config could include the following (a hypothetical shape is sketched after the list):

  • max UDP packet size
  • max connections
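
A hypothetical shape for such a config, as a sketch only; the struct name, field names, defaults, and how it would be threaded into the constructor(s) are all assumptions.

// Socket-wide settings, distinct from the per-connection ConnectionConfig.
#[derive(Clone, Debug)]
pub struct UtpSocketConfig {
    /// Largest UDP datagram the socket will send or accept, in bytes.
    pub max_udp_packet_size: usize,
    /// Maximum number of simultaneously open connections on this socket.
    pub max_connections: usize,
}

impl Default for UtpSocketConfig {
    fn default() -> Self {
        Self {
            max_udp_packet_size: 1500,
            max_connections: 1024,
        }
    }
}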

Improve README guide & document public API

A lot of the difficulty I'm having with this library is that it's hard to understand how to use it. The example in the README is essentially pseudocode: it doesn't compile, and it uses types that are not valid according to the compiler (e.g. CustomSocket / let remote = SocketAddr::from(..); / what is udp_socket?). This makes it a challenging resource to rely on while debugging.

A lot of this knowledge resides in @jacobkaufmann's head, and it would be useful if you could improve the documentation to explicitly and correctly show the public API for this library.

I understand that the README example contains enough context that I should be able to interpret the public API, but explicit, correct examples are extremely helpful as a reference. Simple examples for a uTP transfer using both connect() and connect_with_cid() would be very helpful.

As an aside, I'm also curious as to why both connect() and connect_with_cid() APIs are available. What are the use cases for each method? Is using one preferable to the other?

Bug: stream needs to be spawned in a new thread

The connect method on UtpSocket<P> fails when called in the same thread in which we bound the UtpSocket.
This code, with the stream created in tokio::spawn and run in stream.rs, passes:

#[cfg(test)]
mod test {
    use crate::conn::ConnectionConfig;
    use crate::socket::UtpSocket;
    use std::net::SocketAddr;
    #[tokio::test]
    async fn test_start_stream() {
        // set-up test
        let sender_addr = SocketAddr::from(([127, 0, 0, 1], 3400));
        let receiver_address = SocketAddr::from(([127, 0, 0, 1], 3401));

        let config = ConnectionConfig::default();
        let sender = UtpSocket::bind(sender_addr).await.unwrap();

        tokio::spawn(async move {
            let mut tx_stream = sender.connect(receiver_address, config).await.unwrap();
            // write 100k bytes data to the remote peer over the stream.
            let data = vec![0xef; 100_000];
            tx_stream
                .write(data.as_slice())
                .await
                .expect("Should send 100k bytes");
        });
    }
}

and this code, creating a stream in the same thread, fails with the error: thread 'stream::test::test_start_stream' panicked at 'called Result::unwrap() on an Err value: Kind(TimedOut)', src/stream.rs:137:76. On my branch improving error handling, the error is more helpful: thread 'stream::test::test_start_stream' panicked at 'called Result::unwrap() on an Err value: ConnInit(MaxConnAttempts)', src/stream.rs:146:76.

#[cfg(test)]
mod test {
    use crate::conn::ConnectionConfig;
    use crate::socket::UtpSocket;
    use std::net::SocketAddr;
    #[tokio::test]
    async fn test_start_stream() {
        // set-up test
        let sender_addr = SocketAddr::from(([127, 0, 0, 1], 3400));
        let receiver_address = SocketAddr::from(([127, 0, 0, 1], 3401));

        let config = ConnectionConfig::default();
        let sender = UtpSocket::bind(sender_addr).await.unwrap();

        let mut tx_stream = sender.connect(receiver_address, config).await.unwrap();
        // write 100k bytes data to the remote peer over the stream.
        let data = vec![0xef; 100_000];
        tx_stream
            .write(data.as_slice())
            .await
            .expect("Should send 100k bytes");
    }
}

The reason why this is the case needs to be identified; it might be the root of several bugs. It could be that the explanation is straightforward, in which case the thread-spawning code should simply be abstracted away into the connect function, since connect is part of the public API and this peculiar requirement is not documented.
