
draft-ietf-ippm-responsiveness's People

Contributors

bjj-te, cpaasch, felixgaudin, hawkinsw, lpardue, mrdvt92, randall, richb-hanover, stuartcheshire


draft-ietf-ippm-responsiveness's Issues

Citations and references

We don't have any references yet, but there is quite a bit of related work. We need to make sure to cite it.

TCP and TLS handshakes are possibly under-defined

The spec says:

the test measures the time needed to establish a TCP connection on port 443, establish a TLS context using TLS 1.3 [RFC8446],

There are several possible variations here. Are we expecting clients to game this, or to stick rigidly to some well-defined interactions? I'm mainly thinking about things like session resumption, or optimizing TLS handshakes in various ways such as certificate compression or short certificate chains.

Include Discussion of Expected Operational Scenarios

From IPPM adoption call feedback:

  • There may be no bottleneck buffer; the test/metric needs to detect that and act appropriately.
  • One might be interested in the metric while the test adds only an insignificant load. While the receiver tries to consume service X, what is the metric seen from Y?
    (This may help clarify whether the trouble is caused by the access link or by another network section.)

HTTP/2 Request and response prioritization

This is most relevant when probe requests are made on existing connections. It's touched upon in the text

At the HTTP/2 layer it is important that the load-generating data is not interfering with the latency-measuring probes. For example, the different streams should not be stacked one after the other but rather be allowed to be multiplexed for optimal latency.

But I think you probably need to say more about this as other HTTP/2 implementations come into the mix.

Flaw in “Working Conditions” algorithm

I believe we have a flaw in the “Working Conditions” algorithm.

This is the challenge of congestion control (i.e., rate adaptation):

  1. As the sender increases the amount of data in flight, the goodput goes up until the BDP of the pipe is filled. Call this point A on the flightsize graph. This is the ideal target operating point.

  2. As the sender continues to increase the amount of data in flight, the goodput remains flat, but the excess data sits in queues, causing round-trip delay to go up, until a queue overflows and a packet is lost. Call this point B on the flightsize graph. This is the operating point for Reno or CUBIC.

By having our test add TCP flows until the goodput stabilizes, we are inadvertently seeking operating point A. In a sense, we have re-created a crude version of BBR. That is not what Reno or CUBIC do. They keep pushing until they find point B.

The result is that we may under-report the worst-case queueing delay.

If we want to find point B (maximum queue depth) we need to keep increasing traffic until the delay stops increasing, and not halt once the goodput stops increasing.

I propose a modification to the algorithm: keep adding TCP connections once per second, measuring the new-connection and in-connection round-trip delays as we go, and stop adding new TCP connections after we’ve experienced four consecutive seconds with no further increase in any of our four round-trip delay metrics beyond the respective maximums we’ve already measured. At that point, the respective maximums we’ve recorded are the values we use to report the result.
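The stopping rule proposed above could be sketched roughly as follows. This is a hedged illustration only: the function and metric names are invented, while the one-second pacing and the four-second plateau come straight from the proposal.

```python
# Illustrative sketch of the proposed delay-based stopping rule.
# sample_rtts() is assumed to return a dict with the four round-trip
# delay metrics (seconds); add_connection() is assumed to open one more
# load-generating TCP connection. Neither name comes from the draft.

def run_until_delay_plateaus(sample_rtts, add_connection, max_seconds=60):
    maximums = {}
    stable_seconds = 0
    for _ in range(max_seconds):
        add_connection()            # one new connection per second
        rtts = sample_rtts()
        increased = False
        for name, value in rtts.items():
            if value > maximums.get(name, 0.0):
                maximums[name] = value
                increased = True
        # Reset the plateau counter whenever any metric set a new maximum.
        stable_seconds = 0 if increased else stable_seconds + 1
        if stable_seconds >= 4:     # four consecutive quiet seconds
            break
    return maximums                 # the per-metric maximums are reported
```

The returned per-metric maximums are exactly the values the proposal says should be used to report the result.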

Add section on impact of implementation details and how to separate client/server from network bufferbloat

Feedback from Greg Mirsky (@GregMirsky):

Hi Christoph,
thank you for putting your thought into my comments. Your understanding is absolutely correct and the new section, as you've outlined it, would make the document even more useful to a reader. With that plan in place, I support the adoption of the draft and would help with review and comments.

Regards,
Greg

On Tue, Feb 15, 2022 at 4:11 PM Christoph Paasch <[email protected]> wrote:
Hello Greg,

On Feb 8, 2022, at 1:10 PM, Greg Mirsky <[email protected]> wrote:

Hi Christoph,
apologies for the belated response, and thank you for sharing interesting details of your use of the measurement method. I think that if the measurement method can not only provide the Round-trips Per Minute (RPM) metric but also expose the network-propagation and residential components of the round-trip delay, then it seems to me that the scope of the draft is aligned with the charter of the IPPM WG and I'll be in favor of the WG adoption of the work.
What do you think? What is the opinion of the authors and the WG?

I am assuming that with "residential components" you mean the server/client-side contribution to the measured latency, right?

In that case, yes, the method does allow separating these, as latency probes are sent both on the load-generating connections and on separate connections. The difference between the two represents the "server-side contribution" to the latency.

I think what would be helpful is a section in the draft that explains the different sources of latency (network, server, client), how they affect the final RPM number, and how one can separate out these components. It is also important to understand that the results are highly implementation-dependent, and explaining that in this section should help, I believe.

Would that be in line with what you are looking for?

Thanks,
Christoph
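The separation Christoph describes could be computed along these lines. This is an illustrative sketch: the function name and the use of the median as the aggregate are assumptions, not something the draft specifies.

```python
# Hypothetical sketch: probes are sent both on the load-generating
# connections and on fresh, separate connections. The difference between
# the two aggregates approximates the "server-side contribution", while
# the separate-connection probes reflect the network path under load.

from statistics import median

def separate_latency_components(rtts_on_load_generating, rtts_on_separate):
    """Both inputs are lists of probe round-trip times in seconds."""
    on_load = median(rtts_on_load_generating)
    on_separate = median(rtts_on_separate)
    return {
        "network_under_load": on_separate,
        "endpoint_contribution": max(0.0, on_load - on_separate),
    }
```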

Define canonical ranges for RPM measurements

I hope that this is a valid question/comment. I am thankful for the work that you have done with this draft!

Alongside the numerical result of its calculation, the macOS networkQuality client "describes" the RPM measurement (e.g., Responsiveness: Medium (234 RPM)). It seems like a good idea for this standard to define the mapping between RPM and adjective so that all users of the protocol can communicate their results consistently and without ambiguity.

I would assume that such a mapping exists within the source code for networkQuality and could be imported directly into the draft.

Again, I hope that this is valid and not a waste of your time!
Will

cc @cpaasch @randall
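As a purely hypothetical illustration of such a mapping: these thresholds are invented for the example (chosen only so that the quoted "Medium (234 RPM)" output comes out consistent); the real thresholds would have to be taken from the networkQuality source.

```python
# Hypothetical RPM-to-adjective mapping. The threshold values below are
# NOT from networkQuality or the draft; they are placeholders to show
# what a standardized mapping could look like.

def describe_rpm(rpm):
    if rpm < 100:
        return "Low"
    if rpm < 800:
        return "Medium"
    return "High"
```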

Define What We Are Measuring, Bufferbloat

From @GregMirsky on IPPM adoption call feedback:

as I understand it, the proposed new responsiveness metric is viewed as the single indicator of a bufferbloat condition in a network. As I recall, the discussion at the Measuring Network Quality for End-Users workshop and on the mailing list indicated that there is no consensus on which behaviors and symptoms can reliably signal bufferbloat. It seems reasonable to first define what is being measured and characterized by the responsiveness metric. Having a document that discusses and defines bufferbloat would be great.

Capture considerations for HTTP responses to large/infinite uploads

Some HTTP implementations might wait until an entire upload object is received before they will respond with any status code. That's not ideal, even if it doesn't really affect the purpose of the upload in saturating the link. You might want to address this point given you are recommending very large uploads.

RPM is not only about bufferbloat

Feedback from Erik Auerswald:

Expressing 'bufferbloat as a measure of "Round-trips per Minute" (RPM)'
exhibits (at least) two problems:

  1. A high RPM value is associated with few bufferbloat problems.

  2. A low RPM value may be caused by high minimum delay instead of
    bufferbloat.

I think that RPM (i.e., under working conditions) measures a network's
usefulness for interactive applications, but not necessarily bufferbloat.
I do think that RPM is in itself more generally useful than minimum
latency or bandwidth.

A combination of low minimum latency with low RPM value strongly hints
at bufferbloat. Other combinations are less easily characterized.

Bufferbloat can still lie in hiding, e.g., when a link with bufferbloat
is not yet the bottleneck, or if the communications end-points are not
yet able to saturate the network in between. Thus high bandwidth can
result in high RPM values despite (hidden) bufferbloat.

The "Measuring is Hard" section mentions additional complications.

All in all, I do think that "measuring bufferbloat" and "measuring RPM"
should not be used synonymously. The I-D title clearly shows this:
RPM is measuring "Responsiveness under Working Conditions" which may be
affected by bufferbloat, among other potential factors, but is not in
itself bufferbloat.

Under the assumption that only a single value (performance score) is
considered, I do think that RPM is more generally useful than bandwidth
or idle latency.

On a meta-level, I think that the word "bufferbloat" is not used according
to a single self-consistent definition in the I-D.

We definitely need to change our wording here.

Consider documenting how cloud architectures might work in practice

There seems to be some inherent assumptions in the specification about where the test server(s) reside, and how clients communicate with them. These assumptions might not hold for certain kinds of scaling deployment (e.g. clouds or CDNs).

For example, it seems reasonable that a client making several independent HTTP/2 connections to the same authority would land on the same Internet path. However, load balancing after that system boundary could add a lot of undetectable timing variation to servicing of the connection and the requests within that connection.

The testing strategy seems resilient to these forms of architectural differences. But I wonder if we need to be clearer for client implementers on how the measurements could be affected, and whether the aggregation calculation needs to be a bit more advanced to accommodate such cases.

It's not 100% clear what "Saturation" is

"4.1.3. Reaching saturation" says "Saturation means not only that the load-bearing connections are utilizing all the capacity, but also that the buffers are completely filled".
While "saturation" == "buffers are completely filled" is a clear definition, it relies on something essentially impossible to achieve in practice: completely (100%, not 99.999999%) filled buffers. That makes the definition clear, but useful only in theory.

As 4.1.2 says, "loss-based TCP congestion control algorithms aggressively reacts to packet-loss by reducing the congestion window. This reaction will reduce the queuing in the network". If we accept that we can't "ensure that buffers are actually full (100%, not 99.999999%) for a sustained period of time" (2. Measuring is hard), then we lose the clear "saturation" == "buffers are completely filled" definition and need something less simple, but usable in practice.

The description of "4.1.4. Final algorithm" says:

The algorithm takes into account that throughput gradually increases as TCP connections go through their TCP slow-start phase. Throughput-increase eventually stalls for a constant number of TCP-connections - usually due to receive-window limitations. At that point, the only means to further increase throughput is by adding more TCP connections to the pool of load-bearing connections. This will then either result in a further increase in throughput, or throughput will remain stable. In the latter case, this means that saturation has been reached and - more importantly - is stable.

But the latter part, "In the latter case, this means that saturation has been reached and - more importantly - is stable", is not actually true if "saturation" is defined as literally 100% full. Actually it doesn't need to be true even if you define it as only 17% full.

For example, let's suppose a 100 MiB bottleneck buffer in a 1Mbps, 1ms RTT, path.
4.1.2 mentions "TCP window-size constraints of 4MB". So following the "Final algorithm" from 4.1.4 we would:

  • Create 4 load-bearing connections, with a total window of 4 x 4 MB = 16 MB, more than enough to fill the 1 Mbps link.
  • After 4 seconds of computing the moving average we would be measuring 1 Mbps. There would be no losses, because the queue would never fill (no more than 16 MB can ever be in it), so it would be a perfect 1 Mbps.
  • At second 8 we would still be measuring 1 Mbps, so we would declare "stable saturation", even though the buffer is more empty than full.
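The arithmetic in the example above can be checked directly. This quick sketch uses the numbers from the text and assumes the connections stay receive-window-limited (it also treats the 4 MB windows as MiB, as the text's units suggest):

```python
# Sanity check of the example: with receive-window-limited senders, at
# most the sum of the windows can be in flight, so the standing queue
# can never approach the 100 MiB buffer.

link_bps = 1_000_000                   # 1 Mbps bottleneck
rtt_s = 0.001                          # 1 ms RTT
bdp_bytes = link_bps / 8 * rtt_s       # bandwidth-delay product: 125 bytes

total_window = 4 * 4 * 1024 * 1024     # 4 connections x 4 MiB windows
buffer_bytes = 100 * 1024 * 1024       # 100 MiB bottleneck buffer

max_queued = total_window - bdp_bytes  # data in flight beyond the BDP
print(max_queued / buffer_bytes)       # ~0.16: buffer at most ~16% full
```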

But the abstract says we want to measure "bufferbloat the way common users are experiencing it today". So at some point we need to ask ourselves how much we really want "the buffers completely filled". Surely there will always be a bottleneck buffer so extravagantly oversized that no user will ever run enough 4 MB window connections to fill it.

I don't have a perfect generic solution for this. It's true, measuring it is hard; even deciding what exactly should be measured is hard. But the current document doesn't seem to always agree with itself, and if I understood the final algorithm correctly it doesn't actually do what it says it does.

Clarify value of using timeless units to express responsiveness

From @GregMirsky on IPPM adoption call feedback:

Then, I find the motivation not to use time units to express the responsiveness metric not convincing:
   "Latency" is a poor measure of responsiveness, since it can be hard
   for the general public to understand.  The units are unfamiliar
   ("what is a millisecond?") and counterintuitive ("100 msec - that
   sounds good - it's only a tenth of a second!").

What happens if the conditions for saturation are not met within 20 seconds?

Hello again!

As I said before, I hope that this is a meaningful/helpful comment/question: What happens if the conditions for saturation are not met within 20 seconds?

While not likely, it is possible that the client cannot saturate a user's connection according to the algorithm (in Section 4.1) in a fixed amount of time. Because one of the goals of the standard is to create a metric that can be measured quickly (Our target is 20 seconds.), it seems like a good idea for the standard to a) specify a time limit for the test and b) specify what happens if that time limit is exceeded.

For the client I am writing, I have adopted this practice:

  1. The client attempts to saturate the connection for a fixed period of time (user defined from the CLI with a default of 20 seconds) using the algorithm described in Section 4.1.
  2. If the client cannot saturate the connection within that time, the connection is considered provisionally saturated and measurement of RPM proceeds as if the connection is actually saturated.
  3. The client gets another 5 seconds (the user cannot change this value) to perform the series of probes described in Section 4.1 and calculate the RPM.

This, of course, is not to say that this is the best idea. Nor may it even be a good idea. I'm new to this whole process and it takes me a while to get my wheels spinning (get it?).

I hope this is helpful!
Will

cc @randall @cpaasch
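The three-step practice described above might be sketched like this (the function names are placeholders, not from the draft):

```python
# Illustrative sketch of the described client behavior: bound the
# saturation phase, then proceed with the RPM measurement either way,
# flagging results obtained under "provisional saturation".

import time

def measure_rpm_with_timeout(try_to_saturate, measure_rpm,
                             saturation_timeout_s=20, probe_window_s=5):
    """try_to_saturate() runs one iteration of the Section 4.1 algorithm
    and returns True once stable saturation is detected; measure_rpm(w)
    probes for w seconds and returns the RPM. Both are assumed names."""
    deadline = time.monotonic() + saturation_timeout_s
    saturated = False
    while time.monotonic() < deadline:
        if try_to_saturate():
            saturated = True
            break
    # Measure regardless; mark the result provisional if we timed out.
    rpm = measure_rpm(probe_window_s)
    return {"rpm": rpm, "provisional": not saturated}
```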

How are the two directions aggregated

According to the current spec:

"Thus, we recommend testing uplink and downlink sequentially.
Parallel testing is considered a future extension."

Load is generated sequentially for both directions, which implies that there should be two sets of delay measurements. Section "4.2.1. Aggregating the Measurements" however does not tackle how to aggregate these measurements for the two directions.
Real-life links can and do show latency-under-load increases that differ between the two directions; here is an example from a VDSL2 link, showing the gping results towards (the heavily anycasted) 8.8.8.8 during a speedtest that sequentially saturated both directions:
https://forum.openwrt.org/uploads/default/original/3X/8/0/804455ac122f5cf58cebba52c6d6286f6fae75ad.jpeg

It should be clear that the induced delay differs noticeably between the two parts of the speedtest, and hence the two directions would give noticeably different RPM results as well. This leads to the question of whether averaging all of the measurements would be the best aggregate here, or averaging per direction and then reporting the smaller of the two.

Or simply switch to measuring during bidirectionally saturating load (potentially after first deducing the number of required flows per unidirectional test, and then using those numbers during the actual delay-data-collection step).

Add conversion-table for RPM to latency

To ease understanding, we should add a conversion table that makes it more intuitive to the reader what RPM really means. E.g.:

Latency (ms)   Responsiveness (RPM)
           2                  30000
           5                  12000
          10                   6000
          20                   3000
          50                   1200
         100                    600

(suggested by Michael R. Davis)
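The table follows directly from the definition of RPM as round-trips per minute, i.e. RPM = 60000 / round-trip time in milliseconds:

```python
# Round-trips Per Minute <-> round-trip latency in milliseconds.
# 60000 ms per minute divided by the round-trip time gives the number
# of round-trips that fit in one minute.

def latency_ms_to_rpm(latency_ms):
    return 60000 / latency_ms

def rpm_to_latency_ms(rpm):
    return 60000 / rpm
```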

RPM should not be considered the only metric

From Erik Auerswald:

On Wed, Aug 18, 2021 at 03:01:42PM -0700, Christoph Paasch wrote:

On 08/15/21 - 15:39, Erik Auerswald wrote:

[...]
I do not think RPM can replace all other metrics. This is, in a way,
mentioned in the introduction, where it is suggested to add RPM to
existing measurement platforms. As such I just want to point this out
more explicitly, but do not intend to diminish the RPM idea by this.
In short, I'd say it's complicated.

Yes, I fully agree that RPM is not the only metric. It is one among
many. If there is a sentiment in our document that sounds like "RPM
is the only that matters", please let me know where so we can reword
the text.

Regarding just this, in section 3 (Goals), item 3 (User-friendliness),
the I-D states that '[u]sers commonly look for a single "score" of their
performance.' This can lead to the impression that RPM is intended to
provide this single score.

I do think that RPM seems more generally useful than either idle latency
or maximum bandwidth, but for a more technically minded audience, all
three provide useful information to get an impression of the usefulness
of a network for different applications.

Clean up terminology and better explain tradeoffs

We are using a confusing mix of "working conditions", "saturation", "fully loaded", "busy as possible", "typical day-to-day pattern",... This needs to be cleaned up!

We should probably start with a description of what "working conditions" means: why it is a near worst-case while still being realistic, what the tradeoffs are, and that it is going to evolve over time as typical day-to-day patterns evolve.

Feedback from Al Morton (@acmacm):

On Dec 17, 2021, at 3:50 PM, MORTON JR., AL [email protected] wrote:

Hi authors and ippm-chairs,

Thanks for writing this-up!

I took one pass through, and have the following comments during Adoption call for draft-cpaasch-ippm-responsiveness:

TL;DR:
Many previously undefined terms were used here, and a more direct description using the term “saturation” seems possible, IMO.

I fully agree with you. We are not very good at describing the "working conditions"/"saturation" we are aiming for, why we use these and what the right approach is.

The type of "working conditions" is crucial to what the measurement result will be. For example, flooding the network with UDP traffic will saturate the network pretty well, but it is far from a realistic working condition.

What we are aiming for is a near worst-case scenario that is still realistic. At least, that is the intention, and it may be good to have an open discussion about this.

IPPM has used a template for metric drafts, and use of the hierarchy of singleton, sample, and statistic metrics from RFC 2330 will help with clarity/answer many of my questions.

regards (I’m off-line for a while now, so enjoy the holidays),
Al

From the Abstract:

This document specifies the "RPM Test" for measuring responsiveness.
It uses common protocols and mechanisms to measure user experience
especially when the network is fully loaded ("responsiveness under
working conditions".) The measurement is expressed as "Round-trips
Per Minute" (RPM) and should be included with throughput (up and
down) and idle latency as critical indicators of network quality.

“fully loaded” and “working conditions” aren’t necessarily the same, to me. I’ll be looking for better definitions.

Agreed.

  1. Goals

The algorithm described here defines an RPM Test that serves as a
good proxy for user experience. This means:

  1. Today's Internet traffic primarily uses HTTP/2 over TLS. Thus,
    the algorithm should use that protocol.

    As a side note: other types of traffic are gaining in popularity
    (HTTP/3) and/or are already being used widely (RTP).

There are many measurement stability challenges when TCP is involved, see section 4 of RFC8337: https://datatracker.ietf.org/doc/html/rfc8337#section-4
RFC8337 intentionally broke the TCP control loop to make measurements in the face of these challenges.

Yes, we are aware of these kinds of stability challenges, and we actually sometimes observe them, as results can vary to some degree across different runs.

The goal is to get as close as possible to a stable measurement result, while still using the protocols the end-users use on a day-to-day basis.

4.1. Working Conditions

For the purpose of this methodology, typical "working conditions"
represent a state of the network in which the bottleneck node is
experiencing ingress and egress flows similar to those created by
humans in the typical day-to-day pattern.

While a single HTTP transaction might briefly put a network into
working conditions, making reliable measurements requires maintaining
the state over sufficient time.

The algorithm must also detect when the network is in a persistent
working condition, also called "saturation".

Desired properties of "working condition":

o Should not waste traffic, since the person may be paying for it

o Should finish within a short time to avoid impacting other people
on the same network, to avoid varying network conditions, and not
try the person's patience.

These seem like reasonable goals for the traffic that loads the network.
New terms needing definition were introduced:
“persistent working condition = saturation”,
which is different from
“ingress and egress flows similar to those created by humans in the typical day-to-day pattern”

Later in 4.1.1, terms like “saturate a path” and “fill the pipe” appear, and

The goal of the RPM Test is to keep the network as busy as possible
in a sustained and persistent way. It uses multiple TCP connections
and gradually adds more TCP flows until saturation is reached.

The terms “busy as possible”, and “typical day-to-day pattern”, or
“saturation” and “working conditions” indicate different load levels to me.

@@@@ Suggestion: I think it would help to simplify the terminology in this draft. You intend to measure a saturated path, so just say that. No “typical”, no “working conditions”, etc., in these early sections.

The sentence beginning “The goal...” should really appear in Section 3. Goals

Also, you have defined a measurement method in the sentence, “It uses...” above. This method of adding connections has been observed in other measurement systems, but it isn’t typical of user traffic, especially when each connection has an ~infinite amount of data to send during the test.

From your comments I see that we definitely need a longer explanation of the tradeoffs that are being made of measuring this near worst-case, but realistic scenario. As you correctly point out, we are mixing confusing and sometimes contradictory terms. This needs to be cleaned up.

Definition of content type

Please consider defining/recommending/citing the media type of HTTP message content aka the value in the "Content-Type" header. Under HTTP rules, a sender SHOULD set that.

For instance, the JSON config could use the "application/json" type. While the upload / download resources could be "application/octet-stream".

Calculating average probe measurement time

values. That is, it sums the five time values for each probe, and

In each probe there are five measurements (obviously this number is subject to change given #52 , etc). However, from the draft, it is not clear whether the "average probe duration" is

total = 0
for _ in range(N):
    total += (dns_time + tcp_time + tls_time + http_time_unsat + http_time_sat) / 5

average_probe_duration = total / N

or

total = 0
for _ in range(N):
    total += dns_time + tcp_time + tls_time + http_time_unsat + http_time_sat

average_probe_duration = total / N
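The ambiguity matters in practice, because the two readings differ by exactly a factor of five. A quick check with made-up probe timings (all names and numbers here are invented for illustration):

```python
# Each tuple is one probe's five measurements in seconds:
# (DNS, TCP, TLS, HTTP unsaturated, HTTP saturated).

probes = [
    (0.010, 0.020, 0.030, 0.040, 0.050),
    (0.012, 0.022, 0.032, 0.042, 0.052),
]
n = len(probes)

avg_of_means = sum(sum(p) / 5 for p in probes) / n   # first reading
avg_of_sums = sum(sum(p) for p in probes) / n        # second reading

# The second reading is always exactly five times the first.
assert abs(avg_of_sums - 5 * avg_of_means) < 1e-12
```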

Minor edits for algorithm description, sequential vs parallel, "load-generating"

Feedback from Al Morton (@acmacm):

4.1.2. Parallel vs Sequential Uplink and Downlink

...
To measure responsiveness under working conditions, the algorithm
must saturate both directions.

Bi-directional saturation is really atypical of usage. I don’t think the benefit of “more data” pays off.

...

However, a number of caveats come with measuring in parallel:

o Half-duplex links may not permit simultaneous uplink and downlink
traffic. This means the test might not saturate both directions
at once.

o Debuggability of the results becomes harder: During parallel
measurement it is impossible to differentiate whether the observed
latency happens in the uplink or the downlink direction.

o Consequently, the test should have an option for sequential
testing.

@@@@ Suggestion: IMO, tests/results with Downlink saturation OR Uplink saturation would be more straightforward, and can be understood by users (especially those who have tested in the past). Avoid the pitfalls and make Sequential testing the preferred option.

I tend to agree with you. We can leave "Parallel" as an interesting extension to the test that can expose other types of characteristics of the network.

4.1.3. Reaching saturation

The RPM Test gradually increases the number of TCP connections and
measures "goodput" - the sum of actual data transferred across all
connections in a unit of time. When the goodput stops increasing, it
means that saturation has been reached.
...

Filling buffers at the bottleneck depends on the congestion control
deployed on the sender side. Congestion control algorithms like BBR
may reach high throughput without causing queueing because the
bandwidth detection portion of BBR effectively seeks the bottleneck
capacity.

RPM Test clients and servers should use loss-based congestion
controls like Cubic to fill queues reliably.

With the evolution of Congestion control algorithms seeking to avoid filling buffers, does it make sense to require a full buffer at the bottleneck to achieve saturation?
In fact, the definition above, “When the goodput stops increasing,...” does not require full buffers; it requires maximizing a delivery rate measurement instead.

The above paragraph on BBR vs Cubic should probably be changed. With our goal being to measure "realistic" usage patterns, the recommendation should be to use the congestion control that is currently most widely deployed. If the majority of the Internet switches to BBR, then that's what should be measured.

In 4.1.4, the final steps of the algorithm were not clear to me:

  *  Else, network reached saturation for the current flow count.

@@@@ This wording implies it to be the final step, but there are further conditions to test.
Maybe this step is “Else, Candidate for stable saturation”?

Sounds good!

     +  If new flows added and for 4 seconds the moving average
        throughput did not change: network reached stable saturation

@@@@ Maybe:
+ If the 4 second moving average of "instantaneous aggregate goodput" with no new
flows added did not change
(defined as: moving average = "previous" moving average +/- 5%),
then the network reached stable saturation

That's better!


     +  Else, add four more flows

@@@ ??? and return to start?

Yes, the entire thing is evaluated every 1-second interval. I will make that explicit.
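The suggested stability test might look like this in code. This is an illustrative sketch: the ±5% tolerance comes from the suggestion above, and everything else (names, structure) is invented.

```python
# Hypothetical check, run once per one-second interval: compare the
# current 4-second moving average of aggregate goodput against the
# previous one; if it stayed within +/- 5%, declare stable saturation.

def reached_stable_saturation(moving_averages, tolerance=0.05):
    """moving_averages: 4-second moving averages of goodput, one per
    one-second interval, most recent last."""
    if len(moving_averages) < 2:
        return False
    prev, cur = moving_averages[-2], moving_averages[-1]
    return abs(cur - prev) <= tolerance * prev
```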

Finally, in 4.1.4, the Note explains:

Note: It is tempting to envision an initial base RTT measurement and
adjust the intervals as a function of that RTT. However, experiments
have shown that this makes the saturation detection extremely
unstable in low RTT environments. In the situation where the
"unloaded" RTT is in the single-digit millisecond range, yet the
network's RTT increases under load to more than a hundred
milliseconds, the intervals become much too low to accurately drive
the algorithm.

Well, TCP senders/control-loops are involved here, and likely play a
role in behavior categorized as “difficult to measure”.

By the time we get to

4.2. Measuring Responsiveness

Once the network is in a consistent working conditions, the RPM Test
must "probe" the network multiple times to measure its
responsiveness.

Each RPM Test probe measures:

You previously started at least four TCP connections with infinitely large files.
The “create connection” RPM probes establish additional connections, DNS, TCP, etc.
Is each new connection an RPM probe, or is the set of connection tests a single probe?
(Later we learn it is the set of connection tests.)
What if one of the set of connections fails/times-out?

I take it that the “load-bearing” connections are driving the path to saturation.
Maybe “load-generating connections” is more clear?

"load-generating" is indeed a better term!

Explain what different things an app can do when saturation is not reached within the time-limit

Feedback from Bjorn Mork:

Re-reading this, I realize that I came out too harsh here. Sorry.

I think it can be improved by replacing things like

"It is left to the implementation what to do when saturation is not
reached within that time-frame."

with a precise description of what to do.

There are two approaches here:

Either the implementation aborts and errors out, or the implementation nevertheless measures the responsiveness and presents the result either as a valid result or as a result with a low confidence score.

We can probably outline the options that an implementation has.

Fix formatting

Some formatting got messed up in the transition to the IETF draft format.

Namely, the "Final algorithm" section and the JSON in the protocol specification.

Preventing DDoS

From IPPM adoption call feedback:

How will the servers handle the possibility of a DDoS?

DNS client behaviour is under-defined

The spec says something along the lines of

The test measures the time needed to make a DNS request

That timing is going to depend on things such as whether it's plain Do53, DoT, DoH, DoQ, etc.; whether caching is in effect; and other DNS details I'm no expert in.

I think it would help to expand more on these points in the spec, otherwise you're subject to client implementation defaults or people with little understanding of these things not knowing what to do.

Change "tens of thousands of people"

Feedback from Dave Taht:

I wanted to offer a small correction to the current RPM abstract,
uploaded a few days ago:

https://www.ietf.org/id/draft-cpaasch-ippm-responsiveness-01.html

Millions. 3m at free alone had fq_codel on their DSL. comcast is..
however many docsis 3.1 modems have deployed (millions) ? eero and
everyone shipping qcom wifi chips is ? gfiber's deployment? the entire
3rd party firewall and router market (?) those are just the easier to
count numbers off the top of my head. Sure, in terms of postings and
individual interactions visible on the web in the latter case it
doesn't seem like a lot, but I figure the existing documentation and
user base is 1000x that....

so... millions.

If you want to also count in the upgrades in bandwidth in the last 10
years, another accomplishment, I think, was most of that bandwidth was
added without misguided increases in buffering, without our
fancy-schmancy algorithms needed, so that was many more millions. If
you want to think about server side, bbr, tsq, bql, packet pacing...
decreases.

So a small change in language perhaps?

"semi-solved for millions of people"?

Certainly wifi and lte suck the most of what's left to fix. I
generally say there's a billion routers left to upgrade.

One of the fantasy numbers that has kept me going for all these years
of living on top ramen was that if aqm and fq technologies I'd worked
on primarily... saved X users 1 second/day of waiting on the internet.
Say X is 10m today, that's 115 days/day and depending on how you want
to calculate that in terms of man years or time spent on the internet,
call it 400 man years per year. Not like any of us can go cash a check
on that karmic bank but, it's comforting.

(have a song: https://www.youtube.com/watch?v=HMG1wKpDT38 )

I tend to think that smashing latencies all through the stack affected
pretty much the whole internet's responsiveness - that and optimizing
web pages, cdns, etc, etc, also saved people a lot of time on waiting
on the internet. And along the way we made webrtc go from postage
stamp 2 frames per second in 2012 to all of civilization managing to
cope with working from home during covid. Imagine, covid-2012?

I've never come up with a number for annoying people less...

Blocking ads is still effective for saving time however, another
annoyance that's cropped up in the last few years is the teaser
paragraph and then
the demand to turn off advertising on a per site basis. I wish there
was a plugin for a browser that blocked content from paywall demanding
sites.
I'm glad I can pay google/pandora/netflix 10 bucks a month for
streaming services without ads.

Anyway, just the deployed aqm/fq solutions alone are in the 10s of
millions, IMHO. Just working so well for those using them that they
never noticed.

--
Fixing Starlink's Latencies: https://www.youtube.com/watch?v=c9gLo6Xrwgw

Dave Täht CEO, TekLibre, LLC


Rpm mailing list
[email protected]
https://lists.bufferbloat.net/listinfo/rpm

Better recommendation for "congestion control"

In PR #60, Stuart suggested adding a comment about using L4S. However, the goal of that part of the text is to advocate using "whatever configuration this service is currently using for all the content", since we want to measure the user experience. (See my comment in reply to Stuart.)

We should make this more explicit.

"1-byte response" -> Capture some considerations about Header Section sizes

The probe requests are for 1 byte of message content.

An HTTP header section is a complete list of header fields in an HTTP message. In HTTP/2 we need a mandatory :status pseudo-header field, and most servers will add a bunch of other response headers like Date, Server etc. All of these stand a chance at dwarfing your response size. HPACK (or QPACK) is likely to affect the on-wire size of these messages. QPACK could even cause the content to be blocked.

All of these things can interfere with the timing of the HTTP response that we are trying to measure. And they stand to have a lot of variation. I think the specification would do well to add considerations related to header sections, fields and compression.
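To make the size disparity concrete, here is a sketch with hypothetical but typical response header fields (the field values are illustrative, not from the draft). The sizes are uncompressed; HPACK/QPACK will shrink them on the wire, but the point stands:

```python
# Hypothetical response header section for a 1-byte probe response.
headers = {
    ":status": "200",
    "date": "Mon, 01 Jan 2024 00:00:00 GMT",
    "server": "example-server/1.0",
    "content-type": "application/octet-stream",
    "content-length": "1",
}

# Uncompressed name + value bytes, ignoring framing overhead.
header_bytes = sum(len(name) + len(value) for name, value in headers.items())
content_bytes = 1
```

Even this short header section is roughly two orders of magnitude larger than the message content it accompanies, so compression state and blocked QPACK streams can easily dominate the measured response time.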

Clarify definition of probe and RTT aggregation methodology (section 4.2)

In Section 4.2, "probe" is used without precise definition. In order to make it more clear what is being measured, it would be helpful if it is clear that a probe encompasses multiple RTTs (i.e., a probe encompasses

  1. The time it takes a new connection between client and server to do a DNS lookup, a TCP handshake, a TLS handshake, and a 1-byte HTTP GET.
  2. The time it takes to do a 1-byte HTTP GET on a load-generating connection.)

In addition, the technique for aggregating these different RTTs in a final value could be more clear in Section 4.2, too.

I am happy to suggest wording, if you think it would help!

Thanks!
Will
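To show the kind of wording that would help, here is one possible aggregation, sketched in Python. The equal weighting of the two probe types and the use of a plain mean are assumptions for illustration, not the draft's final algorithm:

```python
import statistics

def rpm_from_probes(foreign_probe_rtts, self_probe_rtts):
    """One possible aggregation of the two probe types.

    foreign_probe_rtts: seconds for DNS + TCP + TLS + 1-byte GET
        on fresh connections (item 1 above).
    self_probe_rtts: seconds for 1-byte GETs on load-generating
        connections (item 2 above).

    Assumption: the two means are weighted equally; the draft may
    specify a different weighting or a trimmed mean.
    """
    foreign = statistics.mean(foreign_probe_rtts)
    self_rtt = statistics.mean(self_probe_rtts)
    aggregate_latency = (foreign + self_rtt) / 2  # seconds per round trip
    return 60.0 / aggregate_latency  # round trips per minute

rpm = rpm_from_probes([0.120, 0.100], [0.050, 0.030])
```

With the sample RTTs above, this yields an aggregate latency of 75 ms, i.e. 800 RPM; whatever aggregation the draft settles on, spelling it out at this level of detail would resolve the ambiguity.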

The meaning of a negative change in throughput is underspecified

* If the moving average aggregate goodput at interval i is more

The specification says what to do if/when the current throughput moving average is 5% more than the previous throughput moving average, but does not define the meaning/behavior when the difference between the current and previous moving averages is negative. I see this type of behavior often, and it would be good to specify the meaning.
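One possible resolution, sketched below: treat any change below +5%, including a negative change (which often appears once the bottleneck buffer saturates), as "stable". This interpretation is an assumption for illustration, not the draft's text:

```python
def throughput_stable(prev_avg: float, cur_avg: float) -> bool:
    """Return True when the moving average has stopped growing.

    The draft defines the positive case (growth of 5% or more means
    "keep ramping up"). Assumption: any relative change below +5%,
    including a negative one, counts as stable.
    """
    if prev_avg <= 0:
        return False
    delta = (cur_avg - prev_avg) / prev_avg
    return delta < 0.05

growing = throughput_stable(100.0, 110.0)   # +10%: still growing
flat = throughput_stable(100.0, 103.0)      # +3%: stable
declining = throughput_stable(100.0, 95.0)  # -5%: also treated as stable
```

Whether a decline should instead abort the interval or restart the ramp-up is exactly the question the draft needs to answer.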

Server-based RPM measurement in scope or not?

From IPPM adoption call feedback:

  • Measuring the maximum RPM as seen from the test server is part of the test or it isn't? That's not clear to me by now.
    I'm favouring to make measuring the maximum RPM part of the test, this may help to scale results.

Indicate that this is a first attempt

The tone of the document may come across too much like "this is the ground truth". It would be good to tone it down in the Introduction and say that this is a first attempt at specifying how to measure bufferbloat.

Revisit size and content of the latency object

According to the current draft, latency is measured by fetching a 1-byte file from the server. I would like to propose an alternative: instead of a single byte, have the server report something like the current time of day (as micro- or nanoseconds since midnight UTC, to not leak uptime) and recommend that servers be time-synced using NTP (*). That opens the path to measuring OWDs instead of RTTs, allowing better dissection of the responsiveness of both legs of the network path between server and client (assuming the client's time is NTP-disciplined as well). IMHO this introduces a small but non-zero cost on server and client, but it should be immaterial to the actual network traversal, as many lower-layer protocols like Ethernet enforce a minimum packet size anyway (plus there is so much per-packet overhead involved that a payload of 16 bytes instead of one will not make a big dent in the actually transferred packet size).

*) maybe have the server encode in the response record if it returns "reliable" time or not....
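A minimal sketch of the proposal, assuming both clocks are NTP-disciplined (the encoding as microseconds since midnight UTC follows the suggestion above; nothing here is in the current draft):

```python
from datetime import datetime

MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

def micros_since_midnight_utc(now: datetime) -> int:
    """What the proposed server would return instead of a 1-byte body."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((now - midnight).total_seconds() * 1_000_000)

def one_way_delay_us(server_micros: int, client_micros: int) -> int:
    """Server-to-client OWD from the timestamp in the response.

    Assumes NTP-disciplined clocks on both ends; the modular
    arithmetic handles responses that cross midnight UTC.
    """
    return (client_micros - server_micros) % MICROS_PER_DAY

# Illustrative values: the server stamped the response 1500 us
# before the client's receive time.
owd = one_way_delay_us(3_600_000_000, 3_600_001_500)
```

A per-response flag indicating whether the server's time is "reliable" (as suggested in the footnote) could let clients fall back to RTT-only interpretation when it is not.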

Hidden Assumption: Bandwidth is fixed

From @GregMirsky on the IPPM adoption call feedback:

It seems like in the foundation of the methodology described in the draft lies the assumption that without adding new flows the available bandwidth is constant, does not change. While that is mostly the case, there are technologies that behave differently and may change bandwidth because of the outside conditions. Some of these behaviors of links with variable discrete bandwidth are discussed in, for example, RFC 8330 and RFC 8625.

Don't use references to apple.com

From Erik Auerswald:

Using "rpm.example" instead of "example.apple.com" would result in shorter
lines for the example JSON.

"host123.cdn.example" instead of "hostname123.cdnprovider.com" might be
a more appropriate example DNS name.

Specify how a *server* can signal HTTP2 vs HTTP (with or without encryption)

Given that we are interested in making it possible for low-power devices to accurately calculate an RPM without taxing their CPUs doing unnecessary encryption and the new piece of the specification describing how to attribute confidence, I think that we might need to specify how a server can communicate what it supports to a client.

I discussed with Jeroen something I thought might work:

A low-power client can always "power down" what the server supports: a server supports HTTP2 (with SSL)? Okay, the client can use that or it can choose to do a CPU-efficient HTTP test (knowing full well the confidence will be lower).

A low-power server can always "power down" what the client supports: the client will only go as "high" as what the server supports.

The answer might be as simple as advertising https vs http but I thought I would write down this issue regardless.
