As seen in <a class="issue-link js-issue-link" data-error-text="Failed to load title"

HPACK encoder doesn't handle duplicate headers appropriately about hyper HOT 10 CLOSED

python-hyper commented on August 17, 2024

HPACK encoder doesn't handle duplicate headers appropriately

from hyper.

Comments (10)

Lukasa commented on August 17, 2024

Right, the actual HPACK logic is easy. What's substantially harder is trying to represent the multiple header values on the far side. My current solution is a dictionary whose values are lists, but this causes huge pain in the tests.

That's probably ok, I'll see if the tests can be fixed tomorrow.

from hyper.

Lukasa commented on August 17, 2024

Actually, this isn't simple.

We need to handle two separate situations:

Headers with multiple values concatenated together using NULL bytes.
Identical headers sent multiple times.

The HPACK decoder uses Python sets internally, which allow for efficient differencing. Unfortunately, it doesn't allow us to easily have multiple headers in the same header set. Looks like this may require a more dramatic rearchitecting of the HPACK implementation.

from hyper.

Lukasa commented on August 17, 2024

Yeah, the rewrite here is pretty big, but the impact of this bug is low. I won't block releases behind this bug.

EDIT: This got discussed at length on the WG list. The short position is that hyper's HPACK implementation is totally wrong because of some subtleties in the spec.

from hyper.

Lukasa commented on August 17, 2024

So, given that this is a full redesign, may as well do my design out here. Open development, right?

Specific notes:

A header set is an unordered collection of key-value pairs, potentially containing duplicates.
The header table is a fixed-size ordered collection of key-value pairs, potentially containing duplicates.
The reference set is an unordered collection of references to elements in the header table, never containing duplicates.

Specific requirements:

Encoder will need to accept an iterable of tuples (the only Pythonic way to represent a header set).
Decoder will need to return an iterable of tuples (for exactly the same reason).

Notes:

Implementing the reference set in Python is a little awkward. If I want it to be a set, it needs to be a set of values that can be used to find the element in the header set. This can't be names or weakrefs, because they'll have the same hash collision problem that affects the current implementation. Indices into the header table is a terrible idea because we would need to increment/decrement them all each time the header table changes size, which is hugely expensive. What I need is a 'reference' type that hashes uniquely for each index, but compares equal. Easy enough to do, though it's pretty nasty.

from hyper.

Lukasa commented on August 17, 2024

Let's first consider decoding. The algorithm is as follows:

First, prepare a header set to be emitted. This set is initially empty.

Next, determine the representation of the element. This is one of: "indexed", "literal added to header table", "literal not added to header table".

If the representation is "indexed", check whether it's in the reference set (does anything in the reference set compare equal to the reference?). If it is, remove it from the reference set and do not emit the header. If it is not, and it references the static table: 1) emit the header; 2) insert the static entry at the start of the header table; 3) add a reference to the reference set. Otherwise, if it is not in the reference set and references the header table: 1) emit the header; 2) add a reference to the reference set.

If the representation is "literal not added to the header table", only emit the header.

If the representation is "literal added to the header table": 1) emit the header; 2) insert the header at the beginning of the header table; 3) add a reference to the reference set.

Finally, everything in the reference set that has not been emitted as part of this process gets added to the header set.

Typically, this last step is the most tricky bit. The core problem is that we need a way to work out whether a header in the reference set has been emitted yet. Fundamentally, any header that was added to the reference set and remains in there as part of this procedure was emitted, and none of the others were. However, we need a way to keep track of which ones we emitted.

Some options for this:

Have the header set be a combination of literal key-value pairs and 'reference set references', then have the 'reference set references' be converted to literals upon completing processing. This is tricky, as it's possible the 'reference set references' are referencing things that are no longer in the header table, so they'd need to keep copies. This is finnicky.
Have the reference set be marked as to whether they were added in this round of processing. Painful performance characteristics here because one way or another we need to touch every reference every round of processing.
Have a temporary set (an actual set) of references that were added to the set in this round of processing. This can be differenced against the reference set to find the rest. Probably the nicest, and currently what I'm leaning towards.

from hyper.

Lukasa commented on August 17, 2024

Alright, so how does this 'reference' object work?

The constraints are as follows:

A 'reference' must evaluate equal to any other reference to a specific entry in the header table.
A 'reference' must evaluate not equal to an reference to a different entry in the header table, even if that entry evaluates equal to the entry referenced by the other reference (what a terrible sentence).
For use with sets, the hashing rules apply as above as well.

The obvious implementation here is to use object identity on the tuples.

from hyper.

Lukasa commented on August 17, 2024

Ok, so the decoder is done and seems like it works.

Encoder time. This is a little harder. Big problem is repeated headers. If we have a unique header (i.e. name-value pair) then either it's in the reference set and we don't encode it, or it's not and we do encode it (and add it to the reference set). If we have a header that's repeated, each instance following the first needs to be encoded as two headers, one removing from the reference set and the other re-adding it (and emitting it again).

Some examples:

If header a is present four times in the header set and is not present in the reference set, it needs to be encoded 7 times: once to originally put it in the reference set, then three more instances of removing and re-adding.
If header a is present four times in the header set and is present in the reference set, it needs to be encoded 8 times: four instances of removing and re-adding.

Encoding this in a way that doesn't special-case is tricky, and I just don't think I can do it in a way that I really like. The problem is that searching the header set for all repeated headers every single time seems wildly inefficient. Is it? Not sure.

from hyper.

Lukasa commented on August 17, 2024

It's not inefficient if we use the Counter data structure! This is an awesome brainwave. =D

from hyper.

Lukasa commented on August 17, 2024

I ended up going a totally different route with this, using the ability of HPACK to be streamed. It involved a pretty gnarly rewrite, the sum-total of which can be seen here.

This passes all my tests including the fixed up HPACK integration tests. Time for me to go back and re-open http2jp/hpack-test-case#13 to confirm that hyper now functions appropriately.

from hyper.

Lukasa commented on August 17, 2024

Definitely fixed.

from hyper.

HPACK encoder doesn't handle duplicate headers appropriately about hyper HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent