Upload HTTP headers and document data separately (ouinet, closed)

equalitie commented on May 12, 2024

Upload HTTP headers and document data separately

Comments (6)

inetic commented on May 12, 2024

Seems there are three problems to address here:

1. Do we want to split the header and body into different pieces?
2. What do we want to be included in the "key"?
3. Do we want to hash the key?

I kind of see the point in (1), e.g. some app could store a raw cat.jpg picture in the cache and fetch it without the header. On the other hand, such an app could easily download it with the header, in the same manner as it would if it were downloading it over HTTP. Another argument against this could be that it (likely) takes longer to search the DHT for two items than for just one.

About (2), it's probably a very good idea to support multiple languages, but I think the number of variables in the key should be limited as much as possible. It's because with each such variable the number of keys per URL grows exponentially. This would (a) make the database huge and (b) would (also exponentially) decrease the number of peers in a swarm corresponding to any particular key.
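To make the growth concrete, here is a small arithmetic sketch. The dimensions and their values are invented for illustration (they are not a proposal from this thread), and peers are assumed to spread evenly over keys, which is an idealization.

```python
# Hypothetical header dimensions and the values each could take in a key.
dimensions = {
    "language": ["en", "fr", "es", "de"],
    "encoding": ["gzip", "identity"],
    "http_version": ["1.0", "1.1"],
}

def keys_per_url(dims):
    """Distinct keys for one URL: the product of the value counts."""
    count = 1
    for values in dims.values():
        count *= len(values)
    return count

def average_swarm_size(peers, dims):
    """Peers per key if peers were spread evenly over all keys."""
    return peers / keys_per_url(dims)

print(keys_per_url(dimensions))             # 4 * 2 * 2 = 16
print(average_swarm_size(160, dimensions))  # 10.0
```

Each extra dimension multiplies the key count and divides the expected swarm size by the same factor.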

For example, for the canonical request:

GET /foo.html HTTP/1.1
Accept: text/html,application/xhtml+xm…plication/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US;q=0.7,en;q=0.3
Host: example.com

Does it make sense to store that the requester asked for HTTP/1.1?
Are there modern browsers that don't support compression?
Do we care about the order of requester's language preference?
Do we want two separate swarms for en-US and en with k and l peers respectively, or do we prefer one big swarm with k+l peers?
Do we care about the 'q' parameters?
Given that we know that example.com/foo.html has mime type text/html, do we need to store that the client would have accepted other types as well?

Lastly, I think the main reason to hash the keys would be to obfuscate the content, so that it wouldn't be trivially possible to see what's stored in the database. On the other hand, it would still be possible by fetching the values from IPFS, or by guessing. I'm not totally convinced we need that, but I'm not against it either; perhaps we need to list more pros and cons and reach a consensus in the team. Also, there is still the chance that we'll be able to persuade the guys from IPFS to add salt to their mutable DHT data as BitTorrent does. In that case we wouldn't even need the database.

In the meantime, we could encode the keys in a way similar to what you suggested, by concatenating all the important variables into a string, separated by colons. E.g.:

GET:http://example.com/foo.html?bar=baz:en
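A minimal sketch of that encoding follows. The function names `make_key` and `parse_key` are hypothetical, and the choice of exactly three fields (method, URL, language) is only what the example key above shows. Since the URL itself contains colons, parsing splits off the first and last fields and assumes the language field never contains one.

```python
def make_key(method: str, url: str, language: str) -> str:
    """Join the fields that identify a cached response with colons."""
    return ":".join([method.upper(), url, language.lower()])

def parse_key(key: str):
    """Split a key back into (method, url, language).

    The URL may contain colons, so take the method from the left
    and the language from the right.
    """
    method, rest = key.split(":", 1)
    url, language = rest.rsplit(":", 1)
    return method, url, language

key = make_key("GET", "http://example.com/foo.html?bar=baz", "en")
print(key)             # GET:http://example.com/foo.html?bar=baz:en
print(parse_key(key))  # ('GET', 'http://example.com/foo.html?bar=baz', 'en')
```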

ivilata commented on May 12, 2024

Regarding (1), by uploading content as is we don't force other apps to use the HTTP-like (or any other) encoding. As for doubling the number of requests to the DHT, I'd expect its cost to be outweighed by the IPFS DHT queries needed to fetch the body. Also, if we have an actual browser with its own cache using the client, it may try actual HTTP HEAD requests beforehand, which may result in fewer and smaller transfers (just the head).

Regarding (2), I acknowledge that the devil is in the details and we should go over HTTP request headers to choose which ones to include and how to preprocess their values to avoid an explosion of keys while not discriminating against some users (e.g. language-wise). I just kept the three which I think may affect the actual content returned by the origin server, but careful review is needed. We cannot skip headers like Content-Type (or their values) since the client needs to know the canonical request before getting the answer from the server (e.g. to get content from the cache). Also, please note that when several requests map to the same content (e.g. because the server ignores or lacks most accepted languages), several clients which used different canonical requests may still provide the content to others, but only as long as head and body are stored separately (see point (1)).
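The sharing effect described in the last sentence can be sketched with a toy in-memory store. Everything here is an assumption for illustration: the `heads` and `bodies` dictionaries stand in for the database and for content-addressed storage, and SHA-256 stands in for whatever content identifier IPFS would assign.

```python
import hashlib

bodies = {}  # content hash -> body bytes (stand-in for content-addressed storage)
heads = {}   # canonical request key -> (response headers, content hash)

def store(key, headers, body):
    """Store head and body separately; identical bodies share one entry."""
    digest = hashlib.sha256(body).hexdigest()
    bodies[digest] = body
    heads[key] = (headers, digest)
    return digest

page = b"<html>hello</html>"
a = store("GET:http://example.com/foo.html:en", {"Content-Language": "en"}, page)
b = store("GET:http://example.com/foo.html:fr", {"Content-Language": "en"}, page)

print(a == b)       # True: both requests point at the same body
print(len(heads))   # 2
print(len(bodies))  # 1
```

Two different canonical requests yield two head entries but a single body, so anyone holding the body can serve it for either request.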

Regarding (3), hashing is especially useful in this specific proposal, since using the whole request as an index would make the db way bigger. Yes, it practically obfuscates the index of the db, but if the owner of an injector would like to know what it is storing, the injector could just as well store the request itself (locally or in IPFS, where it should ideally map to the key that appears in the index).
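As a rough illustration of the size argument, hashing turns a canonical request of any length into a fixed-size index entry. SHA-256 and the helper name `hashed_key` are assumptions made for this sketch; the thread does not settle on a hash function.

```python
import hashlib

def hashed_key(canonical_request: str) -> str:
    """Fixed-size index key derived from a canonical request."""
    return hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()

short = "GET /foo.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
long = short + "Accept-Language: en-US;q=0.7,en;q=0.3\r\n" * 10

# Both keys are 64 hex characters, however long the request is.
print(len(hashed_key(short)), len(hashed_key(long)))  # 64 64
```

The trade-off discussed above still applies: the index can no longer be read or sorted by a client without knowing the exact canonical request.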

inetic commented on May 12, 2024

I think I'm sold on (1).

> We cannot skip headers like Content-Type (or their values) since the client needs to know the canonical request before getting the answer from the server (e.g. to get content from the cache).

I'm not sure why, can you please elaborate on that?

> Also, please note that when several requests map to the same content (e.g. because the server ignores or lacks most accepted languages), several clients which used different canonical requests may still provide the content to others, but only as long as head and body are stored separately (see point (1)).

Sharing one content across multiple canonical requests is indeed nice, but having many swarms per URL would still make it likely that users wouldn't be able to fetch the headers because no one is in a DHT swarm corresponding to a particular request some user just made. No?

I still have some questions about the third paragraph, but they depend on the above.

ivilata commented on May 12, 2024

> > We cannot skip headers like Content-Type (or their values) since the client needs to know the canonical request before getting the answer from the server (e.g. to get content from the cache).
>
> I'm not sure why, can you please elaborate on that?

Oh sorry for the confusion, I wrote Content-Type (response header) where I should've written Accept (request header). So, if Accept-Language includes (say) French and English, we really cannot know what the Language of the response will be until we have the actual response from the server. Thus, the only way to reduce Accept-Language in the canonical request to the actual value of Language from the response would be for the injector to compute it post facto.

Now imagine that the server returned a page in English. If the same or a different client wanted to retrieve the page (with the same FR-EN preference) and wasn't able to reach the origin (nor the injector), it would have to canonicalize the request on its own. If that process just kept French (the first language preference) in Accept-Language, its pre facto version of the request wouldn't match the injector's post facto version, and the client wouldn't be able to retrieve a page which was actually in the distributed cache.

One solution to this is to have a clear canonicalization process which happens pre facto at the client side, so that an injector just checks that its format is ok and forwards it to the origin.

(Please note that the GET:http://example.com/foo.html?bar=baz:en encoding that you suggested is one such request canonicalization, a very compact one, with the canonical request itself used as the key instead of its hash.)
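A sketch of what a deterministic, client-side (pre facto) canonicalization of Accept-Language could look like is given below. The specific rules are assumptions chosen for illustration, not something agreed in this thread: order by q value, drop the q parameters, reduce each tag to its primary language, and remove duplicates and wildcards.

```python
def canonical_accept_language(header: str) -> str:
    """Reduce an Accept-Language value to ordered primary language tags."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag or tag == "*":
            continue
        q = 1.0  # HTTP default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        if q > 0:
            # Sort by descending q, then by original position.
            ranked.append((-q, position, tag.split("-")[0]))
    result = []
    for _, _, lang in sorted(ranked):
        if lang not in result:
            result.append(lang)
    return ",".join(result)

print(canonical_accept_language("en-US;q=0.7,en;q=0.3"))  # en
print(canonical_accept_language("en;q=0.3, fr;q=0.9"))    # fr,en
```

Because the output depends only on the request, a client and an injector applying the same rules arrive at the same canonical value without seeing the response.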

> > Also, please note that when several requests map to the same content (e.g. because the server ignores or lacks most accepted languages), several clients which used different canonical requests may still provide the content to others, but only as long as head and body are stored separately (see point (1)).
>
> Sharing one content across multiple canonical requests is indeed nice, but having many swarms per URL would still make it likely that users wouldn't be able to fetch the headers because no one is in a DHT swarm corresponding to a particular request some user just made. No?

That's the point where we must strike a balance between diversity (pushing for more/richer headers, e.g. keeping multiple entries in Accept-Language, possibly with country hints) and swarmability/privacy (pushing for fewer/simpler headers, e.g. having a single, language-only Accept-Language or even none). Maybe there could be a configurable "privacy level" (or its inverse) where a user could progressively toggle content customization options (language, encoding, etc.) to get different levels of privacy, customization or swarmability. It would affect which headers would be included in the request and their richness, but in any case the rules used to canonicalize these headers should be clear.
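One way to picture such a setting is a fixed table from level to the headers that may appear in the canonical request. The levels, the table `LEVEL_HEADERS` and the function name are all hypothetical; the sketch only shows that the rule can stay simple and explicit.

```python
# Hypothetical levels: a higher level means more customization and
# therefore less privacy and smaller swarms.
LEVEL_HEADERS = {
    0: [],
    1: ["accept-language"],
    2: ["accept-language", "accept-encoding"],
}

def headers_for_level(request_headers, level):
    """Keep only the request headers allowed at the given level."""
    lowered = {name.lower(): value for name, value in request_headers.items()}
    return {name: lowered[name] for name in LEVEL_HEADERS[level] if name in lowered}

request = {
    "Accept-Language": "fr,en",
    "Accept-Encoding": "gzip, deflate",
    "User-Agent": "example",
}
print(headers_for_level(request, 0))  # {}
print(headers_for_level(request, 1))  # {'accept-language': 'fr,en'}
```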

inetic commented on May 12, 2024

If we don't hash the canonized requests, then the client could apply its own logic for choosing a language.

E.g. say that the database contained entries:

GET:http://example.com/foo.html?bar=baz:en
GET:http://example.com/foo.html?bar=baz:fr
GET:http://example.com/foo.html?bar=baz:es

and the user sent a request with Accept-Language listing fr first and then en. The client would in that case be able to sort these entries and return the fr version first. Granted, this could get more complicated if we start to require sorting by multiple parameters, though I'd say it's still preferable to spend CPU cycles on users' devices than to reduce swarm sizes.
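The client-side sorting could look roughly like this, assuming unhashed keys in the colon-separated form shown above with the language as the last field. The function name `rank_keys` and the in-memory list standing in for the database are hypothetical.

```python
def rank_keys(keys, method, url, preferred_languages):
    """Return the keys for this method and URL, best language match first."""
    prefix = f"{method}:{url}:"
    rank = {lang: i for i, lang in enumerate(preferred_languages)}
    # Keep keys whose remainder after the prefix is a bare language field.
    matches = [k for k in keys
               if k.startswith(prefix) and ":" not in k[len(prefix):]]
    # Languages the user did not list sort after all listed ones.
    return sorted(matches, key=lambda k: rank.get(k[len(prefix):], len(rank)))

db = [
    "GET:http://example.com/foo.html?bar=baz:en",
    "GET:http://example.com/foo.html?bar=baz:fr",
    "GET:http://example.com/foo.html?bar=baz:es",
]
ranked = rank_keys(db, "GET", "http://example.com/foo.html?bar=baz", ["fr", "en"])
print(ranked[0])  # GET:http://example.com/foo.html?bar=baz:fr
```

This only works while keys are readable; with hashed keys the client would have to guess each candidate request and hash it.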

For the argument of hashing the canonized request to compress the keys, I think actually compressing the database before it's put into IPFS may be a better approach (or perhaps IPFS already does so?).

ivilata commented on May 12, 2024

This issue covers two different topics, now split into #8 (separation of response headers and body) and #9 (use of canonical requests instead of URLs as keys).
