Comments (6)
Seems there are three problems to address here:
1. Do we want to split the header and body into different pieces?
2. What do we want to be included in the "key"?
3. Do we want to hash the key?
I kind of see the point in (1), e.g. some app could store a raw cat.jpg picture in the cache and fetch it without the header. On the other hand, such an app could just as easily download it with the header, the same way it would when downloading it over HTTP. Another argument against this could be that it (likely) takes longer to search the DHT for two items than for just one.
About (2), it's probably a very good idea to support multiple languages, but I think the number of variables in the key should be limited as much as possible. That's because the number of keys per URL grows exponentially with the number of such variables. This would (a) make the database huge and (b) decrease (also exponentially) the number of peers in the swarm corresponding to any particular key.
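To make that growth concrete, here is a rough back-of-the-envelope sketch. The function names and all the numbers are made-up assumptions for illustration, not measurements from ouinet, and it assumes the key variables are independent and peers spread evenly over keys.

```python
from math import prod

def keys_per_url(cardinalities):
    """Distinct keys one URL can map to, given how many values each
    key variable can take (variables assumed independent)."""
    return prod(cardinalities)

def average_swarm_size(peers, cardinalities):
    """Average peers per key if peers spread evenly over all keys
    (a simplifying assumption)."""
    return peers / keys_per_url(cardinalities)

# Hypothetical: 1000 peers interested in one URL.
url_only = []                # the key is just the URL
with_language = [10]         # 10 language variants
with_three_vars = [10, 3, 4] # plus 3 encodings and 4 Accept variants

for variables in (url_only, with_language, with_three_vars):
    print(keys_per_url(variables), average_swarm_size(1000, variables))
```

Each extra variable multiplies the number of keys and divides the expected swarm size by the same factor.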
For example, for the canonical request:
GET /foo.html HTTP/1.1
Accept: text/html,application/xhtml+xm…plication/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US;q=0.7,en;q=0.3
Host: example.com
- Does it make sense to store that the requester asked for HTTP/1.1?
- Are there modern browsers that don't support compression?
- Do we care about the order of requester's language preference?
- Do we want two separate swarms for en-US and en with k and l peers respectively, or do we prefer one big swarm with k+l peers?
- Do we care about the 'q' parameters?
- Given that we know that example.com/foo.html has mime type text/html, do we need to store that the client would have accepted other types as well?
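One possible answer to the language questions above, as a minimal sketch: use the 'q' weights only for ordering and then drop them, and optionally merge regional variants into the base language. The function name, the `keep_region` flag and the choice to lowercase tags are assumptions for illustration, not something decided in this thread.

```python
def canonical_languages(accept_language, keep_region=False):
    """Reduce an Accept-Language value to an ordered list of language tags.

    - orders by the 'q' weight (default 1.0), then drops the weights
    - optionally strips region subtags, so 'en-US' and 'en' merge
    - removes duplicates, keeping the first occurrence
    """
    weighted = []
    for position, item in enumerate(accept_language.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag or tag == "*":
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        if q <= 0:
            continue
        if not keep_region:
            tag = tag.split("-")[0]
        weighted.append((-q, position, tag))

    seen, result = set(), []
    for _, _, tag in sorted(weighted):
        if tag not in seen:
            seen.add(tag)
            result.append(tag)
    return result

print(canonical_languages("en-US;q=0.7,en;q=0.3"))
print(canonical_languages("en-US;q=0.7,en;q=0.3", keep_region=True))
```

With region stripping, the `en-US` and `en` swarms from the example collapse into a single `en` swarm.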
Lastly, I think the main reason to hash the keys would be to obfuscate the content, so that it wouldn't be trivially possible to see what's stored in the database. On the other hand, it would still be possible to find out by fetching the values from IPFS, or by guessing. I'm not totally convinced we need that, but I'm not against it either; perhaps we need to list more pros and cons and reach a consensus in the team. Also, there is still the chance that we'll be able to persuade the guys from IPFS to add salt to their mutable DHT data as BitTorrent does. In that case we wouldn't even need the database.
In the meantime, we could encode the keys in a way similar to what you suggested, by concatenating all the important variables into a string, separated by colons. E.g.:
GET:http://example.com/foo.html?bar=baz:en
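A minimal sketch of that encoding, assuming exactly three variables (method, URL, language). The helper names `make_key` and `parse_key` are hypothetical. Since the URL itself contains colons, parsing takes the method from the first field and the language from the last one.

```python
def make_key(method, url, language):
    """Join the key variables with ':' as in the example above."""
    return ":".join([method.upper(), url, language.lower()])

def parse_key(key):
    """Split a key back into (method, url, language).

    The URL may contain ':' (e.g. 'http://'), so the method is taken
    from the left and the language from the right.
    """
    method, rest = key.split(":", 1)
    url, language = rest.rsplit(":", 1)
    return method, url, language

key = make_key("get", "http://example.com/foo.html?bar=baz", "en")
print(key)
print(parse_key(key))
```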
from ouinet.
Regarding (1), by uploading content as is we don't force other apps to use the HTTP-like (or any other) encoding. As for doubling the number of requests to the DHT, I'd expect its cost to be dwarfed by the IPFS DHT queries needed to fetch the body. Also, if an actual browser with its own cache is using the client, it may try actual HTTP HEAD requests beforehand, which may result in fewer and smaller transfers (just the head).
Regarding (2), I acknowledge that the devil is in the details and we should go over HTTP request headers to choose which ones to include and how to preprocess their values to avoid an explosion of keys while not discriminating against some users (e.g. language-wise). I just kept the three which I think may affect the actual content returned by the origin server, but careful review is needed. We cannot skip headers like Content-Type (or their values) since the client needs to know the canonical request before getting the answer from the server (e.g. to get content from the cache). Also, please note that when several requests map to the same content (e.g. because the server ignores or lacks most accepted languages), several clients which used different canonical requests may still provide the content to others, but only as long as head and body are stored separately (see point (1)).
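The sharing argument can be sketched with two in-memory tables standing in for the key index and for IPFS. Everything here (the `store`/`fetch` helpers, the dictionaries, the sample heads) is a hypothetical illustration, not ouinet's actual storage layout.

```python
import hashlib

bodies = {}  # content hash -> body bytes (stands in for IPFS)
heads = {}   # canonical request key -> (response head, body hash)

def store(key, head, body):
    """Store head and body separately; identical bodies share one entry."""
    body_id = hashlib.sha256(body).hexdigest()
    bodies[body_id] = body
    heads[key] = (head, body_id)
    return body_id

def fetch(key):
    """Two lookups: first the head for this key, then the shared body."""
    head, body_id = heads[key]
    return head, bodies[body_id]

# Two different canonical requests that the server answers with the same page.
page = b"<html>hello</html>"
id_en = store("GET:http://example.com/foo.html:en", "HTTP/1.1 200 OK", page)
id_fr = store("GET:http://example.com/foo.html:fr", "HTTP/1.1 200 OK", page)

print(id_en == id_fr, len(heads), len(bodies))
```

Both keys point at the same body hash, so anyone holding the body for one request can serve it for the other.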
Regarding (3), hashing is especially useful in this specific proposal, since using the whole request as an index would make the db much bigger. Yes, it practically obfuscates the index of the db, but if the owner of an injector would like to know what it is storing, the injector could just as well store the request itself (locally or in IPFS, where it should ideally map to the key which appears in the index).
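As a sketch of the size argument: hashing yields a fixed-length key no matter how long the canonical request is. SHA-256 and the `hashed_key` helper are assumptions chosen for illustration; the thread does not name a hash function.

```python
import hashlib

def hashed_key(canonical_request):
    """Fixed-length index key derived from the full canonical request."""
    return hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()

request = (
    "GET /foo.html HTTP/1.1\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "Accept-Language: en\r\n"
    "Host: example.com\r\n"
    "\r\n"
)

key = hashed_key(request)
print(len(request), len(key))

# An injector that wants to inspect its own index could keep the reverse map.
known_requests = {key: request}
```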
I think I'm sold on (1).
We cannot skip headers like Content-Type (or their values) since the client needs to know the canonical request before getting the answer from the server (e.g. to get content from the cache).
I'm not sure why, can you please elaborate on that?
Also, please note that when several requests map to the same content (e.g. because the server ignores or lacks most accepted languages), several clients which used different canonical requests may still provide the content to others, but only as long as head and body are stored separately (see point (1)).
Sharing one content across multiple canonical requests is indeed nice, but having many swarms per URL would still make it likely that users wouldn't be able to fetch the headers because no one is in a DHT swarm corresponding to a particular request some user just made. No?
I still have some questions about the third paragraph, but they depend on the above.
We cannot skip headers like Content-Type (or their values) since the client needs to know the canonical request before getting the answer from the server (e.g. to get content from the cache).
I'm not sure why, can you please elaborate on that?
Oh, sorry for the confusion: I wrote Content-Type (response header) where I should've written Accept (request header). So, if Accept-Language includes (say) French and English, we really cannot know what the Content-Language of the response will be until we have the actual response from the server. Thus, the only way to reduce Accept-Language in the canonical request to the actual value of Content-Language from the response would be for the injector to compute it post facto.
Now imagine that the server returned a page in English. If the same or a different client wanted to retrieve the page (with the same FR-EN preference) and couldn't reach the origin (nor the injector), and its own canonicalization just kept French (the first language preference) in Accept-Language, then its pre facto version of the request wouldn't match the injector's post facto version, and the client wouldn't be able to retrieve a page which was actually in the distributed cache.
One solution to this is to have a clear canonicalization process which happens pre facto at the client side, so that an injector just checks that its format is ok and forwards it to the origin.
(Please note that the GET:http://example.com/foo.html?bar=baz:en encoding that you suggested is this sort of request canonicalization, a very compact one, with the canonical request itself used as the key instead of its hash.)
Also, please note that when several requests map to the same content (e.g. because the server ignores or lacks most accepted languages), several clients which used different canonical requests may still provide the content to others, but only as long as head and body are stored separately (see point (1)).
Sharing one content across multiple canonical requests is indeed nice, but having many swarms per URL would still make it likely that users wouldn't be able to fetch the headers because no one is in a DHT swarm corresponding to a particular request some user just made. No?
That's the point where we must strike a balance between diversity (pushing for more/richer headers, e.g. keeping multiple entries in Accept-Language, possibly with country hints) and swarmability/privacy (pushing for fewer/simpler headers, e.g. having a single, language-only Accept-Language or even none). Maybe there could be a configurable "privacy level" (or its inverse) where a user could progressively toggle content customization options (language, encoding, etc.) to get different levels of privacy, customization or swarmability. It would affect which headers would be included in the request and their richness, but in any case the rules used to canonicalize these headers should be clear.
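To illustrate the "privacy level" idea, here is a minimal sketch. The three levels, the `LEVEL_HEADERS` table and the `headers_for_level` helper are all hypothetical; the actual set of headers and rules would need the careful review mentioned earlier.

```python
# Hypothetical levels: a higher level means more customization
# and therefore smaller swarms and less privacy.
LEVEL_HEADERS = {
    0: [],                                      # URL only
    1: ["accept-language"],                     # primary language only
    2: ["accept-language", "accept-encoding"],  # full values
}

def headers_for_level(request_headers, level):
    """Select and simplify the request headers that go into the key."""
    lowered = {name.lower(): value for name, value in request_headers.items()}
    selected = {}
    for name in LEVEL_HEADERS[level]:
        if name not in lowered:
            continue
        value = lowered[name]
        if name == "accept-language" and level == 1:
            # Keep only the first listed language, without region or weight.
            first = value.split(",")[0].split(";")[0]
            value = first.split("-")[0].strip().lower()
        selected[name] = value
    return selected

request_headers = {
    "Accept-Language": "en-US;q=0.7,en;q=0.3",
    "Accept-Encoding": "gzip, deflate",
}
for level in (0, 1, 2):
    print(level, headers_for_level(request_headers, level))
```

Whatever the levels end up being, every client at the same level must apply the same rules, otherwise their keys won't match.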
If we don't hash the canonicalized requests, then the client could apply its own logic for choosing a language.
E.g. say that the database contained entries:
GET:http://example.com/foo.html?bar=baz:en
GET:http://example.com/foo.html?bar=baz:fr
GET:http://example.com/foo.html?bar=baz:es
and the user would send a request with Accept-Language listing first fr and then en. The client would in that case be able to sort these entries and return the fr version first. Granted, this could get more complicated if we start to require sorting by multiple parameters, though I'd say it's still preferable to spend CPU cycles on the user's device than to reduce swarm sizes.
As for the argument of hashing the canonicalized request to compress the keys, I think actually compressing the database before it's put into IPFS may be a better approach (or perhaps IPFS already does so?).
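A quick way to check the compression idea, as a sketch: serialize a batch of unhashed keys and compress them with zlib. The newline-separated serialization, the `pack_index`/`unpack_index` helpers and the sample keys are assumptions; the real database format may differ.

```python
import zlib

def pack_index(keys):
    """Serialize keys one per line and compress them."""
    raw = "\n".join(keys).encode("utf-8")
    return zlib.compress(raw, 9)  # 9 = best compression

def unpack_index(blob):
    """Inverse of pack_index."""
    return zlib.decompress(blob).decode("utf-8").split("\n")

# Hypothetical index: many keys sharing long prefixes.
keys = [
    f"GET:http://example.com/page{i}.html?bar=baz:{lang}"
    for i in range(100)
    for lang in ("en", "fr", "es")
]
blob = pack_index(keys)
raw_size = len("\n".join(keys).encode("utf-8"))
print(raw_size, len(blob))
```

Because unhashed keys for the same site share long common prefixes, they compress well, which hashed keys would not.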
This issue covers two different topics, which have been split into #8 (separation of response headers and body) and #9 (use of canonical requests instead of URLs as keys).