
Compression dictionary transport

What is this?

This explainer outlines the benefits of compression dictionaries, details the different use cases for them, and then proposes a way to deliver such dictionaries to browsers to enable these use cases.

The HTTP headers and negotiation are specified in the IETF Draft document for Compression Dictionary Transport.

Summary

This proposal adds support for using designated previous responses as an external dictionary for HTTP responses for compression schemes that support external dictionaries (e.g. Brotli and Zstandard).

HTTP Content-Encoding is extended with new encoding types and support for allowing responses to be used as dictionaries for future requests. All actual header values and names are still TBD:

  • Server responds to a request for a cacheable resource with a Use-As-Dictionary: <options> response header.
  • The client will store a hash of the uncompressed response and the applicable match URL pattern for the resource with the cached response to identify it as a dictionary.
  • On future requests, the client will match a request against the available dictionary match URL patterns. If multiple patterns are matched, the most-specific match is used. If a dictionary is available for a given request, the client will add an appropriate compression scheme (e.g. br-d for shared brotli) to the Accept-Encoding request header as well as an Available-Dictionary: <sf-binary SHA-256> header with the hash of the best available dictionary. The hash is sent as a Structured Field Byte Sequence (base64-encoded, enclosed by colons). e.g. Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:.
  • If the server has a compressed version of the request URL with the matching dictionary, it serves the dictionary-compressed response with the applicable Content-Encoding: (e.g. br-d) and Vary: Accept-Encoding,Available-Dictionary.

For interop reasons, dictionary-based compression is only supported on secure contexts (similar to brotli compression).
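As a rough sketch of the hashing step above, a client could derive the Available-Dictionary value from the cached, uncompressed response like this (Python, illustrative only):

```python
import base64
import hashlib

def available_dictionary_value(uncompressed_dictionary: bytes) -> str:
    """Build the Available-Dictionary header value: the SHA-256 hash of the
    uncompressed dictionary, encoded as a Structured Field Byte Sequence
    (base64, enclosed by colons)."""
    digest = hashlib.sha256(uncompressed_dictionary).digest()
    return ":" + base64.b64encode(digest).decode("ascii") + ":"

# Yields a header value of the form:
#   Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:
print(available_dictionary_value(b"example dictionary contents"))
```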

There are also some browser-specific features independent of the transport compression:

  • For security and privacy reasons, there are CORS requirements (detailed below) for both the dictionary and compressed resource.
  • In order to populate a dictionary for future use, a server can respond with a link tag or header to trigger an idle-time fetch specifically for a dictionary for future use. e.g. <link rel=dictionary href=[dictionary_url]>.

Background

What are compression dictionaries?

Compression dictionaries are bits of compressible content known ahead of time, which compression engines use to reduce the size of compressed content.

Because they are known ahead of time, the compression engine can refer to the content in the dictionary when representing the compressed content, reducing the size of the compressed payload. The decompression engine can then interpret the content based on that pre-defined knowledge.

Taken to the extreme, if the compressed content is identical to the dictionary, the entire delivered payload can be a few bytes referring to the dictionary.

Now, you may ask, if dictionaries are so awesome, then...

Why aren't browsers already using compression dictionaries?

To some extent, they are. The brotli compression scheme includes a built-in dictionary that was built to work reasonably well for HTML, CSS and JavaScript. Custom (shared) dictionaries have a more complicated history.

Chrome once supported a shared compression dictionary: when it was first released, it shipped with a dictionary compression method called SDCH (Shared-dictionary Compression over HTTP). That support was unshipped in 2016 due to complexities around the protocol’s implementation and specification, and the lack of an interoperability story.

SDCH enabled Chrome and Chromium-based browsers to create origin-specific dictionaries that were downloaded once per origin and enabled multiple pages to be compressed at significantly higher ratios. That's one use case for compression dictionaries we will call the "Shared dictionary" use case.

There's another major use case for shared dictionaries that was never supported by browsers - delta compression.

That use case would enable the browser to reuse past resources (e.g. your site's main JS v1.2) to compress future ones (e.g. main JS v1.3). But traditionally, this use case raised complexities around the browser's ability to coordinate its cache state with the server and agree on what the dictionary would be. It also raised issues with both sides having to store all past versions of each resource in order to be able to compress and decompress it.

The common thread is that the use of compression dictionaries had run into various complexities over the years which resulted in deployment issues.

This time will be different

A few things about this current proposal are different from past attempts, in ways we're hoping are meaningful:

  • CORS-based restrictions can ensure that public and private resources don't get mixed in ways that can leak user data.
  • Same-origin, path and destination-based matching would help us manage a "single possible dictionary per request" policy, which will minimize client-side cache fan-out.
  • Dictionaries must already be available on the client to be used (fetching of the dictionary is not in the critical path of a resource fetch).
  • Diff-caching on the server can simplify and enable the server-side deployment story.

Use cases

Compression types

There are two primary models for using shared dictionaries that are similar but differ in how the dictionary is fetched:

  • Delta compression - reusing past downloaded resources for compressing future updates of the same or similar resources.
  • Shared dictionary - a dedicated dictionary is downloaded out-of-band, and then used to compress and decompress resources on the page.

In both cases the client advertises the best-available dictionary that it has for a given request. If the server has a delta-compressed version of the resource, compressed with the advertised dictionary, it can just send that delta-compressed diff. It can also use that advertised dictionary (if available) to dynamically compress the resource.

With the Delta compression use case, a previously-downloaded version of the resource is available to use for future requests as a dictionary. For example, with a JavaScript file, v1 of the file may be in the browser's cache and available for use as a dictionary to use when fetching v2 so only the difference between the two needs to be transmitted.

In the Shared dictionary use case, the dictionary is a purpose-built dictionary that is fetched using a <link> tag and can be used for future requests that match the match URL pattern covered by the dictionary. For example, on a first visit to a site, the HTML response references a custom dictionary that should be used for document fetches for that origin. The dictionary is downloaded at some point by the browser and, on future navigations through the site, is advertised as being available for document requests that match the URL pattern that the dictionary applies to.

Risks

Security

The Shared Brotli draft does a good job describing the security risks. In summary:

  • CRIME and BREACH mean that both the resource being compressed and the dictionary itself can be considered readable by the document deploying them. That is Bad™ if any of them contains information that the document cannot already obtain by other means.
  • An out-of-band dictionary needs to be carefully examined to ensure that it wasn’t created using users’ private data, nor using content that’s user controlled.

Privacy

Dictionaries will need to be cached using a triple key (top-level site, nested context site, URL) similar to other cached resources (or any other partitioning scheme that’s good enough for cached resources and cookies from a privacy and security perspective). That’s not an issue for the delta compression use case, but can become a burden fast for the out-of-band dictionaries, as multiple nested contexts may need to download the same dictionary multiple times.

Note: Common payload caching may be useful in such cases.

There’s also the issue of users advertising resource versions in their cache to servers as part of the request. This already has a precedent in the form of cache validators (ETags, If-Modified-Since), so maybe that’s fine, given that the cache is partitioned.

Adverse performance effects

Downloading an out-of-band dictionary means that the site owner is making a certain bet regarding the number of future visits needed for the user to amortize that dictionary’s cost.

At worst, if the user never visits the site again until the dictionary’s lifetime expires, the user has paid the cost of downloading the dictionary with no benefits.

For some large and heavily trafficked sites, that case is rare. For others, it’s extremely common, and we should be wary of both the tools we’d be putting in developers’ hands, as well as the messaging we’re providing them regarding when to use them.

Proposal

Static resources flow

In this flow, we’re reusing static resources themselves as dictionaries that would be used to compress future updates of themselves, or similar resources.

  • example.com downloads example.com/large-module.wasm for the first time.
  • The response for example.com/large-module.wasm contains a Use-As-Dictionary: <options> response header. The options are a structured field dictionary that includes the ability to set a URL-matching pattern, matching fetch destination, and an opaque identifier. More details in the "Dictionary options header" section below.
  • The client saves the URL pattern, destination (if provided), ID and a SHA-256 hash of the resource with the cached resource.
    • For browser clients, the response must also be non-opaque in order to be used as a dictionary. Practically, this means the response is either same-origin as the document or is a cross-origin request with an Access-Control-Allow-Origin: response header that makes the response readable by the document.
  • The next time the browser fetches a resource from a URL that matches a pattern covered by a dictionary in cache and with a fetch destination that matches the provided destination, it includes an Available-Dictionary: request header, which lists a single hash (encoded as a Structured Field Byte Sequence).
    • The request is limited to specifying a single dictionary hash both to reduce the header overhead and limit the cardinality of the Available-Dictionary: request header (to limit variations in the Vary caches).
    • If there is an ID associated with the dictionary then it is sent in a separate Dictionary-ID request header.
    • Any new resource designated as a dictionary with the same URL-matching pattern overrides older ones. When sending requests, the browser would use the most specific match for the request to get its dictionary. Specificity is determined by the string length of the match pattern specified with the dictionary.
  • When the server gets a request with the Available-Dictionary header in it:
    • If the client sent a sec-fetch-mode: cors request header then the dictionary should be ignored unless the response will have an Access-Control-Allow-Origin: response header that includes the origin of the page the request was issued from (either * or a value matching the request's origin: or referer:).
    • The server can simply ignore the dictionary if it doesn't have a diff that corresponds to said dictionary. In that case the server can serve the response without delta compression.
    • If the server does have a corresponding diff, it can respond with that, indicating that as part of its Content-Encoding header as well as a Content-Dictionary response header with the hash of the dictionary that was used (must match the hash from the Available-Dictionary request header).
      • For example, if we're using shared brotli compression, the Accept-Encoding: deflate, gzip, br, br-d request would respond with Content-Encoding: br-d.
  • If the browser advertised a dictionary but then fails to successfully fetch it from its cache while the dictionary was used by the server, the resource request should fail.
  • For browser clients, the response must be non-opaque in order to be decompressed with a shared dictionary. Practically, this means the response is either same-origin as the document or is a cross-origin request with an Access-Control-Allow-Origin: response header that makes the response readable by the document.
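The most-specific-match rule in the flow above can be sketched as follows; fnmatch stands in here for real URLPattern matching, and the patterns and hashes are hypothetical:

```python
from fnmatch import fnmatchcase
from typing import Optional

# Each cached dictionary: (match pattern string, dictionary hash). Hypothetical data.
DICTIONARIES = [
    ("/app1/*", "hashA"),
    ("/app1/main*", "hashB"),
]

def best_dictionary(request_path: str, dictionaries) -> Optional[str]:
    """Pick the hash of the dictionary with the most specific matching pattern.
    Specificity is the string length of the match pattern. fnmatch is a
    simplified stand-in for URLPattern matching."""
    matches = [(pattern, h) for pattern, h in dictionaries
               if fnmatchcase(request_path, pattern)]
    if not matches:
        return None
    # Longest pattern string wins.
    return max(matches, key=lambda m: len(m[0]))[1]

print(best_dictionary("/app1/main_2.js", DICTIONARIES))  # both patterns match; the longer one wins
```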

Dynamic resources flow

  • Shared dictionary is declared ahead of time and then downloaded out of band using a Link: header on the document response or <link> HTML tag with a rel=dictionary type.
    • The dictionary resource will be downloaded with CORS in “omit” mode to discourage including user-specific private data in the dictionary, since its data will be readable without credentials.
    • It will be downloaded with “idle” priority, once the site is actually idle.
    • Browsers may decide to not download it when they suspect that the user is paying for bandwidth, or when used by sites that are not likely to amortize the dictionary costs (e.g. sites that the user isn’t visiting frequently enough).
    • Browsers may decide to not use a shared dictionary if it contains hints that its contents are not public (e.g. Cache-Control: private headers).
  • The dictionary response must include the Use-As-Dictionary: <options> header, appropriate cache lifetime headers and will be used for future requests using the same process as the Static resources flow.
    • For browser clients, the response must also be non-opaque in order to be used as a dictionary. Practically, this means the response is either same-origin as the document or is a cross-origin request with an Access-Control-Allow-Origin: response header that makes the response readable by the document.

Dictionary options header

The Use-As-Dictionary: response header is a structured field dictionary that allows for setting multiple options and for future expansion. The supported options and defaults are:

  • match - URL-matching pattern for the dictionary to apply to. Required. This is a patternString for the URLPattern URLPattern(patternString, baseURL) constructor where the baseURL is the URL of the request and where support for regexp tokens is disabled. URLPattern allows for absolute or relative URLs. e.g. /app1/main* will match https://www.example.com/app1/main_12345.js, and a relative main* in a response for https://www.example.com/app1/main_1.js will match https://www.example.com/app1/main.xyz.js. Dictionaries will only match requests from the same origin as the dictionary.
  • match-dest - An optional Structured Field Inner List of string values of matching request destinations. The default value is an empty list (()) which will match all request destinations.
  • id - An optional server-provided dictionary ID string. The string is opaque to the client and echoed back to the server in a Dictionary-ID request header when the dictionary matches an outbound request. The default value is an empty string ("").

For example: use-as-dictionary: match="/app1/main*", match-dest=("script"), id="xxx" would specify matching on a path prefix of /app1/main for script requests and to send Dictionary-ID: "xxx" for any requests that match the dictionary.
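A minimal, illustrative parser for the simple header forms shown above (a real implementation should use a proper RFC 8941 Structured Fields parser rather than this sketch):

```python
import re

def parse_use_as_dictionary(value: str) -> dict:
    """Naive parse of a Use-As-Dictionary structured field dictionary.
    Handles the simple forms used in this document only."""
    options = {"match": None, "match-dest": [], "id": ""}
    # Split on commas that are not inside an inner list's parentheses.
    for part in re.split(r",\s*(?![^()]*\))", value):
        key, _, raw = part.strip().partition("=")
        if key in ("match", "id"):
            options[key] = raw.strip('"')
        elif key == "match-dest":
            inner = raw.strip("()")
            options[key] = [t.strip('"') for t in inner.split()] if inner else []
    return options

print(parse_use_as_dictionary('match="/app1/main*", match-dest=("script"), id="xxx"'))
```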

Compression algorithms

The dictionary negotiation is independent of the compression algorithm that is used for compressing the HTTP response and is designed to support any compression scheme that supports using external compression dictionaries. Currently that includes Brotli and Zstandard but it is not limited to those (and depends on what the client and server both support). It is likely that, in the future, content-specific compression schemes that handle delta compression better may be built (e.g. code-aware Wasm compression).

The compression algorithm negotiation uses the regular Accept-Encoding:/Content-Encoding: negotiation that is used for non-dictionary compression. It is important that new names are registered with the HTTP Content Coding Registry for algorithms that use an external dictionary to prevent situations where processing along the request flow may attempt to decode a response using just the algorithm without being dictionary-aware. That way, if anything in the request flow needs to operate on the decoded content, it can either be made aware of the dictionary-based compression or it can modify the Accept-Encoding: request header to only support schemes that it is aware of (already common practice).

The examples in this document will use br-d for dictionary-based Brotli compression but the actual algorithm(s) negotiated could be anything that the client supports.
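An intermediary that is not dictionary-aware could apply the Accept-Encoding-filtering option described above roughly like this (the set of known codings here is an assumption for the sketch):

```python
# Codings this hypothetical, non-dictionary-aware middlebox can decode.
KNOWN_ENCODINGS = {"gzip", "br", "deflate"}

def filter_accept_encoding(header: str, known=KNOWN_ENCODINGS) -> str:
    """Strip content codings this intermediary cannot decode (e.g. br-d) so
    the origin never sends a dictionary-compressed response through it."""
    kept = [token.strip() for token in header.split(",")
            if token.strip().split(";")[0] in known]
    return ", ".join(kept)

print(filter_accept_encoding("br-d, br, gzip"))  # the dictionary coding is removed
```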

Compression API

The compression API can also expose support for using caller-supplied dictionaries but that is out-of-scope for this proposal.

Websockets

Websocket support is out-of-scope for this proposal but there is nothing in the current dictionary negotiation that precludes websockets from being able to build dictionary-based compression (either by leveraging parts of what is provided here or building something separate).

Security and Privacy

Dictionary and Resource readability (CORS)

Since the contents of the dictionary and compressed resource are both effectively readable through side-channel attacks, this proposal makes it explicit and requires that both be CORS-readable from the document origin. The origin for the URL the dictionary was served from and the origin of the match pattern for URLs MUST be the same (i.e. the dictionary and compressed resource must both be from the same origin).

For dictionaries and resources that are same-origin as the document, no additional requirements exist as both are CORS-readable from the document context. For navigation requests, their resource is by definition same-origin as the document their response will eventually commit. As a result, the dictionaries that match their URL pattern are similarly same-origin.

For dictionaries and resources served from a different origin than the document, they must be CORS-readable from the document origin. e.g. Access-Control-Allow-Origin: <document origin or *>. This means that any cross-origin content that is fetched in no-cors mode by default must enable CORS-fetching (usually with the crossorigin attribute).

When sending a CORS request with an available dictionary, a browser should only include the Available-Dictionary: header if it is also sending the sec-fetch-mode: header so a CORS-readable decision can be made on the server before responding.

In order to prevent sending dictionary-compressed responses that the client will not be able to process, when a server receives a request with sec-fetch-mode: cors as well as an Available-Dictionary: header, it should only use the dictionary if the response includes an Access-Control-Allow-Origin: response header that covers the origin of the page the request was made from, either because Access-Control-Allow-Origin: * covers all origins or because Access-Control-Allow-Origin: includes the origin in the origin: or referer: request header. If there is no origin: or referer: request header and Access-Control-Allow-Origin: is not *, then the dictionary should not be used.
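The server-side decision above can be sketched as follows (simplified: Referer carries a full URL rather than a bare origin, so a real implementation would normalize it before comparing):

```python
def may_use_dictionary(request_headers: dict, acao: str) -> bool:
    """Decide whether a server may use the advertised dictionary for a
    request, per the CORS rules above. `acao` is the Access-Control-Allow-Origin
    value the response will carry ("" if none). Simplified sketch."""
    if "available-dictionary" not in request_headers:
        return False
    if request_headers.get("sec-fetch-mode") != "cors":
        return True  # e.g. same-origin or navigation requests
    if acao == "*":
        return True
    origin = request_headers.get("origin") or request_headers.get("referer")
    return bool(origin) and origin in acao.split()
```

For example, a CORS request with no origin: or referer: header and a non-wildcard ACAO value falls through to False, matching the rule above.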

To discourage encoding user-specific private information into the dictionaries, any out-of-band dictionaries fetched using a <link> will be uncredentialed fetches.

These protections against compressing opaque resources make CORB and ORB considerations unnecessary as they are specific to protecting opaque resources.

Fingerprinting

The existence of a dictionary is effectively a cookie for any requests that match it and should be treated as such:

  • Storage partitioning for dictionary resource metadata should be at least as restrictive as for cookies.
  • Dictionary entries (or at least the metadata) should be cleared any time cookies are cleared.

The existence of support for dictionary-based Accept-Encoding: has the potential to leak client state information if not applied consistently. If the browser supports dictionary-based content encodings, then they should always be advertised, independent of the current state of the feature. Specifically, this means that in any private browsing mode (Incognito in Chrome), dictionary-based algorithm support should still be advertised even if the dictionaries will not persist, so that the state of the private browsing mode is not exposed.

Triggering dictionary fetches

The explicit fetching of a dictionary through a <link rel=dictionary> tag or Link: header is functionally equivalent to <link rel=preload> with different priority and should be treated as such. This means that the Link: header is only effective for document navigation responses and cannot be used for subresource loads.

This prevents passive resources, like images, from using the dictionary fetch as a side-channel for sending information.

Cache/CDN considerations

Any caches between the server and the client will need to be able to support Vary on both Accept-Encoding and Available-Dictionary, otherwise the responses will be either corrupt (in the case of serving a dictionary-compressed resource with the wrong dictionary) or ineffective (serving a non-dictionary-compressed resource when dictionary compression was possible).

Any middle-boxes in the request flow will also need to support the dictionary-compressed content-encoding, either by passing it through unmodified or by managing the appropriate dictionaries and compressed resources.
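A cache honoring Vary for these responses might key its entries roughly like this, so that a response compressed with one dictionary is never served to a client holding a different one:

```python
def cache_key(url: str, request_headers: dict, vary: str) -> tuple:
    """Build a cache key that incorporates every header named in Vary
    (here: Accept-Encoding and Available-Dictionary). Illustrative sketch."""
    varied = tuple(
        (name.strip().lower(), request_headers.get(name.strip().lower(), ""))
        for name in vary.split(",")
    )
    return (url,) + varied

key = cache_key(
    "https://static.example.com/app/main.js/125",
    {"accept-encoding": "br-d, br, gzip",
     "available-dictionary": ":pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:"},
    "Accept-Encoding,Available-Dictionary",
)
```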

Examples

Bundled JavaScript on separate origin

In this example, www.example.com will use a bundle of application JavaScript that they serve from a separate static domain (static.example.com). The JavaScript files are versioned and have a long cache time, with the URL changing when a new version of the code is shipped.

On the initial visit to the site:

  • The browser loads https://www.example.com/ which contains <script src="//static.example.com/app/main.js/123" crossorigin> (where 123 is the build number of the code).
  • The browser requests https://static.example.com/app/main.js/123 with Accept-Encoding: br-d,br,gzip.
  • The server for static.example.com responds with the file as well as Use-As-Dictionary: match="/app/main.js*", Access-Control-Allow-Origin: https://www.example.com and Vary: Accept-Encoding,Available-Dictionary.
  • The browser caches the js file along with a SHA-256 hash of the decompressed file and the https://static.example.com/app/main.js* URL pattern.
sequenceDiagram
Browser->>www.example.com: GET /
www.example.com->>Browser: ...<script src="//static.example.com/app/main.js/123" crossorigin>...
Browser->>static.example.com: GET /app/main.js/123<br/>Accept-Encoding: br-d,br,gzip
static.example.com->>Browser: Use-As-Dictionary: match="/app/main.js*"<br/>Access-Control-Allow-Origin: https://www.example.com<br/>Vary: Accept-Encoding,Available-Dictionary

At build time, the site developer creates delta-compressed versions of main.js using previous builds as dictionaries, storing the delta-compressed version along with the SHA-256 hash of the dictionary used (e.g. as main.js.<hash>.br-d).

On a future visit to the site after the application code has changed:

  • The browser loads https://www.example.com/ which contains <script src="//static.example.com/app/main.js/125" crossorigin>.
  • The browser matches the https://static.example.com/app/main.js/125 request with the https://static.example.com/app/main.js* URL pattern of the previous dictionary response that is in cache and requests https://static.example.com/app/main.js/125 with Accept-Encoding: br-d,br,gzip, sec-fetch-mode: cors and Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:. For this example, the hash value from the header would need to be re-encoded as a filesystem-safe version of the hash before looking for the file (base64-decode the header value and then hex-encode the hash).
  • The server for static.example.com matches the URL and hash with the pre-compressed artifact from the build and responds with it and Content-Encoding: br-d, Access-Control-Allow-Origin: https://www.example.com, Vary: Accept-Encoding,Available-Dictionary, and Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=: response headers.

It could have also included a new Use-As-Dictionary: match="/app/main.js*" response header to have the new version of the file replace the old one as the dictionary to use for future requests for the path but that is not a requirement for the existing dictionary to have been used.
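The re-encoding described above (base64 header value to a hex-encoded artifact name, using the main.js.<hash>.br-d naming scheme from this example) can be sketched as:

```python
import base64

def artifact_name(base_name: str, available_dictionary: str) -> str:
    """Turn an Available-Dictionary header value into a filesystem-safe,
    hex-encoded name for the pre-compressed build artifact."""
    b64 = available_dictionary.strip(":")
    hex_hash = base64.b64decode(b64).hex()
    return f"{base_name}.{hex_hash}.br-d"

print(artifact_name("main.js", ":pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:"))
```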

sequenceDiagram
Browser->>www.example.com: GET /
www.example.com->>Browser: ...<script src="//static.example.com/app/main.js/125" crossorigin>...
Browser->>static.example.com: GET /app/main.js/125<br/>Accept-Encoding: br-d,br,gzip<br/>sec-fetch-mode: cors<br/>Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:
static.example.com->>Browser: Content-Encoding: br-d<br/>Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:<br/>Access-Control-Allow-Origin: https://www.example.com<br/>Vary: Accept-Encoding,Available-Dictionary

Site-specific dictionary used for all document navigations in a part of the site

In this example, www.example.com has a custom-built dictionary that should be used for all navigation requests to /product.

On the initial visit to the site:

  • The browser loads https://www.example.com/ which contains <link rel=dictionary href="/dictionaries/product_v1.dat">.
  • At an idle time, the browser sends an uncredentialed fetch request for https://www.example.com/dictionaries/product_v1.dat.
  • The server for www.example.com responds with the dictionary contents as well as use-as-dictionary: match="/product/*", match-dest=("document"), id="product_v1" and appropriate caching headers.
  • The browser caches the dictionary file along with a SHA-256 hash of the decompressed file and the https://www.example.com/product/* URL pattern, the document destination and the product_v1 dictionary ID.
sequenceDiagram
Browser->>www.example.com: GET /
www.example.com->>Browser: ...<link rel=dictionary href="/dictionaries/product_v1.dat">...
Browser->>www.example.com: GET /dictionaries/product_v1.dat<br/>Accept-Encoding: br,gzip
www.example.com->>Browser: use-as-dictionary: match="/product/*", match-dest=("document"), id="product_v1"

At some point after the dictionary has been fetched, the user clicks on a link to https://www.example.com/product/myproduct:

  • The browser matches the /product/myproduct request with the https://www.example.com/product/* URL pattern of the previous dictionary request as well as the document request destination and requests https://www.example.com/product/myproduct with Accept-Encoding: br-d,br,gzip, Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=: and Dictionary-ID: "product_v1" request headers.
  • The server supports dynamically compressing responses using available dictionaries and has the dictionary with the same ID and hash available and responds with a brotli-compressed version of the response using the specified dictionary as well as Content-Encoding: br-d and Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=: response headers.
sequenceDiagram
Browser->>www.example.com: GET /product/myproduct<br/>Accept-Encoding: br-d,br,gzip<br/>Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:<br/>Dictionary-ID: "product_v1"
www.example.com->>Browser: Content-Encoding: br-d<br/>Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:

Changelog

These are the changes that have been made to the spec as it has progressed through various standards organizations and based on developer feedback during browser experiments.

Feb 2023

  • The Sec-Available-Dictionary request header changed to Available-Dictionary.
  • The value of the Available-Dictionary request header changed to be a Structured Field Byte Sequence (base-64 encoding of the dictionary hash, surrounded by colons) instead of a hex-encoded string.
  • The content encoding string for brotli with a dictionary changed from sbr to br-d.
  • The match field of the Use-As-Dictionary response header is now a URLPattern.
  • The expiration of the dictionary now uses the cache expiration of the dictionary resource instead of a separate expires.
  • The server can provide an id in the Use-As-Dictionary response header which is echoed in the Dictionary-ID request header by the client in future requests.
  • The server needs to send a Content-Dictionary response header with the hash of the dictionary used when compressing a response with a dictionary (must match the Available-Dictionary from the request).
  • match-dest was added to the Use-As-Dictionary response header to allow for matching on fetch destinations (e.g. match-dest="document" and have the dictionary only be used for document requests).

compression-dictionary-transport's People

Contributors

flano-yuki, horo-t, matt-koevort, pmeenan, tomvangoethem, yoavweiss


compression-dictionary-transport's Issues

No support for hash-based versioning

This somewhat falls under "open question 2" in the explainer, but I thought it's worth opening an issue to discuss this specific aspect.

The problem
It is very common for asset paths to include a hash of the asset's content.

  • usually, concatenated with the asset name itself in some form - e.g. cdn.mysite.com/assets/myscript_HASH.js
  • less commonly, preceding the path - e.g. cdn.mysite.com/assets/HASH/myscript.js

Most JS tooling/bundlers generate production assets with the hash concatenated to the asset name out of the box.

The benefit: assets with no changes between version n to n+1 keep the same hash, and load from the HTTP cache.

The proposed path/scoping rules mentioned in the explainer do not support this type of versioning.
I think this is bad:

  1. If only myscript.js/{version}-esque scoping rules are supported, then there's a mutually exclusive choice between cache-friendly hash-based versioning, and support for delta dictionaries.
    The tradeoff will be:
  • If you stick to hash-based versioning, you get "peak performance"[1] (load from disk) for cached assets and pay the full (compressed) price for assets that changed. This is the boat we're all in today because it's the only option.
  • If you move to number-based versioning, you'll always have to fetch from the network but it'll be minimal deltas.

Of course, there is no clear-cut "one is better than the other".
Factors such as code-splitting granularity, deployment cadence, and user demographic/perf distribution come to mind.

  2. On a meta-level (i.e. from y'all browser folks' point of view), it's possible that the aggregate performance impact will be net negative.
    Being an "opt-in" feature where devs will choose delta dictionaries over hash-based versioning doesn't guarantee a net benefit in the long run. (They might A/B test today, but circumstances change after half a year.)

I think Chromium might not be able to accurately measure the net impact even with an open A/B test origin trial due to selection bias of those who opt-in for the trial.
For a non-trial, there are many other factors that could affect performance over time. Looking at improved CWV for a short window of "before/after" might tell a lie in the long run and we'd never know.

Solution thoughts
I'm only here to complain..

Adding a wildcard anywhere in the path is a problem for the proposed scoping/pathing rules.
Is it ~better if it's only allowed for the slug (last segment)?
Maybe the slug can be a prefix by-definition? i.e. /myscript.js implicitly matches both /myscript.js.hash1, /myscript.js.hash2.
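The "slug is a prefix by definition" idea could be sketched roughly like this (hypothetical semantics, not part of the spec — the function name and matching rules are my own for illustration):

```python
def slug_prefix_match(pattern: str, path: str) -> bool:
    """Return True if `path` matches `pattern`, treating the pattern's
    final segment (the slug) as a prefix of the request's final segment."""
    p_parts = pattern.strip("/").split("/")
    r_parts = path.strip("/").split("/")
    if len(p_parts) != len(r_parts):
        return False
    # all leading segments must match exactly
    if p_parts[:-1] != r_parts[:-1]:
        return False
    # the slug only needs to be a prefix of the request's slug
    return r_parts[-1].startswith(p_parts[-1])
```

Under this rule, a dictionary scoped to /assets/myscript.js would match /assets/myscript.js.hash1 and /assets/myscript.js.hash2 without any wildcard syntax.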

[1] - https://simonhearne.com/2020/network-faster-than-cache/

Dictionary expiration/lifetime

If dictionaries end up scoped to a path and use some form of precedence, what are the mechanics for expiring a dictionary with more specificity for a less-specific one?

i.e., assuming dictionaries that cover 2 paths:

A - http://example.com/web/products/
B - http://example.com/

If a client has both dictionaries but a site decides to unify on a single global dictionary (B), how is dictionary A replaced? Some possibilities come to mind:

  1. When a dictionary with a path is fetched, the response can indicate if it overrides all child paths (deleting any dictionaries for paths under it).
  2. Just use regular cache expiration rules, and dictionary A remains preferred until its cache lifetime expires.
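The two possibilities above could be sketched as follows (assumed semantics: dictionaries keyed by path prefix, longest matching prefix wins, and option 1's override evicts dictionaries scoped under the new dictionary's path — none of this is spec'd):

```python
def pick_dictionary(dicts: dict, url_path: str):
    """Return the most-specific (longest) registered prefix matching url_path."""
    matches = [p for p in dicts if url_path.startswith(p)]
    return max(matches, key=len, default=None)

def store_dictionary(dicts: dict, path: str, payload, override_children=False):
    """Store a dictionary; optionally evict dictionaries under its path (option 1)."""
    if override_children:
        for p in [p for p in dicts if p.startswith(path) and p != path]:
            del dicts[p]
    dicts[path] = payload
```

With option 2, A simply keeps winning the specificity contest for /web/products/ URLs until it expires; with option 1, storing B with the override flag removes A immediately.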

Consider support other Content-Encoding schemes

Hello. Could/should this specification be generalized to support its application to other compression schemes? Currently the README seems exclusively focused on Brotli, but wider support could help other standards.

For example there is the zstd compression scheme, here:
https://docs.google.com/document/d/1aDyUw4mAzRdLyZyXpVgWvO-eLpc4ERz7I_7VDIPo9Hc/edit . This too could use Dictionary support.

Concern over handling (or lack thereof) of dictionaries was one of the primary concerns cited in mozilla/standards-positions#105 for the defer status against the zstd compression scheme proposal. If this proposal could be generalized a bit, zstd and potentially other compression schemes would have a better chance of moving forward and helping users save CPU and bandwidth on the web.

Zstandard Interpretation of Dictionary

Zstandard can use both structured and raw content dictionaries (RFC 8878 sec. 5). When a buffer is presented to Zstandard to be used as a dictionary, it must be instructed how to interpret it. (If a properly formatted dictionary is used as a structured dictionary by the compressor and as a raw content dictionary by the decompressor, or vice versa, the reconstructed output will likely differ from the original content.)

The three options provided by zstd are:

  • Auto: See if the leading bytes match the magic for a Zstandard-formatted dictionary. If so, interpret it as a formatted dictionary. Otherwise, interpret it as a raw dictionary.
  • Raw: Ignore the header even if it looks like a structured dictionary. Use the buffer as raw content.
  • Full: Interpret the dictionary as a structured dictionary. Fail if it doesn't conform.

One option is to use the MIME type of the resource being used as a dictionary (as discussed in #44) to signal how it should be interpreted. But simpler might just be to use the auto-interpretation mechanism.
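The auto-interpretation rule amounts to a 4-byte magic-number check. A minimal sketch, using the dictionary magic number from RFC 8878 section 5 (0xEC30A437, stored little-endian):

```python
# First 4 bytes of a Zstandard-formatted (structured) dictionary.
ZSTD_DICT_MAGIC = (0xEC30A437).to_bytes(4, "little")  # b"\x37\xa4\x30\xec"

def classify_dictionary(buf: bytes) -> str:
    """Auto mode: structured if the magic matches, raw content otherwise."""
    return "structured" if buf[:4] == ZSTD_DICT_MAGIC else "raw"
```

Note that this is exactly why the choice matters: an arbitrary cached response (e.g. a previous version of a JS file) will almost never start with the magic, so auto mode would treat it as raw content, while a purpose-built dictionary fetched via a link would be detected as structured.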

Whatever we choose, the description of the zstd-d content-encoding should be updated to be explicit about this.

Allow for hash/versions in the middle of the path

A lot of build systems produce static resources that are prefixed by a build number and that doesn't work well with a prefix-only match. i.e. /app/123/main.js

We could allow for more flexible path matching with some form of wildcard support but that will complicate the "most-specific" matching logic and the ownership protections.

Using # for a wildcard (since it is already reserved as a client-side separator), we could allow for exact matching by default, prefix matching with a # at the end, or wildcard matching elsewhere in the path.

Some open questions:

  • How would the specificity be ranked if there are multiple matches? Overall match string length feels like it should be safe but would it lead to unexpected results?
  • Does this open up ownership/scope issues for shared-host environments? Maybe, at a minimum, require the first path component to be specified when using wildcards if the dictionary is not served from the root path?
  • Does it need to support multiple wildcards? At a minimum, allow for 1 in the middle and one at the end?
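To make the first open question concrete, here is a sketch of '#'-as-wildcard matching with specificity ranked by the length of the literal (non-wildcard) portion of the pattern (the syntax and ranking rule are assumptions from this issue, not spec'd):

```python
import re

def pattern_to_regex(pattern: str):
    """Translate a '#'-wildcard pattern to an anchored regex; '#' matches any run."""
    parts = pattern.split("#")
    return re.compile("^" + ".*".join(re.escape(p) for p in parts) + "$")

def most_specific(patterns, path):
    """Pick the matching pattern with the most literal characters."""
    hits = [p for p in patterns if pattern_to_regex(p).match(path)]
    return max(hits, key=lambda p: len(p.replace("#", "")), default=None)
```

Ranking by literal length at least gives deterministic results for the /app/123/main.js case, though it doesn't by itself answer the shared-host ownership question.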

Supporting "no-cors" mode requests seems problematic.

When a browser fetches a cross-origin script (e.g. <script src='https://static.example.com/script.js'> in https://www.example.com/index.html), it sends a request with the mode set to no-cors and the credentials set to include.

The current explainer allows this type of request for both registering as a dictionary and using a registered dictionary for its decompression, as long as the response header contains a valid Access-Control-Allow-Origin header (* or https://www.example.com).

However, if we follow the CORS check step in the Fetch spec, the response header must also contain the Access-Control-Allow-Credentials: true header, and the Access-Control-Allow-Origin header must be https://www.example.com. This means that the server must know the origin of the request, even though the request does not include an Origin header. (It may include a Referer header, but the Origin header and Referer header are conceptually different.)

For this reason, now I think supporting no-cors mode requests is problematic.
Maybe we should support only navigate, same-origin, and cors mode requests?

@pmeenan @yoavweiss
Do you have any thoughts?

i.e. vs e.g.

(i.e. as main.js..sbr)

Did you mean to use e.g. here?

Automatic retry fetching on cached dictionary read failure

Can we make the browser automatically retry the request without the sec-bikeshed-available-dictionary: header when it failed to read the cached dictionary?

The current explainer says:

In case the browser advertised a dictionary but then fails to successfully fetch it from its cache and the dictionary was used by the server, the resource request should be terminated

So the browser must check the existence of the cached dictionary on the disk before sending the request to reduce the risk of such failure.

If the automatic retry is allowed, the browser can speculatively send the request with sec-bikeshed-available-dictionary: header without checking the cached dictionary. I think this is very important for performance.

Same-origin check, redirects, and navigations

We should make sure the correct thing is done here, to avoid confused deputy attacks.

(This came up during TPAC 2023 and nobody present was immediately clear on whether this was handled correctly.)

Consider making sec-available-dictionary: value path-safe

As currently spec'd, the sec-bikeshed-available-dictionary: request header is a structured field dictionary that includes the hash type and base-64 encoded hash of the dictionary file.

i.e. sec-bikeshed-available-dictionary: sha-256=:d435Qo+nKZ+gLcUHn7GQtQ72hiBVAgqoLsZnZPiTGPk=:

On the server side, it would be extremely easy to check for and serve delta-encoded resources if the hash was part of the file name. i.e. /app/main.js.sbr.<hash>.

Extracting the hash from the SF value and mapping it to a hex string or other path-safe string can be done but is maybe a bit more complicated than it needs to be.

Since the length of the hash string varies with the hash type, we can send the hash without having to send the algorithm (we just need to make sure all supported algorithms generate different hash lengths). Additionally, Base64 includes / as one of its encoding characters, so it may be cleaner to just use hex encoding. Other higher-but-safe bases could be selected as well but may complicate tooling.

If we change it to use the base-16 encoded hash and send the raw hash as the value then the server or middle boxes can construct the file name directly by appending the header value to the end of the file path (though some care should be taken to make sure it isn't abused for a path attack and that the value appended only contains valid characters).

Consider options for Path of side-loaded dictionaries

For dictionaries loaded from a Link: header, it could be useful for the request that triggers the dictionary fetch to specify the scope of the dictionary, or for the allowable path for the dictionary to include the path from the original request; a document <link> tag could likewise provide other path options.

The path restrictions for dictionary use as they are currently written are for providing some level of ownership proof when setting the scope. The request that triggers the dictionary fetch and the document itself are also proof points and could allow for serving the dictionary from a different directory than the resources it is intended to be used with (still needs to be same-origin as the resources).

Copy edit issue

The README.md says:

On a future visit to the site after the application code has changed:

  • The browser loads https://www.example.com/ which contains <script src="//static.example.com/app/main.js/125">.
  • The browser matches the /app/main.js/125 request with the /app/main.js path of the previous response that is in cache and requests https://static.example.com/app/main.js/123 with Accept-Encoding: br, gzip, sbr, sec-fetch-mode: cors and sec-bikeshed-available-dictionary: <SHA-256 HASH>.
  • The server for static.example.com matches the URL and hash with the pre-compressed artifact from the build and responds with it and Content-Encoding: sbr, Access-Control-Allow-Origin: https://www.example.com, Vary: Accept-Encoding,sec-bikeshed-available-dictionary.

I believe it should say:

The browser matches the /app/main.js/125 request with the /app/main.js path of the previous response that is in cache and requests https://static.example.com/app/main.js/125 with Accept-Encoding: br, gzip, sbr, sec-fetch-mode: cors and sec-bikeshed-available-dictionary: <SHA-256 HASH>.

Add cross-origin compression protection

To add another layer of defense against cross-origin timing attacks, we should add language along the lines of:

When the server receives a sec-bikeshed-dictionary-available: sha256=:<hash>: request that includes an authority or origin as well as a referer request headers and where the referer is cross-origin, the dictionary may only be used for compression if the response headers includes an Access-Control-Allow-Origin: that includes the origin from the referer header.

It could be tweaked to use different sec-* headers to detect the cross-origin nature of the request but the requirement is to prevent servers from even sending responses using dictionary compression that should be opaque (and opening up the possibility of a timing attack).
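A minimal server-side sketch of the proposed rule (header names and parsing simplified; whether * should satisfy the check is my assumption, matching the explainer's CORS language):

```python
from urllib.parse import urlparse

def may_use_dictionary(request_headers: dict, response_headers: dict,
                       request_origin: str) -> bool:
    """Only allow dictionary compression for cross-origin requests when the
    response explicitly opts in via Access-Control-Allow-Origin."""
    referer = request_headers.get("referer")
    if not referer:
        return True  # no cross-origin signal available
    ref = urlparse(referer)
    referer_origin = f"{ref.scheme}://{ref.netloc}"
    if referer_origin == request_origin:
        return True  # same-origin request
    acao = response_headers.get("access-control-allow-origin", "")
    return acao == "*" or acao == referer_origin
```

The point is that the server refuses to even emit a dictionary-compressed response for requests that should be opaque, rather than relying on the client to discard it.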

What's the expected interaction model with Service Workers

Apologies if there are some specifics that I missed around this, but I'm curious how service workers will interact with this solution. It's clearly at a lower layer with no API for SWs, but is it expected that this process still happens when using a SW to make fetch requests, or should it be skipped? At the moment, typical browser caching layers are skipped with SW networking - for example, for responses sending an etag header, subsequent requests will not automatically get an if-none-match header; the SW needs to incorporate that.

URL matching should use URLPattern

This is the new foundation we're using for URL matching across the web platform. https://github.com/WICG/urlpattern

Introducing a new type of pattern is counterproductive to our efforts. (I can't find the details in the explainer, but it says "This is parsed as a URL that allows relative or absolute URLs as well as * wildcard expansion.", and #42 is also open, I guess.)

Escape character and ? for URL matching

In Chromium, we are using the MatchPattern() method to process the URL-matching.

The MatchPattern() method supports both ? and *. (? matches 0 or 1 character. And * matches 0 or more characters.) Also the backslash character (\) can be used as an escape character for * and ?.

The current proposal's dictionary URL matching supports neither \ nor ?.

I think ? is useful, but ? also appears in URLs as the query-string delimiter, so I think we should support both ? and the \ escape character.

Consider Websocket use case

Websockets themselves would fail a same-origin check for a dictionary delivered over HTTPS.

Would it be valuable (and safe) to allow for the path matching URL in the dictionary response to specify a wss:// scheme along with a match path (and explicitly restrict dictionaries to https, not just same-origin)? Then the dictionary-setting part of the spec could require that the match path be same-origin (and https) or the equivalent origin if wss was used as a scheme in the match path.

Something like:

  • Only process use-as-dictionary: response headers for requests with a https scheme.
  • Parse the path (or match if we change it) param as a URL.
    • Usually the path will be relative since the origin is not needed but could be used if specifying wss (and doesn't hurt otherwise, allowing regular URL parsing and classes to be used).
  • Allow https and wss schemes if the URL is fully-qualified.
  • Verify that the origin for the request URL and match URL are the same.
    • If the match URL uses a wss scheme, replace it with https when doing the origin comparison.

AFAIK, the actual compression should work fine for data delivered over a websocket as long as the encoding supports streamed compression (which is usually a requirement before adopting a new compression algorithm anyway).
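The validation steps above could be sketched like this (assumed semantics from this issue: wss:// match URLs compare equal to the https:// origin of the request):

```python
from urllib.parse import urlparse

def match_url_allowed(request_url: str, match_url: str) -> bool:
    """Accept a dictionary match URL if it is same-origin with the https
    request, treating wss:// as equivalent to https:// for the comparison."""
    req = urlparse(request_url)
    m = urlparse(match_url)
    if req.scheme != "https":
        return False  # only process use-as-dictionary on https responses
    if m.scheme not in ("https", "wss"):
        return False
    scheme = "https" if m.scheme == "wss" else m.scheme
    return (scheme, m.netloc) == (req.scheme, req.netloc)
```

Relative match paths would be resolved against the request URL before this check, so they inherit the https scheme and pass trivially.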

Provide mechanism for A/B testing

One of the things that came up during Chrome's origin trial is that A/B testing the effectiveness of compression dictionaries is difficult (and will become more difficult when it is no longer an origin trial).

There are 2 points in the serving flow where dictionary decisions need to be made:

  1. On the original request when the use-as-dictionary response header is sent to mark a response as an available dictionary.
  2. On a subsequent request when the client advertises available-dictionary and the server decides if it is going to serve a dictionary-compressed response.

In the case of the origin trial, there is a third gate, which is the setting of the origin trial token that enables the feature (without which the use-as-dictionary response header will be ignored). Outside of the origin trial there is no page-level gate for enabling it, and in both cases, once enabled, there is no way to turn it off for individual users.

For the dynamic use case where the server is running application logic anyway and the response is not coming from a cache, it is possible to use a cookie or some other mechanism to decide if dictionaries should be used, both on the initial request and subsequent requests where the available-dictionary request can just be ignored.

In the static file use case where resources are served from an edge cache and the cache keys the resources by URL, accept-encoding and available-dictionary, there is no granular way to control user populations. All clients for a resource will get the use-as-dictionary response header and all clients that advertise a given dictionary would get the dictionary-compressed response. The page does have SOME level of control but it would require using different URLs for the resources for the different populations.

Counter-points

While it would be useful for sites to be able to have granular control over the feature for measuring the effectiveness during roll-out, that level of control is not usually exposed for transport-level features.

  1. Other content encodings have the same restrictions, including brotli and ZStandard as they were rolled out.
  2. As mentioned above, it is difficult but not impossible to test by using different URLs for different populations (though this is more difficult if you don't control the page where the URLs are embedded).
  3. Allowing for a global enable/disable capability would potentially expose 1 bit of fingerprinting data across privacy boundaries.
  4. This is only for A/B testing, at a global level there are already controls that allow for the feature to not be used in case of a catastrophic problem (either by browser flags for the browser manufacturer to disable or by ignoring the available-dictionary request headers).

Define mechanism for advertising non-bytestream dictionary formats

Brotli and Zstandard both support raw byte streams as well as "optimized" dictionaries. Most of the work to this point has assumed raw byte streams but it would be beneficial to spec what the negotiation for a custom dictionary payload would look like so that backward-compatibility doesn't become a problem.

i.e. If a browser ships without support for extended brotli dictionaries or index-based Zstandard dictionaries and support for both is added at a later time, we need to make sure that older clients will not break by trying to use the new dictionary as a raw byte stream.

This could be done with different content-encodings for the different types of dictionaries but it would be better to not explode the set of encodings if it isn't necessary.

One possibility that comes to mind:

  1. Define separate content types for different types of stand-alone dictionaries, e.g. dictionary/raw, dictionary/brotli, etc.
  2. When stand-alone dictionaries are fetched using the link rel=dictionary mechanism, advertise the supported dictionary types in the Accept: header.
  3. When responding with the use-as-dictionary response header, add an optional field for type= for the type of dictionary that defaults to type=raw.
  4. When responding to a stand-alone dictionary fetch, respond with the proper mime type for the stand-alone dictionary in the content-type header.
  5. If a client doesn't recognize the type specified in the use-as-dictionary response header then it should not store the dictionary (independent of how it was fetched).
  6. (optional) if the client is processing a stand-alone dictionary fetch and the content-type response header is not a recognized dictionary type then it should not be stored as a dictionary.

Since custom dictionaries will only ever make sense to be fetched as stand-alone dictionaries, this should allow for backward-compatibility as new dictionary formats are created.

Clear Site Data for dictionaries

There is no way to delete registered dictionaries.

I think we should support it using Clear Site Data. The Clear Site Data spec defines following types.

  • "cache"
  • "cookies"
  • "storage"
  • "executionContexts"
  • "*"

I think Web developers will want to delete dictionaries without deleting other types ("cache", "cookies", "storage"). So we should introduce a new type "dictionaries".

Clear-Site-Data: "dictionaries"

Content-encoding may be fragile

Content-encoding is the most natural fit for the actual compression but it is likely to also cause adoption problems, at least in the short term.

It's not unusual for the serving path to consider content-encoding to be per-hop instead of end-to-end from the browser to the origin and unless the delta-encoding is being done by the leaf serving node, the sbr encoding is likely to be stripped out.

sequenceDiagram
Browser->>CDN: Accept-encoding: sbr, br, gzip
CDN->>Origin: Accept-encoding: gzip
Origin->>CDN: Content-encoding: gzip
CDN->>Browser: Content-encoding: br

If the actual encoding is done using other headers for negotiation but the content-type remains the same, then the compressed resources will be binary data and may cause other issues for middleboxes (i.e. with something like edge workers, they will be expecting to be processing text HTML, CSS or Javascript payloads). That could be workable for a given origin as long as they control the processing along their serving path.

One deployment model where it could work, but requires explicit support from both origins and CDNs is:

  • Origin manages the bikeshed-use-as-dictionary: <path> response header
  • CDN sees the header and stores the resource in a custom-dictionary cache with the appropriate hash
  • Browser requests with sec-bikeshed-available-dictionary: and Accept-Encoding: sbr, br, gzip
  • CDN checks cache for delta-compressed artifact for combination URL and dictionary
  • On miss, CDN checks for cached uncompressed URL (or fetches from origin on full miss)
  • If response is not already a delta-compressed artifact:
    • Check cache for requested dictionary
    • Compress response with requested dictionary (possibly as a background task for future requests)
    • Serve response to browser with Content-Encoding: sbr
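The CDN decision flow above, compressed into a sketch (all cache interfaces and the compression helper are hypothetical stand-ins):

```python
def compress_with_dict(body: bytes, dictionary: bytes) -> bytes:
    # Stand-in for real shared-brotli compression against a dictionary.
    return b"sbr:" + body

def serve(url, req_headers, delta_cache, raw_cache, dict_cache, fetch_origin):
    """Return (body, content_encoding) following the CDN flow above."""
    dict_hash = req_headers.get("sec-bikeshed-available-dictionary")
    # 1. Check for a precomputed delta artifact keyed by (URL, dictionary).
    if dict_hash and (url, dict_hash) in delta_cache:
        return delta_cache[(url, dict_hash)], "sbr"
    # 2. Fall back to the uncompressed resource (origin fetch on full miss).
    body = raw_cache.get(url) or fetch_origin(url)
    raw_cache[url] = body
    # 3. If we hold the advertised dictionary, compress and cache the delta.
    if dict_hash and dict_hash in dict_cache:
        compressed = compress_with_dict(body, dict_cache[dict_hash])
        delta_cache[(url, dict_hash)] = compressed
        return compressed, "sbr"
    return body, None
```

In a real deployment the compression in step 3 might run as a background task, serving the uncompressed body for the first request and the delta thereafter.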

Use case for TTL decoupled from cache freshness

For the case of dynamic HTML resources, I can see sites with a low number of returning visitors where it can be beneficial to e.g. reuse the HTML delivered as part of the current page for future versions of the same page, or for similar pages (e.g. reuse the HTML from one product page for another).

But very often, such HTML pages (especially with publishers and e-commerce) are served with very low caching freshness lifetime (if any), to ensure that typos or page errors won't live on in the browser's cache.

At the same time, it'd be great to be able to use these pages as a dictionary for a long while.

So it'd be great to be able to define both a Cache-Control max-age and a dictionary TTL, have the browser cache keep the resource around for the longer of the two, but only use it for the purpose for which it is still fresh.
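The proposed behavior is just two independent freshness windows over one cache entry. A minimal sketch (field names are assumptions, not spec'd):

```python
def retention_seconds(max_age: int, dictionary_ttl: int) -> int:
    """The cache keeps the entry for the longer of the two lifetimes."""
    return max(max_age, dictionary_ttl)

def usable_as(now: float, fetched_at: float, max_age: int, dictionary_ttl: int) -> dict:
    """Each use is gated by its own freshness window."""
    age = now - fetched_at
    return {
        "cached_response": age <= max_age,       # serve from cache?
        "dictionary": age <= dictionary_ttl,     # advertise as dictionary?
    }
```

E.g. an HTML page with max-age=60 but a week-long dictionary TTL stops being served from cache after a minute, yet can still be advertised in Available-Dictionary for days.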

sec-fetch-dest for dictionary fetch

In the current explainer, when the browser detects a link element <link rel=bikeshed-dictionary as=document href="/product/dictionary_v1.dat">, it fetches the dictionary with sec-fetch-dest: document header.

However, when the server receives the request, it may be confused whether this is an actual document request for navigation or a dictionary fetch request.
Therefore, I want to recommend introducing an appropriate sec-fetch-dest value to indicate that the request is a dictionary fetch.

Two possible ideas are:

  1. sec-fetch-dest: dictionary and sec-fetch-dict-dest: document
  2. sec-fetch-dest: dictionary-for-document

In the Chromium implementation, the "document" destination type is used to detect main resource requests. Therefore, introducing a new destination type would also be convenient for Chromium developers.

Exposing storage usage for dictionaries

I'm wondering whether we should expose the storage usage for the dictionaries.

Currently Storage API is providing a way to get the storage usage.

For example in Chromium,

JSON.stringify(await navigator.storage.estimate(), null, 2);

returns

{
  "quota": 296630877388,
  "usage": 75823910,
  "usageDetails": {
    "caches": 72813056,
    "indexedDB": 2877379,
    "serviceWorkerRegistrations": 133475
  }
}

Note: usageDetails was launched in Chromium. But it is still under spec discussion.

I have two questions:

  1. Is it OK to increase the usage for dictionaries?
  2. Is it OK to introduce dictionaries in usageDetails?

All dictionary resources should be readable from the page, so I don't think there is any risk of exposing them. But I'd love to hear other opinions.

Requires all caches in the path support Vary:

It's probably worth calling out that all caches in the serving path will need to support Vary: sec-bikeshed-available-dictionary so that the cache for a given URL doesn't get polluted with delta-compressed artifacts using different dictionaries.

Full Vary support for arbitrary headers isn't necessarily needed but it will be required for whatever the dictionary request header ends up being.

Not sure if it needs specific mentioning, but this is for the CDNs, Load balancers and web servers at a minimum, depending on what caches are in the path for a given origin.

I'm assuming it also needs to be limited to HTTPS (and maybe only HTTP/2 and 3) to reduce the risk of forward proxies or intercepting man-in-the-middle proxies from causing cache issues.

Hashes, algorithm agility, and overlap with HTTP digests.

The explainer describes that the client and server generate SHA-256 hashes and then use those to coordinate. Is there a specific reason why algorithm agility is not built in to the protocol? In simple terms, the ability to migrate to other algorithms as the security environment evolves.

The more I look at this aspect, the more it gets me thinking about whether the design has some overlap with the HTTP digests specification https://httpwg.org/http-extensions/draft-ietf-httpbis-digest-headers.html

The explainer hints at wanting to constrain the size of the sec-bikeshed-available-dictionary field value via

SHA-256 hashes are long. Their hex representation would be 64 bytes, and we can base64 them to be ~42 (I think). We can't afford to send many hashes for both performance and privacy reasons.

but I wonder how much this really matters in practice.

If we adopted a similar approach that digests use, you could make sec-bikeshed-available-dictionary be a Structured Fields dictionary that can convey 1 or more hash values alongside their indicated algorithm e.g.

sec-bikeshed-available-dictionary:
  sha-256=:d435Qo+nKZ+gLcUHn7GQtQ72hiBVAgqoLsZnZPiTGPk=:,
  sha-512=:YMAam51Jz/jOATT6/zvHrLVgOYTGFy1d6GJiOHTohq4yP+pgk4vf2aCs
  yRZOtw8MjkM7iw7yZ/WkppmM44T3qg==:

Even if you restrict to only adding one hash, you can still benefit from agility via sending the algorithm
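Parsing the digest-style header above into {algorithm: raw hash bytes} is straightforward. A simplified sketch (hand-rolled splitting; a real implementation would use a proper Structured Fields parser, since SF byte sequences can't contain commas this happens to be safe here):

```python
import base64

def parse_available_dictionary(value: str) -> dict:
    """Parse 'alg=:BASE64:, alg2=:BASE64:' into {alg: bytes}."""
    out = {}
    for member in value.split(","):
        algo, _, sf_bytes = member.strip().partition("=")
        out[algo] = base64.b64decode(sf_bytes.strip(":"))
    return out
```

The server can then pick the strongest algorithm it recognizes, which is exactly the agility property being argued for.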

A "perfect match" scenario

Hey!

Imagine an Edge based deployment of compression dictionaries, where the resources themselves are in a cloud-based storage.
Every time the CI runs, it adds a new resource to the pile, and calculates the diffs between it and N previous versions of that same resource. All of these diffs are stored in the same bucket in the cloud.

Now, whenever a resource is served, it uses a use-as-dictionary value that matches the various resource versions.
What happens when that same resource gets reloaded?

Its matches value definitely matches itself, so the request carries the resource's own SHA-256 hash in its sec-available-dictionary header. That kind of 0-sized diff does not exist in the cloud storage, because the CI didn't create diffs from the resource to itself. That means the request either fails or is retried without the dictionary (adding delay).

What's the right way to tackle such a scenario?

  • One option would be to provide some signal on the request that the SHA in sec-available-dictionary is of an exact match of the URL. That would enable the edge to do something smarter about this than to fail and retry.
  • Another option would be for such deployments to store a "diff" from the file to itself, and unify these flows without retries. At the same time, it feels odd to add such diffs.

I'd love thoughts on the right thing here for the protocol (and developer advice that will be derived from it).
