wicg / content-index Goto Github PK

Explainer and spec for the Content Indexing proposal

Home Page: https://wicg.github.io/content-index/spec/

License: Other

content-index's Introduction

Content Indexing

Written: 2019-04-29
Updated: 2019-05-08
Spec: Draft

High quality offline-enabled web content is not easily discoverable by users right now. They would have to know which websites work offline or install a PWA to be able to browse through content while offline. This is not a great user experience as there is no central point to discover available content. To address this, we propose a new API to allow developers to tell the browser about their specific content.

The content index allows websites to register their offline enabled content in the browser. This allows the browser to improve their offline capabilities and offer content to users to browse through while offline. This data could also be used to improve on-device search and augment browsing history.

Why do we need this?

Unrealiable or even unavailable network connections are very common on mobile devices. Even in connected cities like London, people have very limited connectivity while travelling underground. This API would allow browsers to show meaningful content to users in these situations and sites to increase user engagement.

Browser vendors are already looking for content relevant to the user, based on their browsing history, and make it available to be consumed offline. This is not ideal as it ignores the entity with the most knowledge of that content - the providers themselves. With this API they can highlight user specific, high quality content through the browser. Grouping content by a category (e.g. 'article', 'video', 'audio') allows an even richer experience as the browser is able to understand what kind of content is available and show a relevant UI.

Usage scenario 1

A news publisher has a website that uses service workers to allow its users to read news articles offline. Highly engaged users of this website may see a link to the site in their browsers home screen, but have no way of knowing if there are any new articles available to read beforehand. The news site can leverage web notifications for high priority breaking news articles, but should not use them for less important ones. By using this API, the news site can simply expose its content to the browser which can then surface that content to the user. Users can then browse available content in a central location, even while offline.

Usage scenario 2

A blog publishes regular podcasts to its users. It is available as a PWA and uses background fetch to download the audio files. An embedded media player then allows users to listen to these podcasts. With this API, these podcasts can be surfaced in the OS, allowing users to search for their downloaded content through native UI surfaces. This integration is only available with native apps at the moment.

Goals

Allow users to easily find content even while offline
Surface high quality content in native spaces (example: Android Slices)

Non-goals

Storage of the offline content itself
We expect developers to use more specialized APIs to store content (see Service Worker Caches or Web Storage).

Broader API landscape

Service Worker

We propose to add this API as an extension to Service Workers. This allows browsers to check if the given content is actually available offline. This also makes it easier for developers, as the entries get removed automatically if the service worker is unregistered (and therefore can not provide the offline content anymore).

CacheStorage API

The CacheStorage API allows websites to cache requests and, for example, use the cached content in the fetch event of a service worker. This makes it easy to ensure that some content is available offline, and is one of the steps to create high quality Progressive Web Apps.

Web Packaging

Web Packaging is a proposed API to bundle resources of a website together, so they can be shared offline. This also allows them to be securely distributed as a bundle. This API plays nicely together with Content Indexing, making it easier to ensure all necessary content is available offline.

Security and Privacy

Developers have control over which content they want to make available to the browser. The lifetime of an entry in the content index is comparable to that of Notifications, but with a less intrusive UX and more structured content. When adding personalized content, websites can simply remove entries on logout (and close all open Notifications). The storage required to store the entries of the index itself count towards the quota of the origin. This document contains additional answers of the security & privacy questionnaire.

Abuse Considerations

Browsers surfacing content can lead to increased exposure to a website, which would lead to increased profit for said website. This might cause malicious/spammy websites to aggressivley register offline-capable content in order to maximize their chances of exposure. The spec does not enforce a cap on how many content index entries can be registered, nor does it enforce that the content should be displayed. Therefore, it is important that browsers to choose the appropriate content to surface at the right time.

That remains to be an implementer choice, but some useful signals browsers can use are:

The user's engagement with the website
The freshness of the content
The user's engagement with other surfaced content from the website

Browsers can also apply a per-origin cap for the content to surface, and penalize websites that don't clean-up expired content and user-deleted content.

Examples

Please see this separate document for the proposed WebIDL additions.

General usage

// Add an article to the content index
await swRegistration.index.add({
  id: 'article-123',
  title: 'Article title',
  description: 'Amazing article about things!',
  category: 'article',
  icons: [
    {
      src: 'https://website.dev/img/article-123.png',
      sizes: '64x64',
      type: 'image/png',
    },
  ],
  url: 'https://website.dev/articles/123',
});

// Delete an entry from the content index
await swRegistration.index.delete('article-123');

// List all entries in the content index
const entries = await swRegistration.index.getAll();

Combined with other APIs

Sending breaking news articles via Push API allows websites to keep their users up to date. Adding these articles to the content index allows the browser to highlight them and make them discoverable later on. In this example we make use of the CacheStorage API to cache content resources, and the Indexed Database to store the structured content.

async function handlePush(data) {
  // Fetch additional data about pushed content
  const news = await fetch(`/api/news/${data.id}`);

  // Store content in database and cache resources
  await Promise.all([db.add(news), cache.add(news.icons[0].src)]);

  // Add content to content index
  if ('index' in self.registration) {
    await self.registration.index.add({
      id: news.id,
      title: news.title,
      description: news.description,
      category: 'article',
      icons: news.icons,
      url: `/news/${news.id}`,
    });
  }

  // Display a notification
  return self.registration.showNotification(news.title, {
    tag: news.id,
    body: news.description,
    icon: news.icons[0].src,
  });
}

// Handle web push event in service worker
self.addEventListener('push', event => event.waitUntil(handlePush(event.data.json())));

// Handle content deletion event in service worker.
// This is called when a user (or useragent) has deleted the content.
self.addEventListener('contentdelete', event => {
  event.waitUntil(Promise.all([
    // Delete cache & DB entries using `event.id`.
  ]));
});

When used together with the proposed Periodic Background Sync API, this allows websites to automatically sync fresh content and make it available to users.

// Add an article to the content index
function addArticleToIndex(article) {
  return self.registration.index.add({
    id: article.id,
    title: article.title,
    description: article.description,
    category: 'article',
    icons: article.icons,
    url: '/articles/' + article.id,
  });
}

// Fetch new content, cache it and add it to the content index
async function updateLatestNews() {
  const latestNews = await fetch('/latest-news');
  // TODO: cache content
  if ('index' in self.registration) {
    await Promise.all(latestNews.map(addArticleToIndex));
  }
}

// Handle periodic sync event in service worker
self.addEventListener('periodicsync', event => {
  if (event.registration.tag === 'get-latest-news') {
    event.waitUntil(updateLatestNews());
  }
});

Alternatives considered

Extending the Cache interface

One of the requirements for this API is that the exposed content is available offline. In the case of an article this could be implemented by simply adding the response to the Service Worker Cache. We could extend this API to specify that specific cached entries can be exposed to the user.

This would limit some use cases as new content would have to be served from a server. When using the content index, developers could generate and store content offline and then add it to the index, making it available without any network connection.

References

Many thanks to @beverloo and @jakearchibald for their ideas, input and discussion.

content-index's People

Stargazers

Watchers

Forkers

jeffposnick beaufortfrancois qls0ulp tomayac foolip global-localhost global19 global19-atlassian-net autokagami nena2030 seanpm2001

content-index's Issues

Review feedback

They would have to know which websites work offline or install a PWA to be able to browse through content while offline.

It isn't clear to me where the "or" operates here. Is it: They would have to know which websites

work offline
install a PWA to be able to browse through content while offline

If so, the second point doesn't seem right. Websites can't install themselves onto the homescreen.

there no entry points

Isn't the homescreen icon the entry point?

deleteArticleResources in example 1 references payload.id, but it should just be id.

Could just be my preference, but I find async functions easier to read than promise chains. Eg, for the push listener:

self.addEventListener('push', event => {
  const payload = event.data.json();

  // Fetch & store the article, then register it.
  event.waitUntil(async function() {
    const articlesCache = await caches.open('offline-articles');
    await articlesCache.add(`/article/${payload.id}`);
    await self.registration.index.add({
      id: payload.id,
      title: payload.title,
      description: payload.description,
      category: 'article',
      icons: payload.icons,
      launchUrl: `/article/${payload.id}`,
    });
    // Show a notification if urgent.
  }());
});

I'm finding "display" a little confusing. Since it requires an environment, it feels like this UI can't be shown any time other than during the initial add() call.

If something is displayed for each item added, will this lead to the user being bombed with UI? Eg, if I download "today's news stories", which compromises of 20 stories, will the user suddenly get 20 notification-like things?

It’s RECOMMENDED that the user agent fetch the icons when the content is being registered, and stored to be accessed when needed.

I would make this specific. Spec when the icon should be downloaded, and what should happen if that download fails.

The whole icon-downloading bit could be behind a 'may', but it should be clear how the UA should behave if it chooses to get the icon.

The UI SHOULD provide a way for the user to delete the underlying content exposed by the UI

This kinda sounds like we can guarantee that the underlying content will be deleted. From a UI point of view, a button may be provided which, when activated, runs delete a content index entry for entry.

If either of description’s id, title, description, or launchUrl

Nit: 'any' rather than 'either'.

The content categories seem a bit limiting, especially as they're required. Eg, a book isn't really a "homepage" or an "article". Is a photo gallery an "article"? What about a daily crossword? Etc etc.

What's the reasoning behind requiring a category?

add and getDescriptions don't seem to mirror each other in terms of naming. I'd expect addDescription and getDescriptions, or add and getAll.

Let launchUrl be the result of parsing description’s launchUrl

"parsing" is linking to the wrong thing.

As you mentioned, a new registration may be introduced with a narrower scope, that would receive navigations for content items 'owned' by another registration.

Calling add twice with a ContentDescription with the same ID is racy. You could solve this with some sort of queue, eg https://html.spec.whatwg.org/multipage/infrastructure.html#parallel-queue. Dunno if it matters.

Right now, the promise returned by add is delayed by calling "display entry", which includes fetching icons. Is that deliberate?

For activating the content index entry, look at https://w3c.github.io/ServiceWorker/#clients-openwindow - it shows how to create a top level browsing context and navigate it.

The delete event provides the ID of the resource, but should it provide the whole content description?

Due to race conditions, it's possible for:

User clicks delete on entry with ID 'foo'.
contentdelete event for 'foo' queued.
New item with ID 'foo' added.
contentdelete event for 'foo' fires.

Privacy implications

It bears explicitly noting that offline content registrations are like history in how they make revealing browsing history easy, rather than like cookies/etc which require work to leak browsing history. These should be cleared whenever history is cleared, not just when a total site data purge happens.

While this isn't exactly new with this spec - websites can track your browsing history and re-display it - it does make this sort of leakage a lot more likely, and in particular it's likely to be shown by the browser, rather than the website itself.

Why have both id and launchUrl?

What's the point of the id field? Why isn't (the absolutised value of) launchUrl a primary key? If it's not and I'm misunderstanding something, maybe an example should be shown where there are non-unique launchUrl values.

Should access to the index be restricted to top-level contexts?

In other words, should an iframe have the ability to register content?

Error handling for delete()

Currently there is no feedback if the id passed into the delete method doesn't exist in the Content Index.

I'm unsure of what type of exception this should throw, but doing something like:

try {
	registration.index.delete("something");
} catch (e) {
	console.log('Failed to remove content', e.message);
}

would be good programming™️

launchUrl is incorrectly capitalized

It should be launchURL, per https://w3ctag.github.io/design-principles/#casing-rules and https://url.spec.whatwg.org/#url-apis-elsewhere

Content Indexing & Media Feeds

There's some overlap between the Media Feeds API and Content Indexing.

The APIs seem to be solving different problems though. For one thing Content Indexing is more geared towards offline pages where the type can be hinted to the browsers, whereas Media Feeds is targeting video media with a more rigid type breakdown that varies from type-to-type.

I looked a bit into potentially merging the APIs, but there doesn't seem to be a nice way of doing that so that both the APIs' goals can be achieved. It also seems to me that this is a bit like CacheStorage and IndexedDb, which are somewhat similar, but both can simultaneously co-exist as they are different tools for different nails.

I'd like to get some further thoughts from people involved, since this is something that will likely come up in the standardization track.

@beccahughes - who's working on the Media Feeds API
@jeffposnick - who's aware about both APIs

Should developers provide a size for their registered content?

It would give users a better idea of how much space they would get back upon deleting an entry, and makes for better UI.

Does icon validity check trigger a SW's fetch handler?

This might be less of a bug against the spec and more of an implementation detail related to Chrome 80's current behavior.

registration.index.add() currently rejects if you pass in an icons[].src value that isn't a valid URL. I'm curious about how this determination is made, as it's making it difficult for me to accomplish something while trying out the Content Indexing API. (Sample code.)

I've got a PWA that handles incoming media sharing requests using the Web Share Target API on Android.

If a user shares an image to my PWA, that image gets saved locally using the Cache Storage API with a cache key URL that doesn't exist on a remote server—requests for that URL will only succeed if intercepted by my service worker's fetch event handler, which will bypass the network and return the cached media resource.

I am calling registration.index.add() inside of the same fetch handler that is responsible for handling the incoming POST request from the Web Share Target API. If I pass in an icons[].src value corresponding to a generic icon URL that exists on the remote server, everything works as expected. However, if I pass in a icons[].src value that refers to the newly-cached image (which, again, is only valid when intercepted by the service worker, and doesn't exist on the remote server), the add() call rejects due to an invalid icon.

I can probably refactor things so that the call to registration.index.add() happens outside of a fetch handler, if that's what's causing the failure. But my bigger question is whether the validity checks for icons is supposed to trigger a service worker's fetch handler at all—because if it doesn't, I've got a bigger issue to solve.

Verify that content is really available offline

What about sites that register things that are not actually available offline?

oncontentadded event

Hello. At the moment, if I have a complex service worker that wants to take notice of content being added, I can either

do that additional work everytime I call index.add()
use setInterval() or some such & call index.getAll() & diff, checking for new content

Since we have this index, & this index already can tell us when content goes away, it would also be nice to know when content is added. This would be a more normalized path than (1) or (2).

Activating a Content Index entry

The spec draft implies that a new tab should be opened when the UI is activated. Is that the right thing to do? Should this be more flexible to allow for browser-specific implementations?

Websites can give usefulness hints

The website may have a very good idea of what resources the user will want to access - e.g., the next chapter of the book they're reading, rather than some randomly selected one according to general-purpose browser heuristics.

Giving the website a way to indicate likely usefulness for registered and suggesting (but of course not requiring) the browser uses that as an input would likely be a good idea. Note that this should only affect the ordering of resources shown within a single registrable domain.

Concretely, this might look like a real-valued weight field being added.

This is also distinctly optional and not needed for a MVP, but I do think it could be quite useful.

Internationalisation of text fields

The text fields (title and description) are monolingual and in an unspecified language. Even if multiple languages + browser selection of which one to display would be overkill, the language should be indicable to allow proper display (as matters for, e.g., CJK).

Fire a contentdeleted SW event

If a user decides to delete the content, the browser should fire a contentdeleted event with the ID so developers can clean up the underlying content.

Some thought should be put into preventing malicious websites from re-adding the same content with a new ID within that event.

Why only one icon type/size? (look at web app manifest)

Renaming the spec

Some people have raised concerns that the name of the API is confusing. Is there a better name for this?

Offline Metadata has been thrown around.

Overlap with Web Manifests

What's the relation between this spec and Web Manifests?

There seems to be an awful lot of overlap, with the major differences being:

Web Manifests collect many resources together (e.g., an entire newspaper website), while Content Index applies to individual resources (e.g., single newspaper articles).
Web Manifests are discovered declaratively, while Content Index is used via scripting. However, both still need scripting to actually install the service worker for offline use. This allows Content Index to be kept in sync with the state of the offline cache, so that only items that are actually available (and useful) offline are shown to the user.
Web Manifests don't have a way to explicitly indicate that they're useful offline; while many will be, there will inevitably be some that strictly need the network (e.g., real-time multiplayer games).

Some sort of compare-and-contrast probably belongs in the spec, as well as notes on why extending Web Manifests (e.g., with an offline boolean field) wouldn't be as good.

I also think that it could lift some design directly from the Web Manifest spec, which has basically the right approach to responsive icons (#2) and, less perfectly but still better, to categories (#7) and i18n (#6).

Need more content metadata to able sort or filter it from the list

Talking about content, it will be great to have more content metadata to able show it properly in the list. For example created time and updated time will be useful to sort the content after we pull it from the list.

WordPress Post object will be a good reference for available content metadata.

How to index a Web Packaging package?

One of the items mentioned in the WICG proposal post is the interaction with Web Packaging. I'm very interested to know more about what this interaction would look like.

Let's start with this scenario:

I am a comic book site comic.yoyodyne.example. I have a PWA main site and distribute Web Package bundles with free comics.
My customers Jane and Karen are offline on a subway ride. Both have the PWA.
Karen sends Jane the latest Fantastic Overthrust Oscillators (FOO) bundle
How does Jane's PWA find out about the bundle & it's content?

At the moment, I have only one path I can think of for this to work. My thought is that the FOO Bundle could have it's own index.html that has embedded within it an explicit, hardcoded list of all the other files also in the Web Bundle. FOO Bundle would step through this hardcoded list of content, & index.add() each item. Once done, it could redirect to the main PWA url.

The constraint is that currently Karen or Jane would have to know about the Bundle's unique index.html file, & know how to navigate there to initialize this problem. It's also a kind of gross solution any how, because the index.html file has to have some JS with the hardcoded list of content that's in the bundle.

What I would love to see would be a way for content to more easily declare itself as indexed. As a secondary objective, HTTP Push has almost the same problem, where the page/sw have no way to know about PUSHed content. There, a similar approach is also hacked together: use SSE or use WebSockets or some such to tell the browser about the content you have just PUSHed to it, so you can fetch() then cache that content. That issue is whatwg/fetch#65.

It would be really lovely to have a way to get content into the content index effectively. I would love for my Comic web app to be able to find out about the Comics it is being sent. Content-Index seems like it could be a breakthrough in enabling that, but there's still an outstanding question to me of how to get Web Package bundles into the content-index.

Available from main thread without SW registeration

Why not directly available from the main thread without having a SW registered (like the Cache API)?

Categories, meaning & extensibility

I'm a bit concerned around the extensibility situation for categories.

First, in the spec itself, a comprehensive listing of categories that are guaranteed to be understood as a baseline would be a good idea.

The list that is there (in the IDL) is undescribed and opaque; what's the meaning of the different values, and when should they be used? E.G., I see audio and article are separate categories; which one does a spoken rendition of a magazine article fall into? That also seems quite different to a piece of music, but both could fall under audio. Maybe this should be using something like schema.org's ontology, taking a combination of all the (understood) categories.

Categories are inextensible without some way to indicate less-preferred-but-more-understood fallbacks. ARIA and some CSS properties have a first-understood-value-wins rule; alternatively, a combination of all the applicable understood values could be used, like RDFa and microformats.

(This point really is a quibble.) Using un-namespaced tokens for category means that extension is risky, as someone else might be using the values you add. If the values were URLs, anyone could mint new values without risking collisions.

Broken references in Content Index

While crawling Content Index, the following links to other specifications were detected as pointing to non-existing anchors, which should be fixed:

https://dom.spec.whatwg.org/#context-object

_{This issue was detected and reported semi-automatically by Strudy based on data collected in webref.}

Potential as a spam vector

This is a potential spam vector. Browsers should be prepared to deal with malicious websites registering a whole pile of bogus resources to fill up the offline content discovery interface. This isn't a problem with the spec per se since this is in the realm of implementation decisions.

I think something like being able to manually hide all registrations from a particular registrable domain will be needed, as well as whatever automatic anti-spam mechanisms get put in place - probably central registrable domain nolists.

I think this will warrant a note in the security section, if only so that it's on implementers' radars.

"This is not a great user experience." Why not? (Undiscoverability.) What's the impact to the developer? It'd be great to highlight why this is interesting for developers.
"...developers can expose fresh content..." - what does this mean?
"This allows the user to browser..." - no, it doesn't. It allows the browser to provide UI that then allows the user to browse.
"...and potentially search on-device for a specific article." - This sounds too hypothetical. I'd focus on mentioning some examples of where this data could go and what it enables: help users discover available content whilst offline, add extra details to browsing history, participate in on-device search, etc..

Why

This very much focuses on the offline case. I would also detail other discoverability reasons: particularly rich highlights might be worth mentioning.

Combined with other APIs

While the example with the Periodic Sync API is apt, by using it you're proposing something on top of a proposal, which is perceived as a risk. I would suggest using the Push API instead w/ pre-caching content on a breaking news notification.
Rather than using await in a loop, which makes for a sequential operation, store the promises in an array and then await for Promise.all(), making it a parallel operation.

There also are a few other things I think would be good to mention:

Privacy and security. While obvious, still good practice to state it.
Quality enforcement. Since this isn't providing a storage layer, how does it solve the problem of making content available online? (This proposal addresses the developer incentive.) What if a developer puts a million items in their cache?
Related: why does this interface live off the ServiceWorkerRegistration?
Reasoning as to why there are different categories of data.
Alternatives. Many browsers already suggest content to users, why is that not sufficient? Where do other hot projects like Web Packaging come in?

I'm sure you've seen the following document, but just in case:
https://github.com/w3ctag/w3ctag.github.io/blob/master/explainers.md