ukwa / ukwa-gsheets-utils Goto Github PK

View Code? Open in Web Editor NEW

5.0 10.0 3.0 21 KB

Add-On for Google Sheets to help those working with web archives.

License: Apache License 2.0

JavaScript 100.00%

google-sheets google-sheets-addon web-archives webarchives

ukwa-gsheets-utils's Introduction

ukwa-gsheets-utils

Add-On for Google Sheets to help those working with web archives.

This project is not under development at present, see this issue for more details.

Features

Custom functions that use the Memento API (specifically the TimeGate) to look up whether a given archive holds a given URL. Currenly supports:
- UK Web Archive via =WEBARCHIVE_STATUS_UKWA(<url>)
- UK Government Web Archive via =WEBARCHIVE_STATUS_UKGWA(<url>)
- Internet Archive via =WEBARCHIVE_STATUS_IA(<url>)

There's probably quite a lot more this could do, given these capabilities.

The main restriction here is the 20,000 calls/day default quota per user -- you should bare this in mind if attempting to check large numbers of URLs.

Getting Started

To install it, go here

To see an example of it in use, see this read-only Google Sheet.

Development

This repo is mostly a back-up of the Google Script version, and editing should likley happen there, while we use clasp to make this backup occasionally.

That's a view-only link to the Google Script view, you'll need to request permission to edit.

Ideas

Via Memento API:
- Add custom functions to support TNA and Parliamentary archives TimeGates?
- Add custom function to talk to the Memento Aggregator? Needs to return archival URL rather than just the status?
- Add functions to return first/last memento datestamps?
- Add functions to return counts? e.g. number of copies, or number of URLs starting like X?
- Add functions return the archived URL?
- Add button to replace URLs with archived URLs?
- Add button to colour cells holding URLs based on archival status of URL? (Could also be applied to Google Docs? e.g. get all links)
Via suitable custom API:
- Add a custom function to take a URL and report if there is a record in W3ACT that covers it? (and link to it?) Has it cleared the criteria for NPLD?
- Add a custom function that checks holdings, but also immediately enqueues a request for the URL. i.e. handle in-scope nominations quickly.

Deployment

Check which version is currently live (these are simple version numbers controlled by the deployment service)
Get local repo up to date
clasp login as main GMail account
clasp push
Go to https://script.google.com/ and open up the project 'UKWA Google Sheets Utilities'
Verify changes are present.
Click 'Publish > Deploy as Sheets add-on...'
Update version info etc. as needed.
Follow publication workflow until https://chrome.google.com/webstore/detail/dghejanopbolppcgmihfhnaedjfjoaik updates (or https://chrome.google.com/webstore/detail/ukwa-google-sheets-utilit/dghejanopbolppcgmihfhnaedjfjoaik)
Check version is updated.

If changes are made and tested via the Google Apps Script UI then we can use clasp pull to get the new version and commit it to git.

ukwa-gsheets-utils's People

Contributors

Stargazers

Watchers

Forkers

tpaswa kblumenthal uk-gov-mirror

ukwa-gsheets-utils's Issues

Feature Request: Show Memento Count for a URI-R

I would like to suggest adding the ability for Memento count to be queried and displayed in a spreadsheet cell for a corresponding URI-R. A rudimentary approach would be to scrape a TimeMap for something akin to /rel=".*?memento.*?"/. As mentioned on Twitter, MemGator might be of help here (even if only configured to query a single archive), as it exposes an HTTP response header with this value:

$ memgator server &
$ $ curl -I localhost:1208/timemap/link/http://www.bl.uk
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Link, Location, X-Memento-Count, X-Generator
Content-Type: application/link-format
X-Generator: MemGator:1.0-rc7
X-Memento-Count: 14088
Date: Tue, 19 Mar 2019 15:20:43 GMT

As an extended request, abstracting the "count" value (which can be problematic) to be capable of querying other facets of the mementos in the archive (e.g., how many are 200s or redirects) would make this project's novel use of spreadsheets even more useful for cases beyond its current capability.

Potential JS implementation source to adapt: https://github.com/jarofghosts/memento-client

Add function to return date of most recent UKWA capture of a URL

Our curators have suggested that a function that returns the most recent crawl date (rather than the crawl status) would be useful. e.g.

=WEBARCHIVE_LAST_MEMENTO_DATE_UKWA({url})

Switch to range operation given a suitable API

The limitation on the number of URL requests rapidly becomes a problem for our use case. Hence e.g. #2

As per the optimisation section of the custom function docs, this could be avoided if the function worked on ranges, and used this to batch up requests to a back-end API that could make e.g. 100 checks in a single call from the Google Sheet.

Somewhat fiddly, but it could work something like:

Clone the 2-D input to an output array, possibly via input.map and a lookup to a suitable Cache.
Iterate through the row and columns...
As we go, if the Cache didn't have a value, store the row and column indices and the URL together, appending to an array.
When the array reaches a size of e.g. 100 items (or we get to the end), we launch a single UrlFetch to an API that performs the lookup of the batch.
When this batch comes back, we populate the output matrix with the right answer, and any Cache too.
Return the complete output matrix.

First problem is we'd need to add a suitable batch lookup API.

Update authentication and associated policies to comply with Google's requirements

Quoting from here:

A while ago, we experimented with an add-on for Google Sheets that provided a way to query web archive holdings from an online spreadsheet (this COPTR link provides some additional information).

Unfortunately, this has become unavailable due to a particular kind of digital obsolescence: changes to Google’s policies. To make it work again, we have to modify our formal policies and documentation in a way that meets Google’s specific requirements. Realistically, due to other work taking priority, it’s likely to be some time before we are able to look at restoring it.

Google's add-on management and deployment systems are pretty complex, not just in terms of technology, but also administratively and in terms of policy requirements.

Here's a specific example (one that represents the point where I gave up trying to make this work!): Google requires a link to a privacy policy for them to review, and to present to users. Furthermore, they enforce the constraint that this policy must be at a URL that is the same host as is associated with the other services. The UK Web Archive operates from a separate domain to our parent organisation, but operates under our parent organisation's privacy policy. To make things work, we'd likely have to operate this as a service hosted out of the main BL domain, but that makes things massively more complex to manage internally.

(The underlying issue here seems to be that by operating a web service integrated with Sheets, we effectively have access to user data, so we have to meet certain criteria. We really do not want any private user data, but it's hard to find a way to integrate with these tools that means we guarantee we don't. But even just understanding these issues burns a lot of time.)

All this additional complexity means managing and maintaining this Google Sheets is beyond our capabilities at present. We are open to hearing about ways to cope with this, or perhaps ways of empowering users so they can do this kind of thing themselves e.g. via APIs or other standard integration patterns.

Feature Request: Provide a means of querying quota

For a large number of requests, as mentioned in the README, a user may hit the quota ceiling. It would be useful to allow this value to be queried and displayed in a cell to give more insight as to why bulk querying is failing at a certain point in a list of URIs.

Use a cache to help stay within limits

Use https://developers.google.com/apps-script/reference/cache to cache (more detailed) HTTP responses, as an alternative method to limit how quickly the quota's get used up. Focus on documenting that and the quota rather than batch APIs?