
OutbackCDX (nee tinycdxserver)

A RocksDB-based capture index (CDX) server for web archives.

Features:

  • Speaks both OpenWayback (XML) and PyWb (JSON) CDX protocols
  • Realtime, incremental updates
  • Compressed indexes (varint packing + snappy), typically 1/4 - 1/5 the size of CDX files.
  • Primary-secondary replication
  • Access control (experimental, see below)
  • CDXJ (experimental, requires index version 5)

Things it doesn't do (yet):

  • Sharding

Used in production at the National Library of Australia and British Library with 8-9 billion record indexes.

Installing

OutbackCDX requires JDK 8 or 11 on x86-64 Linux, Windows or MacOS (other platforms require a custom build of RocksDB JNI).

Pre-compiled jar packages are available from the releases page.

To build from source install Maven and then run:

mvn package

Usage

Run with:

java -Xmx512m -jar outbackcdx*.jar

Command line options:

Usage: java -jar outbackcdx.jar [options...]

  -b bindaddr           Bind to a particular IP address
  -c, --context-path url-prefix
                        Set a URL prefix for the application to be mounted under
  -d datadir            Directory to store index data under
  -i                    Inherit the server socket via STDIN (for use with systemd, inetd etc)
  -j jwks-url perm-path Use JSON Web Tokens for authorization
  -k url realm clientid Use a Keycloak server for authorization
  -m max-open-files     Limit the number of open .sst files to control memory usage
                        (default 396 based on system RAM and ulimit -n)
  --max-num-results N   Max number of records to scan to calculate numresults statistic in the XML protocol (default 10000)
  -p port               Local port to listen on
  -t count              Number of web server threads
  -r count              Cap on number of rocksdb records to scan to serve a single request
  -x                    Output CDX14 by default (instead of CDX11)
  -v                    Verbose logging
  -y file               Custom fuzzy match canonicalization YAML configuration file

Primary mode (runs as a replication target for downstream Secondaries)
  --replication-window interval      interval, in seconds, to delete replication history from disk.
                                     0 disables automatic deletion. History files can be deleted manually by
                                     POSTing a replication sequenceNumber to /<collection>/truncate_replication

Secondary mode (runs read-only; polls upstream server on 'collection-url' for changes)
  --primary collection-url           URL of collection on upstream primary to poll for changes
  --update-interval poll-interval    Polling frequency for upstream changes, in seconds. Default: 10
  --accept-writes                    Allow writes to this node, even though running as a secondary
  --batch-size                       Approximate max size (in bytes) per replication batch

The server supports multiple named indexes, stored as subdirectories. An index is created automatically the first time records are written to it.
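For example, a minimal sketch (the index names and data directory are arbitrary): starting the server and POSTing to two different paths creates two separate indexes:

$ java -Xmx512m -jar outbackcdx*.jar -d /var/lib/outbackcdx -p 8080
$ curl -X POST --data-binary @records.cdx http://localhost:8080/index1
$ curl -X POST --data-binary @records.cdx http://localhost:8080/index2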

Loading records

OutbackCDX does not include a CDX indexing tool for reading WARC or ARC files. Use the cdx-indexer scripts included with OpenWayback or PyWb.

You can load records into the index by POSTing them in the (11-field) CDX format Wayback uses:

$ cdx-indexer mycrawl.warc.gz > records.cdx
$ curl -X POST --data-binary @records.cdx http://localhost:8080/myindex
Added 542 records

The canonicalized URL (first field) is ignored; OutbackCDX performs its own canonicalization.

By default OutbackCDX will not ingest any records from a POSTed CDX file if any of its lines are invalid. If you wish to skip malformed lines and have OutbackCDX ingest all the other, valid lines, add the parameter badLines with the value skip. Example:

$ curl -X POST --data-binary @records.cdx http://localhost:8080/myindex?badLines=skip

Limitation: Loading an extremely large number of CDX records in one POST request can cause an out-of-memory error. Until this is fixed you may need to break your request up into several smaller ones. Most users send one POST per WARC file.
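One workaround, sketched below (the chunk size is arbitrary; tune it to your data), is to split the CDX file and POST each piece separately:

$ split -l 100000 records.cdx chunk-
$ for f in chunk-*; do curl -X POST --data-binary @"$f" http://localhost:8080/myindex; done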

Deleting records

Deleting records works the same way as loading them. POST the records you wish to delete to /{collection}/delete:

$ curl -X POST --data-binary @records.cdx http://localhost:8080/myindex/delete
Deleted 542 records

When deleting, OutbackCDX does not check whether the records actually existed in the index. Deleting non-existent records has no effect and will not cause an error.

Querying

Records can be queried in CDX format:

$ curl 'http://localhost:8080/myindex?url=example.org'
org,example)/ 20030402160014 http://example.org/ text/html 200 MOH7IEN2JAEJOHYXIEPEEGHOHG5VI=== - - 2248 396 mycrawl.warc.gz

CDX formatted as JSON arrays:

$ curl 'http://localhost:8080/myindex?url=example.org&output=json'
[
  [
    "org,example)/",
    20030402160014,
    "http://example.org/",
    "text/html",
    200,
    "MOH7IEN2JAEJOHYXIEPEEGHOHG5VI===",
    2248,
    396,
    "mycrawl.warc.gz"
  ]
]

OpenWayback "OpenSearch" XML:

$ curl 'http://localhost:8080/myindex?q=type:urlquery+url:http%3A%2F%2Fexample.org%2F'
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
   <results>
       <result>
           <compressedoffset>396</compressedoffset>
           <compressedendoffset>2248</compressedendoffset>
           <mimetype>text/html</mimetype>
           <file>mycrawl.warc.gz</file>
           <redirecturl>-</redirecturl>
           <urlkey>org,example)/</urlkey>
           <digest>MOH7IEN2JAEJOHYXIEPEEGHOHG5VI===</digest>
           <httpresponsecode>200</httpresponsecode>
           <robotflags>-</robotflags>
           <url>http://example.org/</url>
           <capturedate>20030402160014</capturedate>
       </result>
   </results>
   <request>
       <startdate>19960101000000</startdate>
       <enddate>20180526162512</enddate>
       <type>urlquery</type>
       <firstreturned>0</firstreturned>
       <url>org,example)/</url>
       <resultsrequested>10000</resultsrequested>
       <resultstype>resultstypecapture</resultstype>
       <numreturned>1</numreturned>
       <numresults>1</numresults>
   </request>
</wayback>

Query URLs that match a given URL prefix:

$ curl 'http://localhost:8080/myindex?url=http://example.org/abc&matchType=prefix'

Find the first 5 URLs with a given domain:

$ curl 'http://localhost:8080/myindex?url=example.org&matchType=domain&limit=5'

Find the next 10 URLs in the index starting from the given URL prefix:

$ curl 'http://localhost:8080/myindex?url=http://example.org/abc&matchType=range&limit=10'

Return results in reverse order:

$ curl 'http://localhost:8080/myindex?url=example.org&sort=reverse'

Return results ordered closest to furthest from a given timestamp:

$ curl 'http://localhost:8080/myindex?url=example.org&sort=closest&closest=20030402172120'
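Individual fields can also be selected with the fl parameter (the field names shown here follow the pywb convention and are an assumption; consult the API Documentation for the definitive list):

$ curl 'http://localhost:8080/myindex?url=example.org&fl=urlkey,timestamp,original'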

See the API Documentation for more details about the available options.

Configuring replay tools

OpenWayback

Point Wayback at an OutbackCDX index by configuring a RemoteResourceIndex. See the example RemoteCollection.xml shipped with OpenWayback.

    <property name="resourceIndex">
      <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
        <property name="searchUrlBase" value="http://localhost:8080/myindex" />
      </bean>
    </property>

PyWb

Create a pywb config.yaml file containing:

collections:
  testcol:
    archive_paths: /tmp/warcs/
    #archive_paths: http://remote.example.org/warcs/
    index_paths: cdx+http://localhost:8080/myindex

See pywb's documentation for more details.

Recommendation: In some cases where an index contains a huge number of snapshots of the same URL, pywb can request too many records and run out of memory. To prevent this from happening, it is currently recommended to run OutbackCDX with a record scan cap such as -r 10000 when using it with pywb. Future versions of OutbackCDX will likely apply a default limit to the number of records returned.
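For example, a record scan cap can be set like so (the data directory is illustrative):

$ java -Xmx512m -jar outbackcdx*.jar -d /var/lib/outbackcdx -r 10000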

Heritrix

The ukwa-heritrix project includes some classes that allow OutbackCDX to be used as a source of deduplication data for Heritrix crawls.

Access Control

Access control can be enabled by setting the following environment variable:

EXPERIMENTAL_ACCESS_CONTROL=1

Rules can be configured through the GUI. Have Wayback or other clients query a particular named access point; for example, to query the 'public' access point:

http://localhost:8080/myindex/ap/public
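A minimal sketch, assuming an index named 'myindex' already exists:

$ EXPERIMENTAL_ACCESS_CONTROL=1 java -Xmx512m -jar outbackcdx*.jar -d /var/lib/outbackcdx
$ curl 'http://localhost:8080/myindex/ap/public?url=example.org'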

See docs/access-control.md for details of the access control model.

Canonicalisation Aliases

Alias records group URLs together so that their captures are delivered as if they were different snapshots of the same page.

@alias <alias-url> <target-url>

For example, if an index contains records for www.example.org URLs, aliases can be added so queries for legacy.example.org URLs will resolve to www.example.org:

@alias http://legacy.example.org/page-one http://www.example.org/page1
@alias http://legacy.example.org/page-two http://www.example.org/page2

Aliases do not currently work with URL prefix queries. Aliases are resolved after normal canonicalisation rules are applied.

Aliases can be mixed with regular CDX lines, either in the same file or separate files, and in any order. Any existing records whose canonicalised URL is affected by an alias rule will be updated when the alias is added to the index, as in the sketch below.
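For example, an alias line can simply be appended to an ordinary 11-field CDX file before POSTing (the record values here are illustrative):

$ cat mixed.cdx
- 20030402160014 http://www.example.org/page1 text/html 200 AAAA - - 2248 396 mycrawl.warc.gz
@alias http://legacy.example.org/page-one http://www.example.org/page1
$ curl -X POST --data-binary @mixed.cdx http://localhost:8080/myindex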

Aliases can be deleted, but writing new records while simultaneously deleting aliases that affect them may result in an inconsistent index.

Tuning Memory Usage

RocksDB keeps some data in memory (binary search index, bloom filter) for each open SST file. This improves performance at the cost of using more memory. By default OutbackCDX uses the following heuristic to limit the maximum number of open SST files, in an attempt not to exhaust the system's memory.

RocksDB max_open_files = (totalSystemRam / 2 - maxJvmHeap) / 10 MB

This default may not be suitable when multiple large indexes are in use or when OutbackCDX is sharing a server with many other processes. You can override the limit with OutbackCDX's -m option.

If you find OutbackCDX using too much memory, or you need more performance, try adjusting the limit. The optimal setting will depend on your index size and hardware. If you have a lot of memory, -m -1 (no limit) will allow RocksDB to open all SST files on startup and should give the best query performance. However, with slow disks it can also make startup very slow. You may also need to increase the kernel's maximum open file descriptor limit (ulimit -n).
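For example, hedged sketches of the two extremes:

$ java -Xmx512m -jar outbackcdx*.jar -m -1     # plenty of RAM: open all SST files upfront
$ java -Xmx512m -jar outbackcdx*.jar -m 128    # shared host: tight cap on open SST files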

Also make sure you limit the Java heap size with a JVM option like -Xmx512m. By default Java will allow the heap to grow to half the size of physical RAM, which is usually excessive.

Authorization

By default OutbackCDX is unsecured and assumes some external method of authorization, such as firewall rules or a reverse proxy, is used to secure it. Take care not to expose it to the public internet.

Alternatively one of the following authorization methods can be enabled.

Generic JWT authorization

Authorization to modify the index and access control rules can be controlled using JSON Web Tokens. To enable this you will typically use some sort of separate authentication server to sign the JWTs.

OutbackCDX's -j option takes two arguments: a JWKS URL for the public key of the auth server, and a slash-delimited path to the list of permissions in the JWT received as an HTTP bearer token. Refer to your auth server's documentation for the values to use.
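A hedged sketch (the JWKS URL and permissions path below are hypothetical; substitute the values for your auth server):

$ java -jar outbackcdx*.jar -j https://auth.example.org/.well-known/jwks.json resource_access/outbackcdx/roles
$ curl -H "Authorization: Bearer $TOKEN" -X POST --data-binary @records.cdx http://localhost:8080/myindex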

Currently the OutbackCDX web dashboard does not support generic JWT/OIDC authorization. (Patches welcome.)

Keycloak authorization

OutbackCDX can use Keycloak as an auth server to secure both the API and dashboard.

  1. In your Keycloak realm's settings create a new client for OutbackCDX with the protocol openid-connect and the URL of your OutbackCDX instance.
  2. Under the client's roles tab create the following roles:
    • index_edit - can create or delete index records
    • rules_edit - can create, modify or delete access rules
    • policies_edit - can create, modify or delete access policies
  3. Map your users or service accounts to these client roles as appropriate.
  4. Run OutbackCDX with this option:
-k https://{keycloak-server}/auth {realm} {client-id}

Note: JWT authentication will be enabled automatically when using Keycloak. You don't need to set the -j option.
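For scripted API access, a bearer token can be obtained from Keycloak's token endpoint (a sketch; the realm, client id and secret are illustrative, and the endpoint path assumes the legacy /auth prefix used above):

$ TOKEN=$(curl -s -d 'grant_type=client_credentials' -d 'client_id=outbackcdx' -d 'client_secret=...' \
      https://keycloak.example.org/auth/realms/myrealm/protocol/openid-connect/token | jq -r .access_token)
$ curl -H "Authorization: Bearer $TOKEN" -X POST --data-binary @records.cdx http://localhost:8080/myindex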

HMAC fields

OutbackCDX can be configured to compute a field using an HMAC or cryptographic digest. This feature is intended for use in conjunction with a web server or cloud storage provider that grants temporary access to WARC files via signed URLs. To allow compatibility with a variety of different storage servers, the structure of the message and field values are configured using templates.

--hmac-field name algorithm message-template field-template secret-key expiry-secs     

The field will be made available as name to the fl CDX query parameter. Multiple HMAC fields can be defined as long as they have different names.

The algorithm may be one of HmacSHA256, HmacSHA1, HmacMD5, SHA-256, SHA-1, MD5 or any other MAC or MessageDigest from a Java security provider. Your system may have additional algorithms available depending on the version and configuration of Java.

The message-template configures the input to the HMAC or digest function. See the list of template variables below.

The field-template configures the field value returned and is typically used to construct a URL. See the list of template variables below.

The secret-key is the key for the HMAC function. When using non-HMAC digest functions (which don't have a natural key parameter) the key may be substituted into the message-template using $secret_key.

The expiry-secs parameter is used to calculate an expiry time for this secure link. If you don't use the $expires variable just set it to zero.

Template variables

In addition to the fields of each capture record ($filename, $length, $offset etc) the following extra variables are available in templates:

  • $dollar - a dollar sign ("$")
  • $expires - expiry time in seconds since unix epoch
  • $expires_hex - expiry time in hexadecimal seconds since unix epoch
  • $expires_iso8601 - expiry time as a UTC ISO 8601 timestamp
  • $hmac_base64 - computed hmac/digest value as a base64 string (only available in the field template)
  • $hmac_base64_pct - computed hmac/digest value as a base64 string with + encoded as %2B
  • $hmac_base64_url - computed hmac/digest value as a base64 url-safe string
  • $hmac_hex - computed hmac/digest value as a hex string (only available in the field template)
  • $secret_key - the secret key (only available in the message template)
  • $now - current time in seconds since unix epoch
  • $now_hex - current time in hexadecimal seconds since unix epoch
  • $now_iso8601 - current time as a UTC ISO 8601 timestamp
  • $CR - a carriage return ("\r")
  • $CRLF - a carriage return line feed ("\r\n")
  • $LF - a line feed ("\n")

The alternative variable syntax ${filename} may also be used.
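To sanity-check a template, the same value can be computed by hand. For instance, for an HmacSHA256 field whose message template expands to '/warcs/mycrawl.warc.gz|1652144620|3600' with secret key 'secret' (all values illustrative), an equivalent openssl invocation would be:

$ printf '/warcs/mycrawl.warc.gz|1652144620|3600' | openssl dgst -sha256 -hmac secret -binary | base64

(Note that openssl emits standard base64; the $hmac_base64_url variable uses the URL-safe alphabet instead.)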

HMAC field examples

Note: The secure link module bundled with nginx uses the insecure MD5 algorithm. Consider using the community-developed HMAC secure link module instead.

Example nginx configuration:

location /warcs/ {
   secure_link $arg_md5,$arg_expires;
   secure_link_md5 "$secure_link_expires|$uri|$http_range|secret";
   if ($secure_link != "1") { return 403; }
   ...
}

Corresponding OutbackCDX option:

--hmac-field warcurl md5 '$expires|/warcs/$filename|$range|$secret_key'
     'http://nginx.example.org/warcs/$filename?expires=$expires&md5=$hmac_base64_url'
     secret 3600

(As yet untested.)

Example nginx configuration:

location /warcs/ {
   secure_link_hmac  $arg_st,$arg_ts,$arg_e;
   secure_link_hmac_algorithm sha256;
   secure_link_hmac_secret secret;
   secure_link_hmac_message $uri|$arg_ts|$arg_e|$http_range;
   if ($secure_link_hmac != "1") { return 403; }
   ...
}

Corresponding OutbackCDX option:

--hmac-field warcurl Hmacsha256 '/warcs/$filename|$now|3600|$http_range'
     'http://nginx.example.org/warcs/$filename?st=$hmac_base64_url&ts=$now&e=3600'
     secret 0

Example lighttpd configuration:

secdownload.algorithm       = "hmac-sha256" 
secdownload.secret          = "secret" 
secdownload.document-root   = "/data/warcs/" 
secdownload.uri-prefix      = "/warcs/" 
secdownload.timeout         = 3600

Corresponding OutbackCDX option:

--hmac-field warcurl Hmacsha256 '/$now_hex/$filename'
   'http://lighttpd.example.org/warcs/$hmac_base64_url/$now_hex/$filename' secret 0

S3 signed URLs

(Based on the S3 documentation but as yet untested.)

Replace s3-access-key-id, s3-secret-key and bucket with appropriate values:

--hmac-field url Hmacsha1 'GET$LF$LF$LF$expires$LF/bucket/$filename'
     'https://s3.amazonaws.com/bucket/$filename?AWSAccessKeyId=s3-access-key-id&Expires=$expires&Signature=$hmac_base64_pct'
     s3-secret-key 3600 

outbackcdx's People

Contributors

anjackson, ato, dependabot-preview[bot], dependabot[bot], galgeek, greg-pendlebury, hyl, kaij, machawk1, nlevitt, nmunro


outbackcdx's Issues

Better error messages for incorrectly encoded xmlquery

/trove?q=type:urlquery+url:http://www.tisn.gov.au/Documents/CIPMA+tasking+application+form.doc unhelpfully returns
java.lang.ArrayIndexOutOfBoundsException

(The problem is that the value of the url field needs to be url-encoded twice: once for the query string and once for OpenSearch.)

Categories of/reasons for takedown request [ACC6]

Including categories of takedown requests (as identified in the Takedown Request webform) in the Access Control Tool would assist staff in assessing takedown requests. In addition, the tool should be able to report on takedowns by category (defamation, obscenity, privacy etc).

Possible race condition under load

I've been running some load tests on OutbackCDX, and as indicated in this comment when I run 1000 threads (all running the same GET 100 times), I start seeing odd errors:

Premature end of chunk coded message body: closing chunk expected
Socket closed

It's rock solid at 750 threads, but at 1000 it goes wonky. The same client works fine at 1000 threads when running against an NGINX instance configured to respond as if it was OutbackCDX.

So, this seems like it might be a subtle race condition under load in OutbackCDX itself?

EDIT Sorry, should have mentioned that this is irrespective of the number of threads OutbackCDX is configured to use (as long as it's plenty!) and doesn't seem to be related to ulimits (which manifest themselves differently).

CDX11+3 support

It'd be very useful if outback supported CDX11+3. I've done some initial work here: #65, although it may require rework/guidance on its suitability.

Option to enable fsync on writes

By default RocksDB writes batches asynchronously. This means that in a power-loss scenario it's possible for recently committed writes to be lost. This may or may not be acceptable depending on how you use OutbackCDX. Also, depending on your hardware (e.g. battery-backed write caches), fsync can be either quite slow or quite fast, so we probably want to default to sync for safety but make it configurable.

Delete function

POSTing a CDX file to /{collection}/delete should delete the submitted records from the index.

Support for robotsflag field based on OpenWayback behaviour

We were looking at using the <robotstxt> field, which is currently not supported by tinycdxserver, and were wondering if you'd be happy for us to submit a pull request that enables it?

The implementation in OpenWayback is rather odd, in that it populates this field using the M meta tags (AIF) field (see here). It's not clear why the meta tags field becomes the robotstxt field, but AFAICT this is the only way to populate that field via the CDX format.

It doesn't look like too difficult a change, but given that it's nearly there but commented out I thought I'd better ask if there's a problem? Presumably the indexes won't be compatible either?

resolving a surt-timestamp collision should not replace a non-revisit record with a revisit record

I observed this with fuzzy match handling, not sure whether this is an issue for other records.

The original records:

com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=1&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=1&rpc_init=1&conviva_environment=test text/html 200 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 3163 294009561 ...
com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=2&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704 ...

Only one record returned from outbackcdx after ingest:

fuzzy:com,twitter)/i/videos/tweet/731859894710718465? 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704  ...

Note that I was able to update this record properly by re-posting only the status 200 record.

Handling of `+` characters in queries

Our playback system is making requests like this:

http://bigcdx.n45.wa.bl.uk:9090/data-heritrix?url=http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png

and getting no hits. This works though:

http://bigcdx.n45.wa.bl.uk:9090/data-heritrix?url=http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African%2Bwomen4.png

giving

com,blogspot,bp,3)/-w8iwj9tfz-i/utcs2d5pt-i/aaaaaaaaai4/8bcbtlsj3ti/s320/african+women4.png 20130307094428 http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png image/png 200 GRDKSKQHAA72NSYKH6UUAOAELGHBKGPW - - 0 1840918 /data/93028436/92898398/WARCS/BL-92898398-20130307094424-00000-safari.bl.uk.warc.gz

Is OutbackCDX not handling + characters correctly? Or is the client expected to handle this escaping, i.e.

http://bigcdx.n45.wa.bl.uk:9090/data-heritrix?url=http%3A%2F%2F3.bp.blogspot.com%2F-W8IWj9tFz-I%2FUTCS2D5Pt-I%2FAAAAAAAAAI4%2F8BCbTLsJ3tI%2Fs320%2FAfrican%2Bwomen4.png

(Currently pywb thinks this is an OutbackCDX problem)

Handling URLs that end with *

In a wide crawl, we appear to be hitting URLs that end with *, which leads to queries to OutbackCDX that look like:

/dc?limit=1&sort=reverse&url=https%3A%2F%2Fhips.hearstapps.com%2Ftoc.h-cdn.co%2Fassets%2F16%2F46%2F3200x1600%2Flandscape-1479498518-cindy-crawford-rande-gerber-house.jpg%3Fresize%3D1200%3A*

The * on the end forces the matchType to be PREFIX and this is true even if you specify a matchType parameter, and even if the * is encoded as %2A.

For now, I'll work around it but I'd like to know how best to handle this situation in the future.

Thanks!

Link Access Control Tool to Records Management system [ACC2]

Add a free text field to the 'Rule Edit' screen in Outback CDX.
Label the field "RefTracker question number (for takedown requests)".
This field is not mandatory and will be left blank in access rules which are not tied to a formal takedown request.

Possible infinite loop upon malformed requests

This may be an old problem, as this is an error we're seeing in our old version, but we seem to have hit a condition where tinycdxserver gets caught in an infinite error loop and spools out huge amounts of error logging. The errors look like this:

java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
        at java.net.URLDecoder.decode(URLDecoder.java:187)
        at tinycdxserver.XmlQuery.decodeQueryString(XmlQuery.java:49)
        at tinycdxserver.XmlQuery.<init>(XmlQuery.java:32)
        at tinycdxserver.XmlQuery.query(XmlQuery.java:192)
        at tinycdxserver.Server.query(Server.java:182)
        at tinycdxserver.Server.serve(Server.java:126)
        at tinycdxserver.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:831)
        at tinycdxserver.NanoHTTPD$1$1.run(NanoHTTPD.java:205)
        at java.lang.Thread.run(Thread.java:745)
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:141)
        at tinycdxserver.NanoHTTPD$Response.send(NanoHTTPD.java:535)
        at tinycdxserver.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:840)
        at tinycdxserver.NanoHTTPD$1$1.run(NanoHTTPD.java:205)
        at java.lang.Thread.run(Thread.java:745)

As I said, this is the old version so may no longer be relevant, but is there any way this error could lead to an infinite loop? I notice there's a while (!myServerSocket.isClosed()) so perhaps some edge case is stopping the socket getting closed? Or possibly at while (!finalAccept.isClosed()).

Upgrade rocksdb

We're quite a few releases behind now and in particular we're running into compatibility problems with newer build environments (Docker, Apple Silicon).

I believe it should be backwards compatible but want to do a test with a larger index to make sure there are no surprises.

CDXJ: Error: no such capture field: method

When posting a CDXJ file (generated with pywb 2.6.7) to the OutbackCDX on DockerHub (v0.11.0?) like so

curl -X POST --data-binary @index.cdxj http://localhost:8080/coll

I'm seeing the following error get printed to the console:

At line: com,google-analytics)/collect?__wb_method=post&__wb_post_data=dj0xjl92pwo5nizhaxa9mszhptc2ndcxodg1myz0pxbhz2v2awv3jl9zptemzgw9ahr0chmlm0elmkylmkzhcg9klm5hc2euz292jtjgyxbvzcuyrmfwmjiwmza3lmh0bwwmzha9jtjgyxbvzcuyrmfwmjiwmza3lmh0bwwmdww9zw4tdxmmzgu9vvrgltgmzhq9qvbprcuzqsuymdiwmjilmjbnyxjjacuymdclmjatjtiwqsuymexpb24lmjbpbiuyme9yaw9ujnnkpte2lwjpdczzcj0xmzywedewmjamdna9mta1mhg4odamamu9mczfdxrtyt0xmtm2otk1ndmumtc1ntcxnza2mi4xnjuymtq0nja0lje2ntixndq2mjaumty1mje0ndyymc4xjl91dg16ptexmzy5otu0my4xnjuymtq0njiwljeums51dg1jc3ilm0qozglyzwn0ksu3q3v0bwnjbiuzrchkaxjly3qpjtdddxrty21kjtnekg5vbmupjl91dg1odd0xnjuymtq0njk0mdg5jl91pvfbq0nbuufcfizqawq9jmdqawq9jmnpzd0xnzu1nze3mdyylje2ntixndq2mdqmdglkpvvbltmzntizmtq1ltemx2dpzd00ntc1ndm3mc4xnjuymtq0nja0jmnkmt1oqvnbjmnkmj1oqvnbjtiwlsuymgfwb2qubmfzys5nb3ymy2qzptiwmtgxmdewjtiwdjqumsuymc0lmjbvbml2zxjzywwlmjbbbmfsexrpy3mmy2q0pxvuc3bly2lmawvkjtnbyxbvzc5uyxnhlmdvdizjzdu9dw5zcgvjawzpzwqlm0fhcg9klm5hc2euz292jmnknj1odhrwcyuzqsuyriuyrmrhcc5kawdpdgfsz292lmdvdiuyrlvuaxzlcnnhbc1gzwrlcmf0zwqtqw5hbhl0awnzlu1pbi5qcyzjzdc9ahr0chmlm0emej0xmjc2mdq0mjew 20220510010455 {"url":"https://www.google-analytics.com/collect","mime":"image/gif","status":"200","digest":"B5HJFHOVXMSWJ55LTR3DHDQE4KJKIKWO","length":"651","offset":"49132028","method":"POST","requestBody":"__wb_post_data=dj0xJl92PWo5NiZhaXA9MSZhPTc2NDcxODg1MyZ0PXBhZ2V2aWV3Jl9zPTEmZGw9aHR0cHMlM0ElMkYlMkZhcG9kLm5hc2EuZ292JTJGYXBvZCUyRmFwMjIwMzA3Lmh0bWwmZHA9JTJGYXBvZCUyRmFwMjIwMzA3Lmh0bWwmdWw9ZW4tdXMmZGU9VVRGLTgmZHQ9QVBPRCUzQSUyMDIwMjIlMjBNYXJjaCUyMDclMjAtJTIwQSUyMExpb24lMjBpbiUyME9yaW9uJnNkPTE2LWJpdCZzcj0xMzYweDEwMjAmdnA9MTA1MHg4ODAmamU9MCZfdXRtYT0xMTM2OTk1NDMuMTc1NTcxNzA2Mi4xNjUyMTQ0NjA0LjE2NTIxNDQ2MjAuMTY1MjE0NDYyMC4xJl91dG16PTExMzY5OTU0My4xNjUyMTQ0NjIwLjEuMS51dG1jc3IlM0QoZGlyZWN0KSU3Q3V0bWNjbiUzRChkaXJlY3QpJTdDdXRtY21kJTNEKG5vbmUpJl91dG1odD0xNjUyMTQ0Njk0MDg5Jl91PVFBQ0NBUUFCfiZqaWQ9JmdqaWQ9JmNpZD0xNzU1NzE3MDYyLjE2NTIxNDQ2MDQmdGlkPVVBLTMzNTIzMTQ1LTEmX2dpZD00NTc1NDM3MC4xNjUyMTQ0NjA0JmNkMT1OQVNBJmNkMj1OQVNBJTIwLSUyMGFwb2QubmFzYS5nb3YmY2QzPTIwMTgxMDEwJTIwdjQuMSUyMC0lMjBVbml2ZXJzYWwlMjBBbmFseXRpY3MmY2Q0PXVuc3BlY2lmaWVkJTNBYXBvZC5uYXNhLmdvdiZjZDU9dW5zcGVjaWZpZWQlM0FhcG9kLm5hc2EuZ292JmNkNj1odHRwcyUzQSUyRiUyRmRhcC5kaWdpdGFsZ292LmdvdiUyRlVuaXZlcnNhbC1GZWRlcmF0ZWQtQW5hbHl0aWNzLU1pbi5qcyZjZDc9aHR0cHMlM0Emej0xMjc2MDQ0MjEw","filename":"apod.warc.gz"}
java.lang.IllegalArgumentException: no such capture field: method
	at outbackcdx.Capture.put(Capture.java:548)
	at outbackcdx.Capture.fromCdxjLine(Capture.java:434)
	at outbackcdx.Capture.fromCdxLine(Capture.java:385)
	at outbackcdx.Webapp.post(Webapp.java:249)
	at outbackcdx.Webapp.lambda$new$3(Webapp.java:102)
	at outbackcdx.Web$Route.handle(Web.java:312)
	at outbackcdx.Web$Router.handle(Web.java:236)
	at outbackcdx.Webapp.handle(Webapp.java:594)
	at outbackcdx.Web$Server.serve(Web.java:50)
	at outbackcdx.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:848)
	at outbackcdx.NanoHTTPD$1$1.run(NanoHTTPD.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Other CDXJ files seem to work normally however.

Problems with large cdx files

So I know I am most likely misusing outback, but I have a couple of very large cdx files I'd like to move over to it. However, posting them causes outback to slowly consume more and more memory until it runs out and crashes. The command I'm using to post the data is:

curl -o upload.txt --progress-bar -X POST -T records.cdx http://localhost:8080/myindex

Dashboard access token needs to be refreshed

5 minutes after sign-in the dashboard access token expires. Need to check and refresh it on each API call.

From the KeyCloak docs:

One thing to keep in mind is that the access token by default has a short life expiration so you may need to refresh the access token prior to sending the request. You can do this by the updateToken method. The updateToken method returns a promise object which makes it easy to invoke the service only if the token was successfully refreshed and for example display an error to the user if it wasn’t. For example:

keycloak.updateToken(30).success(function() {
    loadData();
}).error(function() {
    alert('Failed to refresh token');
});

Warc resources not loading

I'm having trouble with a lot of resources not loading when using tinyCDXserver with OpenWayback 2.2.
So far I can query the CDX fine through OpenWayback, and it appears to load the html of the page, but all the subsequent calls for the resources are not being found (css, js, images etc). A sample of my OpenWayback output

Feb 03, 2016 11:57:26 AM org.archive.wayback.resourceindex.RemoteResourceIndex$1 initialValue
SEVERE: Builder is not namespace aware.
Feb 03, 2016 11:57:26 AM org.archive.wayback.resourceindex.RemoteResourceIndex$1 initialValue
SEVERE: Builder is not namespace aware.
Feb 03, 2016 11:57:26 AM org.archive.wayback.resourceindex.RemoteResourceIndex$1 initialValue
SEVERE: Builder is not namespace aware.
Feb 03, 2016 11:57:26 AM org.archive.wayback.resourceindex.RemoteResourceIndex$1 initialValue
SEVERE: Builder is not namespace aware.
Feb 03, 2016 11:57:27 AM org.archive.wayback.webapp.AccessPoint handleReplay
INFO: LOADFAIL: Timeout: Too many retries, limited to 0
Feb 03, 2016 11:57:27 AM org.archive.wayback.webapp.AccessPoint handleReplay
WARNING: (1)LOADFAIL: Self-Redirect: No Closest Match Found /20151108060052/http://www.oversixty.co.nz/ui/css/fonts.css
Feb 03, 2016 11:57:27 AM org.archive.wayback.webapp.AccessPoint logError
WARNING: Runtime Error
org.archive.wayback.exception.ResourceNotAvailableException: Self-Redirect: No Closest Match Found
        at org.archive.wayback.webapp.AccessPoint.handleReplay(AccessPoint.java:811)
        at org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:313)
        at org.archive.wayback.util.webapp.RequestMapper.handleRequest(RequestMapper.java:198)
        at org.archive.wayback.util.webapp.RequestFilter.doFilter(RequestFilter.java:146)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:999)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:565)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:309)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I can see the resources in the CDX file, and when I switch to using the same CDX as a localcdxcollection all the resources load fine.
Two things in the log caught my eye but I'm not sure if they are just red herrings:

  • SEVERE: Builder is not namespace aware.
  • INFO: LOADFAIL: Timeout: Too many retries, limited to 0

Have you guys come across this at all? Is there any configuration that I might be missing (retry limits, namespaces)?
Any help would be much appreciated.

Reporting on access restrictions [ACC1]

There is a need for periodic reporting on takedown and other access restrictions placed on material within the web archive. In particular, reports should be able to differentiate between takedowns and other access restrictions, and between active and inactive restrictions.

Replication secondary applies the last batch over and over

When running in secondary mode OutbackCDX seems to apply the latest write batch over and over even if it has already been applied. I'm not sure this necessarily causes any functional problems but it does mean the index keeps getting updated on disk unnecessarily. The RocksDB log file grows but I assume it will eventually be compacted. It still seems less than ideal though.

How to reproduce:

Run a primary instance:

$ mkdir /tmp/primary
$ java -jar outbackcdx-0.7.0.jar -d /tmp/primary --replication-window 0

Create a collection named 'example' with some record:

$ echo '- 20190101000000 http://example.org/ text/html 200 - - - 1043 333 example.warc.gz' > example.cdx
$ curl --data-binary @example.cdx http://localhost:8080/example

Run a secondary instance:

$ java -jar outbackcdx-0.7.0.jar -d /tmp/secondary -p 8081 --primary http://localhost:8080/example
OutbackCDX http://localhost:8081
Tue Jan 14 17:32:29 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.504s from http://localhost:8080/example/changes?size=10485760&since=0 and our latest sequence number is now 2
Tue Jan 14 17:32:38 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.004s from http://localhost:8080/example/changes?size=10485760&since=1 and our latest sequence number is now 4
Tue Jan 14 17:32:48 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.006s from http://localhost:8080/example/changes?size=10485760&since=1 and our latest sequence number is now 6
Tue Jan 14 17:32:58 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.006s from http://localhost:8080/example/changes?size=10485760&since=1 and our latest sequence number is now 8

CC: @jkafader @nlevitt

Make 'Private Comments' searchable [ACC3]

Currently, Web Archiving staff use the Private Comments field to record information about a particular restriction, including the name of the requestor, as required. Making the 'Private Comments' field searchable would help staff research historical decisions during the assessment process.

Requires a new search UI?

Enhanced audit trail for Access Control [ACC9]

With a user authentication system in place, log the creation and subsequent updates to Access Rules, specifying the username as well as a date/time stamp for each change.
..

At a minimum, as we discussed, could you please add usernames to the Created and Last modified dates, and then staff can add any extra notes they like in the Private Comments?

We may need to add something more sophisticated down the track to log all changes made to access rules, but that’s not considered MVP for this project.

URLs ending with an asterisk aren't found when searching for the same URL

We have an OutbackCDX collection with a URL ending with an asterisk (*):

se,emnordic)/varumarke?cms_searchstring=*&productbrandfamily=multimediaholders2&productbrandfamily=vtseries&type=products 20170901165831 http://www.emnordic.se/varumarke?Type=Products&Productbrandfamily=multimediaholders2&Productbrandfamily=vtseries&CMS_SearchString=* text/html 200 B2YAZ73DHPERWZAUXYT7I2KLMKQSOLRS - - 124649 1388 Svep/2017-1/09/SWE-KB-KW3-BULK-2017-1-20170901165831-05061-srvvm303.kb.se.warc.gz

For simplicity I created a new collection and added some test data:

curl -sSfX POST --data-binary "se,foo)?baz=*&foo=bar 20220822 http://www.foo.se/?foo=bar&baz=* text/html 200 ABCDE - - 12345 123 archive.warc.gz" http://localhost:8085/other
curl -sSfX POST --data-binary "se,foo)?baz=1&foo=bar 20220822 http://www.foo.se/?foo=bar&baz=1 text/html 200 ABCDE - - 12345 123 archive.warc.gz" http://localhost:8085/other
curl -sSfX POST --data-binary "se,foo)?bar=baz&foo=* 20220822 http://www.foo.se/?bar=baz&foo=* text/html 200 ABCDE - - 12345 123 archive.warc.gz" http://localhost:8085/other

Searching for it with the original URL yields no result:

curl -sSfG 'http://localhost:8085/other' --data-urlencode 'url=http://www.foo.se/?foo=bar&baz=*'

but appending an extra ampersand makes the search work:

curl -sSfG 'http://localhost:8085/other' --data-urlencode 'url=http://www.foo.se/?foo=bar&baz=*&'

Also, reordering the parameters so that the URL doesn't end with the asterisk, as well as replacing the asterisk with %2A (which is then encoded again to %252A since we use the curl option --data-urlencode), works:

curl -sSfG 'http://localhost:8085/other' --data-urlencode 'url=http://www.foo.se/?baz=*&foo=bar'
curl -sSfG 'http://localhost:8085/other' --data-urlencode 'url=http://www.foo.se/?foo=bar&baz=%2A'

Note also that the second and third lines of test data don't have the same problem.

Searching in Pywb gives the same results so this does not appear to be a curl issue.

Standardise public messages [ACC4]

Currently, Web Archiving staff compose a unique Public Message for each access rule, as required. The option to choose from a number of standardised messages would provide an efficiency for staff, as well as more consistent messaging for public users. However, any generic messages should be fully configurable by the business area, so they can be adapted as communication needs change over time. In addition, the tool must retain the ability to compose a non-standard Public Message, as not all cases would fall within pre-determined categories/messaging.

Are updates thread-safe?

I've been building a Hadoop indexer that sends lots of requests to tinycdxserver. The 42 WARC files in my test are processed on 14 'mappers' and according to the CDX file output, correspond to 735,856 CDX lines.

If I make the map jobs submit the CDX lines one-by-one, I get an estimated number of records in tinycdxserver as being 735,764. Slightly off, but close enough to be down to the estimation method.

If I submit in tens of CDX lines, however, I get 170,101 estimated records. In fives, I get 232,710 and then if I try again I get 224,335.

I was originally submitting in chunks of 10,000 and observed some very odd dynamics. The estimated number would go up and up and then suddenly reset to near zero. The first time the drop happened, it seemed to be at the same time as the L0 level turned up in the compaction status, i.e. when this line turns up:

  L0      3/0          0   0.8      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.1         0         3    0.047          0       0      0

Shown below in context:

** Compaction Stats [default] **
Level    Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt)  KeyIn KeyDrop
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0          0   0.8      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.1         0         3    0.047          0       0      0
 Sum      3/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      0.1         0         3    0.047          0       0      0
 Int      0/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000          0       0      0
Flush(GB): cumulative 0.000, interval 0.000
Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard

** DB Stats **
Uptime(secs): 1442.7 total, 910.9 interval
Cumulative writes: 83 writes, 735K keys, 83 batches, 1.0 writes per batch, ingest: 0.20 GB, 0.14 MB/s
Cumulative WAL: 83 writes, 83 syncs, 0.99 writes per sync, written: 0.20 GB, 0.14 MB/s
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.1 seconds
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 0 writes, 0 keys, 0 batches, 0.0 writes per batch, ingest: 0.00 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

Estimated number of records: 105237

It seems to be when the .log file is compacted into a table file. Before disappearing, the .log files appear to contain a lot of repetitive URLs, so maybe there's a clue there.

Mark access rules pending review

eg a tickbox in the Rule Edit screen, and some representation of this on the Access Rules tab so pending requests could be identified from the list [...]
to assist in housekeeping, in case we are flooded with takedown requests

Pin the rules to top of the list when flagged?

Handling malformed URIs

We hit an odd edge case. We ended up checking OutbackCDX for some weird URIs thrown up by the crawl, like:

http://allprintjerseyyourlocalembroideryandvinylprintspecialisthomepage/
http://development-social-marketing-strategy-promote-ebola-treatment-seeking-behaviour-sierra-leone/

When querying OutbackCDX, this causes a runtime exception in IDN.toASCII:

java.lang.IllegalArgumentException: The label in the input is too long

Because the domain label is greater than the 63 characters allowed.

Admittedly this is because we're kind of misusing OutbackCDX as a crawl status database, rather than a playback index.

That said, would it be worth returning a 400 Bad Request rather than 500? Is there a more elegant way to handle this?

Display Rule ID in the Access Control Tool [ACC10]

Currently, each rule in the Access Control Tool (Outback CDX) is assigned a system-generated "Rule ID", which is then used to apply the rules to the entire Web Archive solr index. By making these Rule IDs visible in the Access Control Tool interface, it would be easier to maintain reliable records of action taken as a result of a takedown request.

Possible problem with escaping URLs in the OpenSearch API

Firstly, note that the following issue is being seen on an older version of OutbackCDX (back when it was called tinycdxserver; appears to be at 54cb410). This may have been fixed, in which case we'd like to know whether just updating OutbackCDX should work against the existing index files?

The actual issue is that we're hitting oddities with URLs with + in them. e.g.

http://www.nta.nhs.uk/wEWAwLA/fRFApntjN0KAqWf8+4KhJYyV8tPyOxKMnDpaBxSb/scripts/css/css/css/404-error.aspx
http://www.qie.eoe.nhs.uk/SearchResults.aspx?tmName=EMERGENCY+MEDICAL+CARE&geocode=Q35&pubnameexact=NCHOD

If you query directly, like this:

http://192.168.45.21:8080/data-heritrix?url=http%3A%2F%2Fwww.qie.eoe.nhs.uk%2FSearchResults.aspx%3FtmName%3DEMERGENCY%2BMEDICAL%2BCARE%26geocode%3DQ35%26pubnameexact%3DNCHOD

we see the results we expect. But if you use the OpenSearch API

http://192.168.45.21:8080/data-heritrix?q=type:urlquery+url:http%3A//www.qie.eoe.nhs.uk/SearchResults.aspx%3FtmName%3DEMERGENCY%2BMEDICAL%2BCARE%26geocode%3DQ35%26pubnameexact%3DNCHOD

we get

<?xml version="1.0" encoding="UTF-8"?><wayback><error><title>Resource Not In Archive</title><message>The Resource you requested is not in this archive.</message></error></wayback>

Also, if I avoid escaping the + the server goes 500 with an ArrayIndexOutOfBoundsException.

Also, if you attempt to use normal Python requests escaping for the q parameter, the whole thing fails because q=type:urlquery+url:... has its colons, space etc. escaped. Not sure that's incorrect though - I just don't know the OpenSearch spec well enough to be sure.

Clarify handling of de-duplicated WARC records

Can you tell me whether you use de-duplicated WARCs? And if so, if there's any trick to setting up playback in this situation?

I've been trying to use your CDX server with WARCs with revisit records, and it seems to be incompatible with OpenWayback. AFAICT right now, OWB expects the RemoteCollection to handle the deduplication, and the built-in remote collection handler does not resolve duplicates. I suspect this is actually a problem with OWB, but I thought I'd ask here in case you've already resolved this issue.

Handle invalid dates

Currently we allow out-of-range dates to be inserted. This doesn't normally cause problems as the dates aren't parsed. However, the access points system does need to interpret the dates, which causes queries to return no results.

One option is to pad/truncate them to something parsable, as in #82. Another is to just outright reject them upfront.

Report exception to client in WbCdxApi

At the moment, if a WbCdxApi query throws an exception, the query results are truncated. While we can't switch status codes, as we've already returned the header and are streaming results, we could print a message so it's at least obvious to humans that something's gone wrong. We could also maybe use chunked encoding and close early, although some HTTP clients might still not treat that as an error.

XML protocol: numreturned and numresults

We don't currently implement these fields of the xml query protocol:

  • numresults - number of total matching results (?)
  • numreturned - number of results returned (may differ from numresults due to limits)

It seems both are displayed in various places in the OpenWayback default templates. We never needed this at NLA: when we used OpenWayback we had custom templates that didn't display this information, and we do not have many archived URLs that have been captured so often that they need pagination. Pywb's implementation of the XML protocol is based on OutbackCDX's and so does not use either value.

Unfortunately implementing each of them will have some impact.

To implement numreturned we'd need to do one of:

  • buffer the results in memory which opens the door to out of memory errors on large result sets
  • move the <request> element after the <results> element in the XML, it's possible this may break compatibility with some clients
  • perform the query twice, once to count matches and a second time to stream the results

To implement numresults we'd need to count all matching records instead of stopping at the limit. This will cause a performance penalty to any query that matches more results than the limit. Prefix queries which match very large numbers of URLs will begin to have unpredictable and likely sometimes unacceptable performance.

On a positive note, there has been a feature request to just return counts instead of results, and I guess implementing numresults would achieve that when combined with a result limit of zero.

CC @kris-sigur

Better error messages for non-CDX input

e.g. posting the string 'test-integration/test1.cdx' instead of a CDX file (forgetting the @ in the curl command line) currently shows:

$ curl -X POST --data test-integration/test1.cdx  http://localhost:8080/myindex
java.lang.ArrayIndexOutOfBoundsException: 1
At line: test-integration/test1.cdx

User authentication [ACC8]

Require user authentication/login to access the Access Control Tool. As a possible extension, provide both read-only and read/write access to the system, for increased security.

Signed WARC URL generation

@ikreymer has proposed a web archive architecture with replay capability purely client-side, served by a static instance of wabac.js, with WARC files served by a simple static file server (nginx, S3) and OutbackCDX as the only dynamic server-side component. While this is obviously already technically doable, it does mean making the full raw WARC files available for download, which is likely unacceptable for many institutions that have a requirement to implement some level of restrictions or access controls.

Ilya suggested that one solution to this problem would be for the index server to generate signed URLs which include a signature (or some other form of access token) providing temporary access to specific records.

nginx

There are a lot of different nginx modules that can handle URLs with some kind of signature, HMAC or auth token. The stock secure link module would technically work but is probably best avoided as it uses MD5.

A simple example using https://github.com/nginx-modules/ngx_http_hmac_secure_link_module might be:

location /warcs {
    secure_link_hmac  $arg_token,$arg_timestamp,$arg_expiry;
    secure_link_hmac_secret my_secret_key;
    secure_link_hmac_message $uri|$arg_timestamp|$arg_expiry|$http_range;
    secure_link_hmac_algorithm sha256;
    if ($secure_link_hmac != "1") { return 404; }
}

With a URL that looks like:

https://warcstore/something.warc.gz?timestamp=2020-03-09T09:55:46Z&expiry=900&token=98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4

Note how the HMAC is configured to include $http_range which ensures the request is only valid for a single specific byte range.

S3

S3 has signed URLs which works rather similarly:

https://my-warc-store.s3-eu-west-1.amazonaws.com/something.warc.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE/20130721/us-east-1/s3/aws4_request
&X-Amz-Date=20200409T096646Z
&X-Amz-Expires=900
&X-Amz-Signature=13550350a8681c84c861aac2e5b440161c2b33a3e4f302ac680ca5b686de48de
&X-Amz-SignedHeaders=host;range

Sort 'Access Rules' by URL [ACC5]

When applying a new restriction, it would be useful to sort by URL, to identify any existing restrictions (eg by harvest date, embargo periods, related URLs). In particular, sorting URLs in SURT format would allow staff to identify domains eg parliament.gov.au. The ability to search for a particular string eg "climatechange" within a URL is a "nice to have".
