
outbackcdx's Issues

Problems with large cdx files

So I know I am most likely misusing outback, but I have a couple of very large CDX files I'd like to move over to it. However, posting them causes outback to slowly consume more and more memory until it runs out and crashes. The command I'm using to post the data is:

curl -o upload.txt --progress-bar -X POST -T records.cdx http://localhost:8080/myindex
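One possible workaround, assuming the memory growth scales with the size of a single POST body rather than with the index itself, is to split the file and submit it in smaller batches:

# split into chunks of one million CDX lines and post them one at a time
split -l 1000000 records.cdx chunk_
for f in chunk_*; do
  curl -sS --fail -X POST -T "$f" http://localhost:8080/myindex || break
done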

Handle invalid dates

Currently we allow out-of-range dates to be inserted. This doesn't normally cause problems as the dates aren't parsed. However, the access points system does need to interpret the dates, which causes queries to return no results.

One option is to pad/truncate them to something parsable, as in #82. Another is to reject them outright up front.
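As a rough illustration of the pad/truncate option (just a sketch; the January-1/midnight defaults are an assumption, not necessarily what #82 does), a normalisation pass over plain space-separated CDX lines might look like:

# field 2 of a plain CDX line is the 14-digit timestamp
awk '{
  pad = "00000101000000"                                   # supplies missing month/day/time digits
  ts = substr($2, 1, 14)                                   # truncate over-long timestamps
  if (length(ts) < 14) ts = ts substr(pad, length(ts) + 1) # e.g. "2016" -> "20160101000000"
  $2 = ts
  print
}' records.cdx > normalised.cdx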

Signed WARC URL generation

@ikreymer has proposed a web archive architecture with replay handled purely client-side by a static instance of wabac.js, WARC files served by a simple static file server (nginx, S3) and OutbackCDX as the only dynamic server-side component. While technically this is obviously already doable, it does mean making the full raw WARC files available for download, which is likely unacceptable for many institutions that are required to implement some level of restrictions or access controls.

Ilya suggested one solution to this problem would be for the index server to generate signed URLs which include a signature (or some other form of access token) providing temporary access to specific records.

nginx

There are a lot of different nginx modules that can handle URLs with some kind of signature, HMAC or auth token. The stock secure link module would technically work but is probably best avoided as it uses MD5.

A simple example using https://github.com/nginx-modules/ngx_http_hmac_secure_link_module might be:

location /warcs {
    secure_link_hmac  $arg_token,$arg_timestamp,$arg_expiry;
    secure_link_hmac_secret my_secret_key;
    secure_link_hmac_message $uri|$arg_timestamp|$arg_expiry|$http_range;
    secure_link_hmac_algorithm sha256;
    if ($secure_link_hmac != "1") { return 404; }
}

With a URL that looks like:

https://warcstore/something.warc.gz?timestamp=2020-03-09T09:55:46Z&expiry=900&token=98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4

Note how the HMAC is configured to include $http_range which ensures the request is only valid for a single specific byte range.
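For illustration, a token for the configuration above could be minted roughly like this (a sketch only: it assumes the module expects a base64url-encoded digest rather than the hex token shown in the example URL above, so check the module's documentation for the exact format; the byte range here is hypothetical):

uri='/warcs/something.warc.gz'
timestamp='2020-03-09T09:55:46Z'
expiry=900
range='bytes=294009561-294012723'   # hypothetical Range header value
secret='my_secret_key'
token=$(printf '%s|%s|%s|%s' "$uri" "$timestamp" "$expiry" "$range" \
  | openssl dgst -sha256 -hmac "$secret" -binary \
  | openssl base64 | tr '+/' '-_' | tr -d '=')
echo "https://warcstore${uri}?timestamp=${timestamp}&expiry=${expiry}&token=${token}"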

S3

S3 has signed URLs which work rather similarly:

https://my-warc-store.s3-eu-west-1.amazonaws.com/something.warc.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE/20130721/us-east-1/s3/aws4_request
&X-Amz-Date=20200409T096646Z
&X-Amz-Expires=900
&X-Amz-Signature=13550350a8681c84c861aac2e5b440161c2b33a3e4f302ac680ca5b686de48de
&X-Amz-SignedHeaders=host;range
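A roughly equivalent URL can be generated with the AWS CLI, though note that aws s3 presign signs only the host, so unlike the example above the Range header is not covered by the signature:

aws s3 presign s3://my-warc-store/something.warc.gz --expires-in 900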

Resolving a SURT-timestamp collision should not replace a non-revisit record with a revisit record

I observed this with fuzzy match handling; I'm not sure whether this is an issue for other records.

The original records:

com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=1&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=1&rpc_init=1&conviva_environment=test text/html 200 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 3163 294009561 ...
com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=2&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704 ...

Only one record returned from outbackcdx after ingest:

fuzzy:com,twitter)/i/videos/tweet/731859894710718465? 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704  ...

Note that I was able to update this record properly by re-posting only the status 200 record.

Mark access rules pending review

eg a tickbox in the Rule Edit screen, and some representation of this on the Access Rules tab so pending requests could be identified from the list [...]
to assist in housekeeping, in case we are flooded with takedown requests

Pin the rules to top of the list when flagged?

URLs ending with an asterisk aren't found when searching for the same URL

We have an OutbackCDX collection with a URL ending with an asterisk (*):

se,emnordic)/varumarke?cms_searchstring=*&productbrandfamily=multimediaholders2&productbrandfamily=vtseries&type=products 20170901165831 http://www.emnordic.se/varumarke?Type=Products&Productbrandfamily=multimediaholders2&Productbrandfamily=vtseries&CMS_SearchString=* text/html 200 B2YAZ73DHPERWZAUXYT7I2KLMKQSOLRS - - 124649 1388 Svep/2017-1/09/SWE-KB-KW3-BULK-2017-1-20170901165831-05061-srvvm303.kb.se.warc.gz

For simplicity I created a new collection and added some test data:

curl -sSfX POST --data-binary "se,foo)?baz=*&foo=bar 20220822 http://www.foo.se/?foo=bar&baz=* text/html 200 ABCDE - - 12345 123 archive.warc.gz" http://localhost:8085/other
curl -sSfX POST --data-binary "se,foo)?baz=1&foo=bar 20220822 http://www.foo.se/?foo=bar&baz=1 text/html 200 ABCDE - - 12345 123 archive.warc.gz" http://localhost:8085/other
curl -sSfX POST --data-binary "se,foo)?bar=baz&foo=* 20220822 http://www.foo.se/?bar=baz&foo=* text/html 200 ABCDE - - 12345 123 archive.warc.gz" http://localhost:8085/other

Searching for it with the original URL yields no result:

curl -sSfG 'http://localhost:8085/other' --data-urlencode 'url=http://www.foo.se/?foo=bar&baz=*'

but appending an extra ampersand makes the search work:

curl -sSfG 'http://localhost:8085/other' --data-urlencode 'url=http://www.foo.se/?foo=bar&baz=*&'

Reordering the parameters so that the URL doesn't end with the asterisk also works, as does replacing the asterisk with %2A (which is then encoded again to %252A since we use the curl option --data-urlencode):

curl -sSfG 'http://localhost:8085/other' --data-urlencode 'url=http://www.foo.se/?baz=*&foo=bar'
curl -sSfG 'http://localhost:8085/other' --data-urlencode 'url=http://www.foo.se/?foo=bar&baz=%2A'

Note also that the second and third lines of test data don't have the same problem.

Searching in Pywb gives the same results so this does not appear to be a curl issue.

Possible infinite loop upon malformed requests

This may be an old problem, as this is an error we're seeing in our old version, but we seem to have hit a condition where tinycdxserver gets caught in an infinite error loop and spools out huge amounts of error logging. The errors look like this:

java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
        at java.net.URLDecoder.decode(URLDecoder.java:187)
        at tinycdxserver.XmlQuery.decodeQueryString(XmlQuery.java:49)
        at tinycdxserver.XmlQuery.<init>(XmlQuery.java:32)
        at tinycdxserver.XmlQuery.query(XmlQuery.java:192)
        at tinycdxserver.Server.query(Server.java:182)
        at tinycdxserver.Server.serve(Server.java:126)
        at tinycdxserver.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:831)
        at tinycdxserver.NanoHTTPD$1$1.run(NanoHTTPD.java:205)
        at java.lang.Thread.run(Thread.java:745)
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:141)
        at tinycdxserver.NanoHTTPD$Response.send(NanoHTTPD.java:535)
        at tinycdxserver.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:840)
        at tinycdxserver.NanoHTTPD$1$1.run(NanoHTTPD.java:205)
        at java.lang.Thread.run(Thread.java:745)

As I said, this is the old version so it may no longer be relevant, but is there any way this error could lead to an infinite loop? I notice there's a while (!myServerSocket.isClosed()), so perhaps some edge case is stopping the socket from getting closed? Or possibly at while (!finalAccept.isClosed()).

Handling URLs that end with *

In a wide crawl, we appear to be hitting URLs that end with *, which leads to queries to OutbackCDX that look like:

/dc?limit=1&sort=reverse&url=https%3A%2F%2Fhips.hearstapps.com%2Ftoc.h-cdn.co%2Fassets%2F16%2F46%2F3200x1600%2Flandscape-1479498518-cindy-crawford-rande-gerber-house.jpg%3Fresize%3D1200%3A*

The * on the end forces the matchType to be PREFIX and this is true even if you specify a matchType parameter, and even if the * is encoded as %2A.

For now, I'll work around it but I'd like to know how best to handle this situation in the future.

Thanks!

CDX11+3 support

It'd be very useful if outback supported CDX11+3. I've done some initial work here: #65, although it may require rework/guidance on its suitability.

Reporting on access restrictions [ACC1]

There is a need for periodic reporting on takedown and other access restrictions placed on material within the web archive. In particular, reports should be able to differentiate between takedowns and other access restrictions, and between active and inactive restrictions.

Categories of/reasons for takedown request [ACC6]

Including categories of takedown requests (as identified in the Takedown Request webform) in the Access Control Tool would assist staff in assessing takedown requests. In addition, the tool should be able to report on takedowns by category (defamation, obscenity, privacy etc).

Link Access Control Tool to Records Management system [ACC2]

Add a free text field to the 'Rule Edit' screen in Outback CDX.
Label the field "RefTracker question number (for takedown requests)".
This field is not mandatory and will be left blank in access rules which are not tied to a formal takedown request.

Standardise public messages [ACC4]

Currently, Web Archiving staff compose a unique Public Message for each access rule, as required. The option to choose from a number of standardised messages would provide an efficiency for staff, as well as more consistent messaging for public users. However, any generic messages should be fully configurable by the business area, so they can be adapted as communication needs change over time. In addition, the tool must retain the ability to compose a non-standard Public Message, as not all cases would fall within pre-determined categories/messaging.

Report exception to client in WbCdxApi

At the moment, if a WbCdxApi query throws an exception the query results are truncated. While we can't switch status codes, as we've already returned the header and are streaming results, we could print a message so it's at least obvious to humans that something's gone wrong. We could also maybe use chunked encoding and close early, although some HTTP clients might still not treat that as an error.

Option to enable fsync on writes

By default RocksDB writes batches asynchronously. This means that in a power-loss scenario it's possible for recently committed writes to be lost. This may or may not be acceptable depending on how you use OutbackCDX. Also, depending on your hardware (e.g. battery-backed write caches), fsync can be either quite slow or quite fast, so we probably want to default to sync for safety but make it configurable.

Sort 'Access Rules' by URL [ACC5]

When applying a new restriction, it would be useful to sort by URL, to identify any existing restrictions (eg by harvest date, embargo periods, related URLs). In particular, sorting URLs in SURT format would allow staff to identify domains eg parliament.gov.au. The ability to search for a particular string eg "climatechange" within a URL is a "nice to have".

Clarify handling of de-duplicated WARC records

Can you tell me whether you use de-duplicated WARCs? And if so, if there's any trick to setting up playback in this situation?

I've been trying to use your CDX server with WARCs with revisit records, and it seems to be incompatible with OpenWayback. AFAICT right now, OWB expects the RemoteCollection to handle the deduplication, and the built-in remote collection handler does not resolve duplicates. I suspect this is actually a problem with OWB, but I thought I'd ask here in case you've already resolved this issue.

CDXJ: Error: no such capture field: method

When posting a CDXJ file (generated with pywb 2.6.7) to the OutbackCDX on DockerHub (v0.11.0?) like so

curl -X POST --data-binary @index.cdxj http://localhost:8080/coll

I'm seeing the following error get printed to the console:

At line: com,google-analytics)/collect?__wb_method=post&__wb_post_data=dj0xjl92pwo5nizhaxa9mszhptc2ndcxodg1myz0pxbhz2v2awv3jl9zptemzgw9ahr0chmlm0elmkylmkzhcg9klm5hc2euz292jtjgyxbvzcuyrmfwmjiwmza3lmh0bwwmzha9jtjgyxbvzcuyrmfwmjiwmza3lmh0bwwmdww9zw4tdxmmzgu9vvrgltgmzhq9qvbprcuzqsuymdiwmjilmjbnyxjjacuymdclmjatjtiwqsuymexpb24lmjbpbiuyme9yaw9ujnnkpte2lwjpdczzcj0xmzywedewmjamdna9mta1mhg4odamamu9mczfdxrtyt0xmtm2otk1ndmumtc1ntcxnza2mi4xnjuymtq0nja0lje2ntixndq2mjaumty1mje0ndyymc4xjl91dg16ptexmzy5otu0my4xnjuymtq0njiwljeums51dg1jc3ilm0qozglyzwn0ksu3q3v0bwnjbiuzrchkaxjly3qpjtdddxrty21kjtnekg5vbmupjl91dg1odd0xnjuymtq0njk0mdg5jl91pvfbq0nbuufcfizqawq9jmdqawq9jmnpzd0xnzu1nze3mdyylje2ntixndq2mdqmdglkpvvbltmzntizmtq1ltemx2dpzd00ntc1ndm3mc4xnjuymtq0nja0jmnkmt1oqvnbjmnkmj1oqvnbjtiwlsuymgfwb2qubmfzys5nb3ymy2qzptiwmtgxmdewjtiwdjqumsuymc0lmjbvbml2zxjzywwlmjbbbmfsexrpy3mmy2q0pxvuc3bly2lmawvkjtnbyxbvzc5uyxnhlmdvdizjzdu9dw5zcgvjawzpzwqlm0fhcg9klm5hc2euz292jmnknj1odhrwcyuzqsuyriuyrmrhcc5kawdpdgfsz292lmdvdiuyrlvuaxzlcnnhbc1gzwrlcmf0zwqtqw5hbhl0awnzlu1pbi5qcyzjzdc9ahr0chmlm0emej0xmjc2mdq0mjew 20220510010455 {"url":"https://www.google-analytics.com/collect","mime":"image/gif","status":"200","digest":"B5HJFHOVXMSWJ55LTR3DHDQE4KJKIKWO","length":"651","offset":"49132028","method":"POST","requestBody":"__wb_post_data=dj0xJl92PWo5NiZhaXA9MSZhPTc2NDcxODg1MyZ0PXBhZ2V2aWV3Jl9zPTEmZGw9aHR0cHMlM0ElMkYlMkZhcG9kLm5hc2EuZ292JTJGYXBvZCUyRmFwMjIwMzA3Lmh0bWwmZHA9JTJGYXBvZCUyRmFwMjIwMzA3Lmh0bWwmdWw9ZW4tdXMmZGU9VVRGLTgmZHQ9QVBPRCUzQSUyMDIwMjIlMjBNYXJjaCUyMDclMjAtJTIwQSUyMExpb24lMjBpbiUyME9yaW9uJnNkPTE2LWJpdCZzcj0xMzYweDEwMjAmdnA9MTA1MHg4ODAmamU9MCZfdXRtYT0xMTM2OTk1NDMuMTc1NTcxNzA2Mi4xNjUyMTQ0NjA0LjE2NTIxNDQ2MjAuMTY1MjE0NDYyMC4xJl91dG16PTExMzY5OTU0My4xNjUyMTQ0NjIwLjEuMS51dG1jc3IlM0QoZGlyZWN0KSU3Q3V0bWNjbiUzRChkaXJlY3QpJTdDdXRtY21kJTNEKG5vbmUpJl91dG1odD0xNjUyMTQ0Njk0MDg5Jl91PVFBQ0NBUUFCfiZqaWQ9JmdqaWQ9JmNpZD0xNzU1NzE3MDYyLjE2NTIxNDQ2MDQmdGlkPVVBLTMzNTIzMTQ1LTEmX2dpZD00NTc1NDM3MC4xNjUyMTQ0NjA0JmNkMT1OQVNBJmNkMj1OQVNBJTIwLSUyMGFwb2QubmFzYS5nb3YmY2QzPTIwMTgxMDEwJTIwdjQuMSUyMC0lMjBVbml2ZXJzYWwlMjBBbmFseXRpY3MmY2Q0PXVuc3BlY2lmaWVkJTNBYXBvZC5uYXNhLmdvdiZjZDU9dW5zcGVjaWZpZWQlM0FhcG9kLm5hc2EuZ292JmNkNj1odHRwcyUzQSUyRiUyRmRhcC5kaWdpdGFsZ292LmdvdiUyRlVuaXZlcnNhbC1GZWRlcmF0ZWQtQW5hbHl0aWNzLU1pbi5qcyZjZDc9aHR0cHMlM0Emej0xMjc2MDQ0MjEw","filename":"apod.warc.gz"}
java.lang.IllegalArgumentException: no such capture field: method
	at outbackcdx.Capture.put(Capture.java:548)
	at outbackcdx.Capture.fromCdxjLine(Capture.java:434)
	at outbackcdx.Capture.fromCdxLine(Capture.java:385)
	at outbackcdx.Webapp.post(Webapp.java:249)
	at outbackcdx.Webapp.lambda$new$3(Webapp.java:102)
	at outbackcdx.Web$Route.handle(Web.java:312)
	at outbackcdx.Web$Router.handle(Web.java:236)
	at outbackcdx.Webapp.handle(Webapp.java:594)
	at outbackcdx.Web$Server.serve(Web.java:50)
	at outbackcdx.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:848)
	at outbackcdx.NanoHTTPD$1$1.run(NanoHTTPD.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Other CDXJ files seem to work normally however.
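Until those fields are supported, one possible client-side workaround (a sketch only; it assumes the usual "urlkey timestamp {json}" CDXJ line layout and discards the POST-request metadata, which may or may not matter for your replay setup) is to strip the offending fields before posting:

# remove the fields this OutbackCDX build rejects, then post the cleaned file
while read -r key ts json; do
  printf '%s %s %s\n' "$key" "$ts" "$(jq -c 'del(.method, .requestBody)' <<< "$json")"
done < index.cdxj > cleaned.cdxj
curl -X POST --data-binary @cleaned.cdxj http://localhost:8080/coll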

Possible problem with escaping URLs in the OpenSearch API

Firstly, note that the following issue is being seen on an older version of OutbackCDX (back when it was called tinycdxserver; it appears to be at 54cb410). This may have been fixed, in which case we'd like to know whether just updating OutbackCDX should work against the existing index files.

The actual issue is that we're hitting oddities with URLs with + in them. e.g.

http://www.nta.nhs.uk/wEWAwLA/fRFApntjN0KAqWf8+4KhJYyV8tPyOxKMnDpaBxSb/scripts/css/css/css/404-error.aspx
http://www.qie.eoe.nhs.uk/SearchResults.aspx?tmName=EMERGENCY+MEDICAL+CARE&geocode=Q35&pubnameexact=NCHOD

If you query directly, like this:

http://192.168.45.21:8080/data-heritrix?url=http%3A%2F%2Fwww.qie.eoe.nhs.uk%2FSearchResults.aspx%3FtmName%3DEMERGENCY%2BMEDICAL%2BCARE%26geocode%3DQ35%26pubnameexact%3DNCHOD

we see the results we expect. But if you use the OpenSearch API

http://192.168.45.21:8080/data-heritrix?q=type:urlquery+url:http%3A//www.qie.eoe.nhs.uk/SearchResults.aspx%3FtmName%3DEMERGENCY%2BMEDICAL%2BCARE%26geocode%3DQ35%26pubnameexact%3DNCHOD

we get

<?xml version="1.0" encoding="UTF-8"?><wayback><error><title>Resource Not In Archive</title><message>The Resource you requested is not in this archive.</message></error></wayback>

Also, if I avoid escaping the + the server returns a 500 with an ArrayIndexOutOfBoundsException.

Also, if you attempt to use normal Python requests escaping for the q parameter, the whole thing fails because the colons, space etc. in q=type:urlquery+url:... get escaped. I'm not sure that's incorrect though; I just don't know the OpenSearch spec well enough to be sure.

Replication secondary applies the last batch over and over

When running in secondary mode OutbackCDX seems to apply the latest write batch over and over even if it has already been applied. I'm not sure this necessarily causes any functional problems but it does mean the index keeps getting updated on disk unnecessarily. The RocksDB log file grows but I assume it will eventually be compacted. It still seems less than ideal though.

How to reproduce:

Run a primary instance:

$ mkdir /tmp/primary
$ java -jar outbackcdx-0.7.0.jar -d /tmp/primary --replication-window 0

Create a collection named 'example' with some record:

$ echo '- 20190101000000 http://example.org/ text/html 200 - - - 1043 333 example.warc.gz' > example.cdx
$ curl --data-binary @example.cdx http://localhost:8080/example

Run a secondary instance:

$ java -jar outbackcdx-0.7.0.jar -d /tmp/secondary -p 8081 --primary http://localhost:8080/example
OutbackCDX http://localhost:8081
Tue Jan 14 17:32:29 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.504s from http://localhost:8080/example/changes?size=10485760&since=0 and our latest sequence number is now 2
Tue Jan 14 17:32:38 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.004s from http://localhost:8080/example/changes?size=10485760&since=1 and our latest sequence number is now 4
Tue Jan 14 17:32:48 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.006s from http://localhost:8080/example/changes?size=10485760&since=1 and our latest sequence number is now 6
Tue Jan 14 17:32:58 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.006s from http://localhost:8080/example/changes?size=10485760&since=1 and our latest sequence number is now 8

CC: @jkafader @nlevitt

Delete function

POSTing a CDX file to /{collection}/delete should delete the submitted records from the index.
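Hypothetical usage once such an endpoint exists, assuming it accepts the same CDX body format as a normal indexing POST:

curl -X POST --data-binary @records-to-delete.cdx http://localhost:8080/mycollection/delete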

Make 'Private Comments' searchable [ACC3]

Currently, Web Archiving staff use the Private Comments field to record information about a particular restriction, including the name of the requestor, as required. Making the 'Private Comments' field searchable would help staff research historical decisions during the assessment process.

Requires a new search UI?

XML protocol: numreturned and numresults

We don't currently implement these fields of the xml query protocol:

  • numresults - number of total matching results (?)
  • numreturned - number of results returned (may differ from numresults due to limits)

It seems both are displayed in various places in the OpenWayback default templates. We never needed this at NLA: when we used OpenWayback we had custom templates that didn't display this information, and we don't have many archived URLs that have been captured so many times that they need pagination. Pywb's implementation of the XML protocol is based on OutbackCDX and so does not use either value.

Unfortunately implementing each of them will have some impact.

To implement numreturned we'd need to do one of:

  • buffer the results in memory which opens the door to out of memory errors on large result sets
  • move the <request> element after the <results> element in the XML; it's possible this may break compatibility with some clients
  • perform the query twice, once to count matches and a second time to stream the results

To implement numresults we'd need to count all matching records instead of stopping at the limit. This will cause a performance penalty to any query that matches more results than the limit. Prefix queries which match very large numbers of URLs will begin to have unpredictable and likely sometimes unacceptable performance.

On a positive note, there has been a feature request to return just counts instead of results, and I guess implementing numresults would achieve that when combined with a result limit of zero.

CC @kris-sigur

Warc resources not loading

I'm having trouble with a lot of resources not loading when using tinyCDXserver with OpenWayback 2.2.
So far I can query the CDX fine through OpenWayback, and it appears to load the HTML of the page, but all the subsequent calls for the resources (CSS, JS, images etc.) are not found. A sample of my OpenWayback output:

Feb 03, 2016 11:57:26 AM org.archive.wayback.resourceindex.RemoteResourceIndex$1 initialValue
SEVERE: Builder is not namespace aware.
Feb 03, 2016 11:57:26 AM org.archive.wayback.resourceindex.RemoteResourceIndex$1 initialValue
SEVERE: Builder is not namespace aware.
Feb 03, 2016 11:57:26 AM org.archive.wayback.resourceindex.RemoteResourceIndex$1 initialValue
SEVERE: Builder is not namespace aware.
Feb 03, 2016 11:57:26 AM org.archive.wayback.resourceindex.RemoteResourceIndex$1 initialValue
SEVERE: Builder is not namespace aware.
Feb 03, 2016 11:57:27 AM org.archive.wayback.webapp.AccessPoint handleReplay
INFO: LOADFAIL: Timeout: Too many retries, limited to 0
Feb 03, 2016 11:57:27 AM org.archive.wayback.webapp.AccessPoint handleReplay
WARNING: (1)LOADFAIL: Self-Redirect: No Closest Match Found /20151108060052/http://www.oversixty.co.nz/ui/css/fonts.css
Feb 03, 2016 11:57:27 AM org.archive.wayback.webapp.AccessPoint logError
WARNING: Runtime Error
org.archive.wayback.exception.ResourceNotAvailableException: Self-Redirect: No Closest Match Found
        at org.archive.wayback.webapp.AccessPoint.handleReplay(AccessPoint.java:811)
        at org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:313)
        at org.archive.wayback.util.webapp.RequestMapper.handleRequest(RequestMapper.java:198)
        at org.archive.wayback.util.webapp.RequestFilter.doFilter(RequestFilter.java:146)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:999)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:565)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:309)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I can see the resources in the CDX file, and when I switch to using the same CDX as a localcdxcollection all the resources load fine.
Two things in the log caught my eye but I'm not sure if they are just red herrings:

  • SEVERE: Builder is not namespace aware.
  • INFO: LOADFAIL: Timeout: Too many retries, limited to 0

Have you guys come across this at all? Is there any configuration that I might be missing (retry limits, namespaces)?
Any help would be much appreciated.

Support for robotsflag field based on OpenWayback behaviour

We were looking at using the <robotstxt> field, which is currently not supported by tinycdxserver, and were wondering if you'd be happy for us to submit a pull request that enables it?

The implementation in OpenWayback is rather odd, in that it populates this field using the M meta tags (AIF) field (see here). It's not clear why the meta tags field becomes the robotstxt field, but AFAICT this is the only way to populate that field via the CDX format.

It doesn't look like too difficult a change, but given that it's nearly there but commented out I thought I'd better ask if there's a problem? Presumably the indexes won't be compatible either?

Possible race condition under load

I've been running some load tests on OutbackCDX, and as indicated in this comment when I run 1000 threads (all running the same GET 100 times), I start seeing odd errors:

Premature end of chunk coded message body: closing chunk expected
Socket closed

It's rock solid at 750 threads, but at 1000 it goes wonky. The same client works fine at 1000 threads when running against an NGINX instance configured to respond as if it was OutbackCDX.

So, this seems like it might be a subtle race condition under load in OutbackCDX itself?

EDIT: Sorry, I should have mentioned that this is irrespective of the number of threads OutbackCDX is configured to use (as long as it's plenty!) and doesn't seem to be related to ulimits (which manifest themselves differently).

Are updates thread-safe?

I've been building a Hadoop indexer that sends lots of requests to tinycdxserver. The 42 WARC files in my test are processed on 14 'mappers' and according to the CDX file output, correspond to 735,856 CDX lines.

If I make the map jobs submit the CDX lines one by one, the estimated number of records in tinycdxserver comes out as 735,764. Slightly off, but close enough to be down to the estimation method.

If I submit in batches of ten CDX lines, however, I get 170,101 estimated records. In batches of five I get 232,710, and if I try again I get 224,335.

I was originally submitting in chunks of 10,000 and observed some very odd dynamics. The estimated number would go up and up and then suddenly reset to near zero. The first time the drop happened, it seemed to coincide with the L0 level turning up in the compaction stats, i.e. when this line appears:

  L0      3/0          0   0.8      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.1         0         3    0.047          0       0      0

Shown below in context:

** Compaction Stats [default] **
Level    Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt)  KeyIn KeyDrop
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0          0   0.8      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.1         0         3    0.047          0       0      0
 Sum      3/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      0.1         0         3    0.047          0       0      0
 Int      0/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000          0       0      0
Flush(GB): cumulative 0.000, interval 0.000
Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard

** DB Stats **
Uptime(secs): 1442.7 total, 910.9 interval
Cumulative writes: 83 writes, 735K keys, 83 batches, 1.0 writes per batch, ingest: 0.20 GB, 0.14 MB/s
Cumulative WAL: 83 writes, 83 syncs, 0.99 writes per sync, written: 0.20 GB, 0.14 MB/s
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.1 seconds
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 0 writes, 0 keys, 0 batches, 0.0 writes per batch, ingest: 0.00 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

Estimated number of records: 105237

It seems to be when the .log file is compacted into a table file. Before disappearing, the .log files appear to contain a lot of repetitive URLs, so maybe there's a clue there.

Better error messages for non-CDX input

e.g. posting the string 'test-integration/test1.cdx' instead of a CDX file (forgetting the @ in the curl command line) currently shows:

$ curl -X POST --data test-integration/test1.cdx  http://localhost:8080/myindex
java.lang.ArrayIndexOutOfBoundsException: 1
At line: test-integration/test1.cdx
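For reference, the intended command includes the @ so curl posts the file contents rather than the literal path (and --data-binary preserves the newlines):

curl -X POST --data-binary @test-integration/test1.cdx http://localhost:8080/myindex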

Display Rule ID in the Access Control Tool [ACC10]

Currently, each rule in the Access Control Tool (Outback CDX) is assigned a system-generated "Rule ID", which is then used to apply the rules to the entire Web Archive solr index. By making these Rule IDs visible in the Access Control Tool interface, it would be easier to maintain reliable records of action taken as a result of a takedown request.

Handling malformed URIs

We hit an odd edge case. We ended up checking OutbackCDX for some weird URIs thrown up by the crawl, like:

http://allprintjerseyyourlocalembroideryandvinylprintspecialisthomepage/
http://development-social-marketing-strategy-promote-ebola-treatment-seeking-behaviour-sierra-leone/

When querying OutbackCDX, this causes a runtime exception in IDN.toASCII:

java.lang.IllegalArgumentException: The label in the input is too long

This is because the domain label is longer than the 63 characters allowed.

Admittedly this is because we're somewhat misusing OutbackCDX as a crawl status database rather than a playback index.

That said, would it be worth returning a 400 Bad Request rather than 500? Is there a more elegant way to handle this?
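In the meantime, a crude client-side guard (just a sketch; the file names are hypothetical) is to skip any URL whose host contains a DNS label longer than 63 characters before querying:

# drop URLs with a run of 64+ non-dot characters in the host part
grep -vE '^[a-z]+://[^/]*[^./:]{64}' candidate-urls.txt > safe-urls.txt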

Enhanced audit trail for Access Control [ACC9]

With a user authentication system in place, log the creation and subsequent updates to Access Rules, specifying the username as well as a date/time stamp for each change.
[...]

At a minimum, as we discussed, could you please add usernames to the Created and Last modified dates, and then staff can add any extra notes they like in the Private Comments?

We may need to add something more sophisticated down the track to log all changes made to access rules, but that’s not considered MVP for this project.

Dashboard access token needs to be refreshed

Five minutes after sign-in, the dashboard access token expires. We need to check and refresh it on each API call.

From the KeyCloak docs:

One thing to keep in mind is that the access token by default has a short life expiration so you may need to refresh the access token prior to sending the request. You can do this by the updateToken method. The updateToken method returns a promise object which makes it easy to invoke the service only if the token was successfully refreshed and for example display an error to the user if it wasn’t. For example:

keycloak.updateToken(30).success(function() {
    loadData();
}).error(function() {
    alert('Failed to refresh token');
});

User authentication [ACC8]

Require user authentication/login to access the Access Control Tool. As a possible extension, provide both read-only and read/write access to the system, for increased security.

Handling of `+` characters in queries

Our playback system is making requests like this:

http://bigcdx.n45.wa.bl.uk:9090/data-heritrix?url=http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png

and getting no hits. This works though:

http://bigcdx.n45.wa.bl.uk:9090/data-heritrix?url=http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African%2Bwomen4.png

giving

com,blogspot,bp,3)/-w8iwj9tfz-i/utcs2d5pt-i/aaaaaaaaai4/8bcbtlsj3ti/s320/african+women4.png 20130307094428 http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png image/png 200 GRDKSKQHAA72NSYKH6UUAOAELGHBKGPW - - 0 1840918 /data/93028436/92898398/WARCS/BL-92898398-20130307094424-00000-safari.bl.uk.warc.gz

Is OutbackCDX not handling + characters correctly? Or is the client expected to handle this escaping, i.e.

http://bigcdx.n45.wa.bl.uk:9090/data-heritrix?url=http%3A%2F%2F3.bp.blogspot.com%2F-W8IWj9tFz-I%2FUTCS2D5Pt-I%2FAAAAAAAAAI4%2F8BCbTLsJ3tI%2Fs320%2FAfrican%2Bwomen4.png

(Currently pywb thinks this is an OutbackCDX problem)
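In the meantime, a client-side workaround sketch (using the same query as above) is to let curl percent-encode the whole URL so the + reaches OutbackCDX as %2B:

curl -sG 'http://bigcdx.n45.wa.bl.uk:9090/data-heritrix' \
  --data-urlencode 'url=http://3.bp.blogspot.com/-W8IWj9tFz-I/UTCS2D5Pt-I/AAAAAAAAAI4/8BCbTLsJ3tI/s320/African+women4.png'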

Upgrade rocksdb

We're quite a few releases behind now and in particular we're running into compatibility problems with newer build environments (Docker, Apple Silicon).

I believe it should be backwards compatible, but I want to do a test with a larger index to make sure there are no surprises.

Better error messages for incorrectly encoded xmlquery

/trove?q=type:urlquery+url:http://www.tisn.gov.au/Documents/CIPMA+tasking+application+form.doc unhelpfully returns
java.lang.ArrayIndexOutOfBoundsException

(The problem is that the value of the url field needs to be URL-encoded twice: once for the query string and once for OpenSearch.)
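A rough sketch of building such a request from the shell (the host below is illustrative, and the exact parameter parsing may differ): the url value is encoded once for the OpenSearch layer with jq's @uri filter, and curl's --data-urlencode then encodes the whole q value again for the query string:

raw='http://www.tisn.gov.au/Documents/CIPMA+tasking+application+form.doc'
# first encoding: the value of the url: field (OpenSearch layer)
inner=$(jq -rn --arg v "$raw" '$v|@uri')
# second encoding: curl encodes the whole q value for the query string
curl -sG 'http://localhost:8080/trove' --data-urlencode "q=type:urlquery url:$inner"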
