internetarchive / cdx-writer

This project forked from rajbot/cdx-writer

Python script to create CDX index files of WARC data

Home Page: http://archive.org

License: GNU Affero General Public License v3.0

Python 42.98% Arc 57.02%

cdx-writer's Introduction

cdx_writer.py

Python script to create CDX index files of WARC data.

Usage

Usage: cdx_writer.py [options] warc.gz

Options:

-h, --help                  show this help message and exit
--format=FORMAT             A space-separated list of fields [default: 'N b a m s k r M S V g']
--use-full-path             Use the full path of the warc file in the 'g' field
--file-prefix=FILE_PREFIX   Path prefix for warc file name in the 'g' field.
                            Useful if you are going to relocate the warc.gz file
                            after processing it.
--all-records               By default we only index http responses. Use this flag
                            to index all WARC records in the file
--screenshot-mode           Special Wayback Machine mode for handling WARCs
                            containing screenshots
--exclude-list=EXCLUDE_LIST File containing url prefixes to exclude
--stats-file=STATS_FILE     Output json file containing statistics

Output is written to stdout. The first line of output is the CDX header. This header line begins with a space so that the cdx file can be passed through sort while keeping the header at the top.
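
The leading space works because the space character (0x20) sorts before every other printable ASCII character under byte-wise ordering. A minimal sketch of that behavior, assuming the header text that corresponds to the default format and two made-up CDX lines (none of this is real output):

```python
# Illustrative only: the header text and the CDX lines below are assumed samples.
header = " CDX N b a m s k r M S V g"
records = [
    "org,example)/ 20130101000000 http://example.org/ text/html 200 - - - 512 0 a.warc.gz",
    "com,example)/ 20130101000000 http://example.com/ text/html 200 - - - 498 512 a.warc.gz",
]

# Python compares str by code point, which matches byte-wise sorting for ASCII input.
sorted_lines = sorted(records + [header])
print(sorted_lines[0] == header)
```

With the sort command, the same result generally depends on byte-wise collation (for example LC_ALL=C sort); some locales may ignore leading whitespace when comparing lines.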

Format

The supported format options are:

M meta tags (AIF) *
N massaged url
S compressed record size
V compressed arc file offset *
a original url **
b date **
g file name
k new style checksum *
m mime type of original document *
r redirect *
s response code *

* in alexa-made dat file
** in alexa-made dat file meta-data line

More information about the CDX format syntax can be found here: http://www.archive.org/web/researcher/cdx_legend.php
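
As a rough illustration of how the field letters map onto an output line, the sketch below splits one CDX line on whitespace and labels each value. The descriptive key names, the helper name parse_cdx_line, and the sample line are my own and are not part of cdx_writer.py.

```python
DEFAULT_FORMAT = "N b a m s k r M S V g"

# Descriptive labels for the field letters (the label names are assumptions).
FIELD_NAMES = {
    "N": "massaged_url", "b": "date", "a": "original_url", "m": "mime_type",
    "s": "response_code", "k": "checksum", "r": "redirect", "M": "meta_tags",
    "S": "compressed_size", "V": "offset", "g": "file_name",
}

def parse_cdx_line(line, fmt=DEFAULT_FORMAT):
    """Split one CDX line on whitespace and label each value by its field letter."""
    letters = fmt.split()
    values = line.split()
    if len(values) != len(letters):
        raise ValueError("expected %d fields, got %d" % (len(letters), len(values)))
    return {FIELD_NAMES.get(letter, letter): value
            for letter, value in zip(letters, values)}

# Made-up sample line in the default field order.
sample = ("org,example)/ 20130101000000 http://example.org/ text/html 200 "
          "ABC123 - - 512 0 a.warc.gz")
record = parse_cdx_line(sample)
print(record["original_url"], record["offset"])
```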

Installation

Unfortunately, this script is not properly packaged and cannot be installed via pip. See the .travis.yml file for hints on how to get it running.

Differences between cdx_writer.py and archive-access cdx files

The CDX files produced by archive-access and those produced by cdx_writer.py differ in these cases:

Differences in SURTs:

archive-access doesn't encode the %7F character in SURTs

Differences in MIME Type:

archive-access does not parse mime type for large warc payloads, and just returns 'unk'
If the HTTP Content-Type header is sent with a blank value, archive-access returns the value of the previous header as the mime type; cdx_writer.py returns 'unk' in this case. Example WARC record (for which archive-access returns "close" as the mime type): ...Content-Length: 0\r\nConnection: close\r\nContent-Type: \r\n\r\n\r\n\r\n
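
The blank Content-Type case can be sketched as follows. This is not the code from cdx_writer.py: the function name, the (name, value) header representation, and the stripping of parameters such as charset are assumptions made for illustration. Only the 'unk' fallback comes from the description above.

```python
def mime_type_from_headers(headers):
    """Return the mime type from (name, value) header pairs, or 'unk' if absent or blank."""
    for name, value in headers:
        if name.lower() == "content-type":
            # Drop parameters such as "; charset=utf-8" and surrounding whitespace.
            mime = value.split(";", 1)[0].strip().lower()
            return mime or "unk"
    return "unk"

# Headers from the example record above: Content-Type is present but blank.
headers = [("Content-Length", "0"), ("Connection", "close"), ("Content-Type", "")]
print(mime_type_from_headers(headers))
```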

Differences in Redirect urls:

archive-access does not escape whitespace; cdx_writer.py uses %20 escaping so that these files can be split on whitespace.
archive-access removes unicode characters from redirect urls; cdx_writer.py keeps them
archive-access does not decode html entities in redirect urls
archive-access sometimes does not turn relative URLs into absolute urls
archive-access sometimes does not remove /../ from redirect urls
archive-access uses the value from the previous HTTP header for the redirect url if the location header is empty
cdx_writer.py only looks for http-equiv=refresh meta tag inside HEAD element
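
Three of the redirect behaviors listed above (absolute URLs, removal of /../ segments, and %20 escaping of whitespace) can be sketched with the standard library. The helper name normalize_redirect and the sample URLs are hypothetical, and cdx_writer.py may implement these steps differently. HTML entity decoding and unicode handling are not covered here.

```python
import re
from urllib.parse import urljoin

def normalize_redirect(base_url, location):
    """Resolve a redirect target against the page URL and escape whitespace as %20."""
    # urljoin makes relative targets absolute and collapses ../ segments.
    absolute = urljoin(base_url, location.strip())
    # Escape remaining whitespace so the CDX line can still be split on spaces.
    return re.sub(r"\s", "%20", absolute)

print(normalize_redirect("http://example.com/a/b.html", "../c/new page.html"))
```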

Differences in Meta Tags:

cdx_writer.py only looks for meta tags in the HEAD element
archive-access version doesn't parse multiple html meta tags, only the first one
archive-access misses FI meta tags sometimes
cdx_writer.py always returns tags in A, F, I order. archive-access does not use a consistent order
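
A fixed A, F, I order falls out naturally if the letters are collected into a set and sorted. The sketch below assumes that A, F and I stand for the robots directives noarchive, nofollow and noindex, and that '-' marks the absence of tags; both are assumptions rather than facts stated in this README, and the function name is hypothetical.

```python
# Assumed mapping from robots directives to CDX meta-tag letters.
DIRECTIVE_TO_LETTER = {"noarchive": "A", "nofollow": "F", "noindex": "I"}

def meta_tags_field(robots_contents):
    """Combine the content values of all robots meta tags into a sorted letter string."""
    letters = set()
    for content in robots_contents:
        for directive in content.split(","):
            letter = DIRECTIVE_TO_LETTER.get(directive.strip().lower())
            if letter:
                letters.add(letter)
    # Sorting yields the fixed A, F, I order; '-' for "no tags" is an assumption.
    return "".join(sorted(letters)) or "-"

# Two meta tags on the same page, in arbitrary order and case.
print(meta_tags_field(["noindex, nofollow", "NOARCHIVE"]))
```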

Differences in HTTP Response Codes:

archive-access returns response code 0 if HTTP header line contains unicode: HTTP/1.1 302 D\xe9plac\xe9 Temporairement\r\n...
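
One way to stay robust against non-ASCII reason phrases is to parse the status line as bytes and look only at the second token. This is a sketch under that assumption, not the parser used by either tool; the function name and the '-' fallback are hypothetical.

```python
def response_code(status_line):
    """Extract the status code from a raw HTTP status line given as bytes."""
    # Split into version, code and (optional) reason phrase on ASCII whitespace.
    parts = status_line.split(None, 2)
    if len(parts) >= 2 and parts[1].isdigit():
        return parts[1].decode("ascii")
    return "-"  # assumed placeholder when no numeric code is found

print(response_code(b"HTTP/1.1 302 D\xe9plac\xe9 Temporairement\r\n"))
```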

cdx-writer's People

Contributors

Stargazers

Watchers

Forkers

alard jcushman kngenie nlevitt rlugojr arkiver2 vbanos ldko galgeek dvanduzer cclauss openaccess

cdx-writer's Issues

cdx_writer.py timeout when large amounts of URI's present in warc

There are currently 71 tasks that have timed out (that is, they did not finish within ~76k seconds) due to the large number of URIs in the megawarc.

These can be found in the tinypic collection from archiveteam.

Example tasks:
warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76800 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830091905_c83d08f5.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830091905_c83d08f5' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830091905_c83d08f5/cdxstats.json'> '/t/_archiveteam_tinypic_20190830091905_c83d08f5/cdx.txt' failed with exit code: 124, but told to continue on...

warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76764 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830120442_36ec361d.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830120442_36ec361d' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830120442_36ec361d/cdxstats.json'> '/t/_archiveteam_tinypic_20190830120442_36ec361d/cdx.txt' failed with exit code: 124, but told to continue on...
And 69 more

support idx, for 2 level indexes

To deal with large indexes (where sorted cdx data cannot all fit in core), an idx file is produced, containing one line for each chunk of (typically 3000) cdx lines. Each idx record is: the surt + ' ' + date (or cdx key), the name of the cdx file the record refers to, the offset in the file where the key is located, and the length of the gzip record all tab separated, one record per line.

Roughly, the following scheme is employed, starting with a sorted cdx_lines_source:

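import gzip
import io
import itertools

cdx_block_line_count = 3000

# cdx_lines_source must be an iterator over sorted cdx lines (str, no trailing newline),
# e.g. iter(sorted_lines); idx_output, cdx_output and cdx_filename are supplied by the caller.
with open(idx_output, 'w') as idx, open(cdx_output, 'wb') as cdx:
    while True:
        chunk = list(itertools.islice(cdx_lines_source, cdx_block_line_count))
        if not chunk:
            break
        # compress each chunk as an independent gzip member
        b = io.BytesIO()
        with gzip.GzipFile(fileobj=b, mode='wb') as g:
            g.write(('\n'.join(chunk) + '\n').encode('utf-8'))
        z = b.getvalue()
        # the cdx key is the first two fields of the chunk's first line: surt and date
        cdx_key = ' '.join(chunk[0].split(' ')[0:2])
        idx.write('%s\t%s\t%d\t%d\n' % (cdx_key, cdx_filename, cdx.tell(), len(z)))
        cdx.write(z)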

Ideally, the cdx writer could write cdx files with these handy idx index files on the side. Also, would be wonderful if we had a cdx editor which could, using the idx file, allow records in a cdx file to be edited (as long as the gzip block containing the record didn't grow).
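
To show why the idx file is handy, here is a minimal in-memory sketch of the read side: a binary search over the idx keys picks one gzip block, and only that block is decompressed. The helper names build and lookup, the sample lines, and the omission of the file-name column are assumptions for illustration. The sketch also assumes that a given key does not straddle a block boundary.

```python
import bisect
import gzip
import io

def cdx_key(line):
    """Key of a cdx line: its first two fields (surt and date)."""
    return " ".join(line.split(" ")[:2])

def build(lines, block_size):
    """Write lines as concatenated gzip members; return (cdx_bytes, idx_entries)."""
    cdx = io.BytesIO()
    idx = []  # (key, offset, length) per block; the file-name column is omitted here
    for start in range(0, len(lines), block_size):
        chunk = lines[start:start + block_size]
        z = gzip.compress(("\n".join(chunk) + "\n").encode("utf-8"))
        idx.append((cdx_key(chunk[0]), cdx.tell(), len(z)))
        cdx.write(z)
    return cdx.getvalue(), idx

def lookup(cdx_bytes, idx, key):
    """Return the cdx lines whose key equals the given 'surt date' key."""
    keys = [entry[0] for entry in idx]
    # The candidate block is the last one whose first key is <= the wanted key.
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return []
    _, offset, length = idx[pos]
    block = gzip.decompress(cdx_bytes[offset:offset + length]).decode("utf-8")
    return [line for line in block.splitlines() if cdx_key(line) == key]

# Made-up, already sorted sample lines (truncated to three fields for brevity).
lines = [
    "com,example)/ 20130101000000 http://example.com/",
    "com,example)/a 20130101000000 http://example.com/a",
    "org,example)/ 20130101000000 http://example.org/",
    "org,example)/b 20130101000000 http://example.org/b",
    "org,example)/c 20130101000000 http://example.org/c",
]
cdx_bytes, idx = build(lines, block_size=2)
print(lookup(cdx_bytes, idx, "org,example)/b 20130101000000"))
```

Because each block is an independent gzip member, a record can be replaced in place only if the recompressed block is no longer than the original, which matches the editing constraint mentioned above.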

Possibility of writing a file name without any path in `g` field?

If you instantiate a CDX_Writer with a file argument that contains a relative or absolute path, and you use the defaults use_full_path=False and file_prefix=None, the g field is written with the relative or absolute path exactly as given in file. Is that the intended behavior, or is it meant to write only the file name, as you would get with something like self.warc_path = os.path.basename(file)?
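
For reference, the difference between the two behaviors in question looks like this with a made-up path (the path is an assumption, and this is not code from cdx_writer.py):

```python
import os.path

path = "crawl/data/example.warc.gz"  # hypothetical file argument

as_given = path                      # what the issue reports is written today
name_only = os.path.basename(path)   # what os.path.basename(file) would write

print(as_given)
print(name_only)
```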

Field names should be case-insensitive

Also reported here: https://bitbucket.org/rajbot/warc-tools/issue/1 . I'm creating this issue here on GitHub so others may know about this issue as well.

According to the WARC ISO 28500 Version 1 Latest Draft, Section 4, field names should be case-insensitive, i.e., Warc-Type should be treated the same as WARC-Type. Without case-insensitivity, record.type returns None if the field name does not match WARC-Type exactly.
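
A minimal sketch of one possible fix is to normalize field names on both storage and lookup. The class below is hypothetical and is not the warc-tools API; it only illustrates the idea.

```python
class CaseInsensitiveHeaders:
    """Hypothetical container that stores WARC header fields under lowercased names."""

    def __init__(self, pairs):
        # Lowercase once on the way in so that lookups can ignore case.
        self._fields = {name.lower(): value for name, value in pairs}

    def get(self, name, default=None):
        return self._fields.get(name.lower(), default)

headers = CaseInsensitiveHeaders([("Warc-Type", "response")])
print(headers.get("WARC-Type"))
```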

cdx_writer.py hangs on long meta tag

I found a case where cdx_writer.py never finishes (at least not within a reasonable timeout).
cdx_writer.py is stuck in re.match while extracting attribute values from a <meta name="description" content="..." /> tag whose content attribute value has 360,337 chars.

The WARC record can be found in NO404-WKP-20131104215558-crawl345/NO404-WKP-20131104222227-08103.warc.gz, at offset 124201734.

Content-Type should not be compared as a fixed string.

I noticed that there is a comparison to get response records:

'application/http; msgtype=response' == record.content_type

According to the spec, leaving out the space is perfectly valid: application/http;msgtype=response. Wget uses this format. Although CDX files are generated, I'm concerned that future use may not work if something changes.
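
A more tolerant check could parse the media type and its msgtype parameter instead of comparing the whole string. The function below is a sketch under that assumption; its name is hypothetical and it is not the check used in cdx_writer.py.

```python
def is_http_response(content_type):
    """Return True for 'application/http' with msgtype=response, ignoring spacing and case."""
    if not content_type:
        return False
    media_type, _, params = content_type.partition(";")
    if media_type.strip().lower() != "application/http":
        return False
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "msgtype":
            return value.strip().strip('"').lower() == "response"
    return False

print(is_http_response("application/http; msgtype=response"))
print(is_http_response("application/http;msgtype=response"))
```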