internetarchive / cdx-writer

This project forked from rajbot/cdx-writer

Python script to create CDX index files of WARC data

Home Page: http://archive.org

License: GNU Affero General Public License v3.0

Python 42.98% Arc 57.02%

cdx-writer's Introduction

cdx_writer.py

Python script to create CDX index files of WARC data.

Usage

Usage: cdx_writer.py [options] warc.gz

Options:

-h, --help                  show this help message and exit
--format=FORMAT             A space-separated list of fields [default: 'N b a m s k r M S V g']
--use-full-path             Use the full path of the warc file in the 'g' field
--file-prefix=FILE_PREFIX   Path prefix for warc file name in the 'g' field.
                            Useful if you are going to relocate the warc.gz file
                            after processing it.
--all-records               By default we only index http responses. Use this flag
                            to index all WARC records in the file
--screenshot-mode           Special Wayback Machine mode for handling WARCs
                            containing screenshots
--exclude-list=EXCLUDE_LIST File containing url prefixes to exclude
--stats-file=STATS_FILE     Output json file containing statistics

Output is written to stdout. The first line of output is the CDX header. This header line begins with a space so that the cdx file can be passed through sort while keeping the header at the top.
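
The effect of the leading space can be checked with a short sketch. It assumes plain byte-wise ordering (as with sort under the C locale), and the sample records are made up for illustration.

```python
# Hypothetical CDX lines; only the leading space on the header matters here.
header = " CDX N b a m s k r M S V g"
records = [
    "org,archive)/ 20130101000000 http://archive.org/ text/html 200 - - - 100 0 a.warc.gz",
    "com,example)/ 20130101000000 http://example.com/ text/html 200 - - - 100 100 a.warc.gz",
]

# Sorting the header together with the records keeps the header first,
# because the space character (0x20) orders before letters and digits.
sorted_lines = sorted(records + [header])
print(sorted_lines[0])
```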

Format

The supported format options are:

M meta tags (AIF) *
N massaged url
S compressed record size
V compressed arc file offset *
a original url **
b date **
g file name
k new style checksum *
m mime type of original document *
r redirect *
s response code *

* in alexa-made dat file
** in alexa-made dat file meta-data line

More information about the CDX format syntax can be found here: http://www.archive.org/web/researcher/cdx_legend.php
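
As a rough illustration of how the field letters are used, the sketch below reads the header line and maps each record to a dict keyed by those letters. The function name parse_cdx and the sample record are assumptions made for this example, not part of cdx_writer.py.

```python
def parse_cdx(lines):
    """Yield one dict per CDX record, keyed by the field letters in the header."""
    lines = iter(lines)
    header = next(lines).split()
    if header[0] != "CDX":
        raise ValueError("expected a CDX header line")
    fields = header[1:]
    for line in lines:
        # Fields are separated by single spaces; embedded whitespace is escaped.
        values = line.rstrip("\n").split(" ")
        yield dict(zip(fields, values))


# Made-up sample in the default 'N b a m s k r M S V g' format.
sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20130101000000 http://example.com/ text/html 200 - - - 100 0 a.warc.gz",
]
rows = list(parse_cdx(sample))
print(rows[0]["a"], rows[0]["s"])
```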

Installation

Unfortunately, this script is not properly packaged and cannot be installed via pip. See the .travis.yml file for hints on how to get it running.

Differences between cdx_writer.py and archive-access cdx files

The CDX files produced by archive-access and those produced by cdx_writer.py differ in these cases:

Differences in SURTs:

  • archive-access doesn't encode the %7F character in SURTs

Differences in MIME Type:

  • archive-access does not parse mime type for large warc payloads, and just returns 'unk'
  • If the HTTP Content-Type header is sent with a blank value, archive-access returns the value of the previous header as the mime type. cdx_writer.py returns 'unk' in this case. Example WARC Record (returns "close" as the mime type): ...Content-Length: 0\r\nConnection: close\r\nContent-Type: \r\n\r\n\r\n\r\n
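
The blank Content-Type case can be sketched as follows. This is a minimal illustration of the behaviour described above, not the code in cdx_writer.py; the helper name mime_type and the (name, value) pair representation are assumptions.

```python
def mime_type(headers):
    """Return the mime type from (name, value) header pairs, or 'unk'."""
    for name, value in headers:
        if name.lower() == "content-type":
            # Drop parameters such as "; charset=utf-8" and surrounding whitespace.
            mime = value.split(";", 1)[0].strip()
            # A blank value yields 'unk' rather than leaking a neighbouring header.
            return mime or "unk"
    return "unk"


headers = [("Content-Length", "0"), ("Connection", "close"), ("Content-Type", "")]
print(mime_type(headers))
```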

Differences in Redirect urls:

  • archive-access does not escape whitespace; cdx_writer.py uses %20 escaping so that these files can be split on whitespace
  • archive-access removes unicode characters from redirect urls; cdx_writer.py keeps them
  • archive-access does not decode html entities in redirect urls
  • archive-access sometimes does not turn relative URLs into absolute urls
  • archive-access sometimes does not remove /../ from redirect urls
  • archive-access uses the value from the previous HTTP header for the redirect url if the location header is empty
  • cdx_writer.py only looks for http-equiv=refresh meta tag inside HEAD element
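
Several of the redirect rules above (entity decoding, making relative urls absolute, collapsing /../, and %20 escaping) can be approximated with the standard library. The helper below is a hypothetical sketch, not the function used by cdx_writer.py, and returning '-' for an empty Location value is an assumption.

```python
import html
from urllib.parse import urljoin


def normalize_redirect(base_url, location):
    """Hypothetical helper: resolve a redirect target against the page url."""
    if not location.strip():
        return "-"  # assumption: '-' marks an empty field
    target = html.unescape(location.strip())  # decode entities such as &amp;
    target = urljoin(base_url, target)        # make absolute, collapse ../ segments
    return target.replace(" ", "%20")         # keep the field free of spaces


print(normalize_redirect("http://example.com/a/b/page.html", "../c d.html?x=1&amp;y=2"))
```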

Differences in Meta Tags:

  • cdx_writer.py only looks for meta tags in the HEAD element
  • archive-access version doesn't parse multiple html meta tags, only the first one
  • archive-access misses FI meta tags sometimes
  • cdx_writer.py always returns tags in A, F, I order. archive-access does not use a consistent order
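
A consistent A, F, I order falls out naturally if the flags are collected into a set and sorted. The sketch below assumes that A, F and I stand for the robots directives noarchive, nofollow and noindex, and that '-' is written when none is present; both are assumptions for illustration.

```python
# Assumed mapping from robots directives to CDX 'M' field letters.
ROBOTS_FLAGS = {"noarchive": "A", "nofollow": "F", "noindex": "I"}


def meta_tags_field(robots_contents):
    """Combine the content values of all robots meta tags into one field."""
    found = set()
    for content in robots_contents:
        for directive in content.lower().split(","):
            letter = ROBOTS_FLAGS.get(directive.strip())
            if letter:
                found.add(letter)
    # sorted() gives A, F, I order because the letters are alphabetical.
    return "".join(sorted(found)) or "-"


print(meta_tags_field(["noindex, nofollow", "NOARCHIVE"]))
```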

Differences in HTTP Response Codes:

  • archive-access returns response code 0 if HTTP header line contains unicode: HTTP/1.1 302 D\xe9plac\xe9 Temporairement\r\n...
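
One way to stay robust against non-ASCII reason phrases is to match the status line as bytes and never decode the reason phrase. This is a sketch under that assumption, not the parser used by either tool.

```python
import re

# Match on raw bytes so a non-ASCII reason phrase cannot cause a decode error.
STATUS_LINE = re.compile(rb"HTTP/\d+\.\d+\s+(\d{3})\b")


def response_code(status_line):
    """Return the status code from a raw status line, or None if absent."""
    m = STATUS_LINE.match(status_line)
    return int(m.group(1)) if m else None


print(response_code(b"HTTP/1.1 302 D\xe9plac\xe9 Temporairement\r\n"))
```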

cdx-writer's People

Contributors

dvanduzer, galgeek, jcushman, kngenie, nlevitt, rajbot

cdx-writer's Issues

cdx_writer.py times out when a WARC contains a large number of URIs

There are currently 71 tasks that have timed out (they did not finish within ~76k seconds) because the megawarc contains a large number of URIs.

These can be found in the tinypic collection from archiveteam.

Example tasks:
warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76800 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830091905_c83d08f5.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830091905_c83d08f5' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830091905_c83d08f5/cdxstats.json'> '/t/_archiveteam_tinypic_20190830091905_c83d08f5/cdx.txt' failed with exit code: 124, but told to continue on...

warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76764 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830120442_36ec361d.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830120442_36ec361d' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830120442_36ec361d/cdxstats.json'> '/t/_archiveteam_tinypic_20190830120442_36ec361d/cdx.txt' failed with exit code: 124, but told to continue on...
And 69 more

support idx, for 2 level indexes

To deal with large indexes (where sorted cdx data cannot all fit in core), an idx file is produced, containing one line for each chunk of (typically 3000) cdx lines. Each idx record is: the surt + ' ' + date (or cdx key), the name of the cdx file the record refers to, the offset in the file where the key is located, and the length of the gzip record all tab separated, one record per line.

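Roughly, the following scheme is employed, starting with a sorted cdx_lines_source:

import gzip
import io
import itertools

cdx_block_line_count = 3000

# islice only advances through the source if it is an iterator, not a list
cdx_lines_source = iter(cdx_lines_source)

with open(idx_output, 'wb') as idx, open(cdx_output, 'wb') as cdx:
    while True:
        chunk = list(itertools.islice(cdx_lines_source, cdx_block_line_count))
        if not chunk:
            break  # source exhausted
        # compress each chunk as an independent gzip member
        b = io.BytesIO()
        with gzip.GzipFile(fileobj=b, mode='wb') as g:
            g.write(('\n'.join(chunk) + '\n').encode('utf-8'))
        z = b.getvalue()
        # the key is the first two fields of the first line: surt and date
        cdx_key = ' '.join(chunk[0].split(' ')[0:2])
        record = '%s\t%s\t%d\t%d\n' % (cdx_key, cdx_filename, cdx.tell(), len(z))
        idx.write(record.encode('utf-8'))
        cdx.write(z)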
Ideally, the cdx writer could write cdx files with these handy idx index files on the side. It would also be wonderful to have a cdx editor that could use the idx file to edit records in a cdx file (as long as the gzip block containing the record didn't grow).
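
For the reading side, a single block can be fetched with one seek and one read, using the offset and length stored in the idx record. The sketch below assumes the tab-separated idx layout described above; the function name read_block and the in-memory sample data are made up for illustration.

```python
import gzip
import io


def read_block(cdx_file, idx_line):
    """Return the cdx lines of the gzip block described by one idx record."""
    cdx_key, cdx_filename, offset, length = idx_line.rstrip("\n").split("\t")
    cdx_file.seek(int(offset))
    compressed = cdx_file.read(int(length))
    return gzip.decompress(compressed).decode("utf-8").splitlines()


# Usage with an in-memory file that holds two independent gzip members.
block_a = gzip.compress(b"com,a)/ 20130101000000 x\n")
block_b = gzip.compress(b"com,b)/ 20130101000000 y\n")
cdx_file = io.BytesIO(block_a + block_b)
idx_line = "com,b)/ 20130101000000\texample.cdx.gz\t%d\t%d\n" % (len(block_a), len(block_b))
lines = read_block(cdx_file, idx_line)
print(lines)
```

Because each block is a separate gzip member, only the requested block is decompressed.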

Field names should be case-insensitive

Also reported here: https://bitbucket.org/rajbot/warc-tools/issue/1 . I'm creating this issue here on GitHub so others may know about this issue as well.

According to the WARC ISO 28500 Version 1 Latest Draft, Section 4, field names should be case-insensitive, i.e., Warc-Type should be treated the same as WARC-Type. Without case-insensitivity, record.type returns None if the field name doesn't match WARC-Type exactly, for example.
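
A simple way to get case-insensitive lookups is to normalise field names when the header is parsed. The sketch below shows the idea with a hypothetical parse_warc_headers helper; it is not the warc-tools API.

```python
def parse_warc_headers(lines):
    """Map lower-cased field names to values from 'Name: value' lines."""
    headers = {}
    for line in lines:
        # Split on the first colon only, since values such as urls contain colons.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


headers = parse_warc_headers([
    "Warc-Type: response",
    "WARC-Target-URI: http://example.com/",
])
print(headers.get("warc-type"))
```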

cdx_writer.py hangs on long meta tag

I found a case where cdx_writer.py never finishes (at least not within a reasonable timeout).
cdx_writer.py is stuck at re.match while extracting attribute values from a <meta name="description" content="..." /> whose content attribute value has 360,337 chars.

The WARC record can be found in NO404-WKP-20131104215558-crawl345/NO404-WKP-20131104222227-08103.warc.gz, at offset 124201734.
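
One possible way around regex backtracking on very long attribute values is to let an HTML parser extract the attributes. The sketch below uses the standard library html.parser as a swapped-in technique; it is not how cdx_writer.py parses meta tags, and the class name MetaCollector is made up.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect attribute dicts for <meta> tags seen before </head>."""

    def __init__(self):
        super().__init__()
        self.metas = []
        self.head_done = False

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag == "meta" and not self.head_done:
            self.metas.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "head":
            self.head_done = True


# A content value as long as the one reported above.
long_value = "x" * 360337
collector = MetaCollector()
collector.feed('<head><meta name="description" content="%s" /></head>' % long_value)
print(len(collector.metas[0]["content"]))
```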

Content-Type should not be compared as a fixed string.

I noticed that there is a comparison to get response records:

'application/http; msgtype=response' == record.content_type

According to the spec, leaving out the space is perfectly valid: application/http;msgtype=response. Wget uses this format. Although CDX files are generated, I'm concerned that future use may not work if something changes.
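
A more tolerant check could parse the media type and its parameters instead of comparing the whole string. The helper names below are hypothetical, and lower-casing the parameter value is an assumption made for this sketch.

```python
def parse_content_type(value):
    """Split a Content-Type value into (media_type, params), ignoring optional whitespace."""
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        key, sep, val = part.partition("=")
        if sep:
            params[key.strip().lower()] = val.strip().strip('"').lower()
    return media_type, params


def is_http_response(content_type):
    """True for application/http records whose msgtype parameter is 'response'."""
    media_type, params = parse_content_type(content_type)
    return media_type == "application/http" and params.get("msgtype") == "response"


print(is_http_response("application/http;msgtype=response"))
print(is_http_response("application/http; msgtype=response"))
```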
