DRAINTASKER - clears filling disks
================================================================

* Draintasker wiki:
  https://webarchive.jira.com/wiki/display/WEBOPS/Draintasker
* Draintasker bugs:
  https://launchpad.net/archivewidecrawl/+bugs

Draintasker supports "draining" a running crawler along two paths:

  1) dtmon: IAS3-to-petabox (paired storage)
  2) th-dtmon: catalog-direct-to-thumpers (Santa Clara MD)

run like this:

  $ ssh home
  $ screen
  $ ssh -A crawler
  $ cd /path/draintasker
  $ svn up
  $ emacs dtmon.yml   # edit, then save as /path/drain.yml
  $ dtmon.py /path/drain.yml | tee -a /path/drain.log

PROCESSING 

  monitor job and drain (with dtmon.py) - while DRAINME file exists,
  pack warcs (PACKED), make manifests (MANIFEST), launch transfers
  (TASK), verify transfers (TOMBSTONE), and finally, delete verified
  warcs, then sleep before trying again.
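The loop above can be sketched as follows. This is an illustrative
outline, not dtmon.py's actual code: the `stages` callables are
hypothetical stand-ins for the shell scripts draintasker drives, and
the default sleep interval is an assumption.

```python
import os
import time

def drain_cycle(job_dir, stages, sleep_seconds=300):
    """Run each drain stage in order while {job_dir}/DRAINME exists.

    `stages` is an ordered list of callables standing in for
    pack-warcs.sh, make-manifests.sh, s3-launch-transfers.sh and
    delete-verified-warcs.sh (hypothetical hooks, for illustration).
    """
    while os.path.exists(os.path.join(job_dir, "DRAINME")):
        for stage in stages:
            stage()
        time.sleep(sleep_seconds)
```

Removing the DRAINME file is the operator's way of stopping the drain
cleanly at the next loop boundary.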

under-the-hood:

  dtmon.py config
  |
  '-> s3-drain-job job_dir xfer_job_dir max_size warc_naming
     |
     '-> pack-warcs.sh             => PACKED
     '-> make-manifests.sh         => MANIFEST
     '-> s3-launch-transfers.sh    => LAUNCH TASK [RETRY] SUCCESS TOMBSTONE
     '-> delete-verified-warcs.sh  => poof, no w/arcs!
  
  th-dtmon.sh config
  |
  '-> drain-job job_dir xfer_job_dir thumper max_size warc_naming
      |
      '-> pack-warcs.sh            => PACKED
      '-> make-manifests.sh        => MANIFEST
      '-> launch-transfers.sh      => LAUNCH, TASK
      '-> verify-transfers.sh      => SUCCESS, TOMBSTONE
      '-> delete-verified-warcs.sh => poof, no w/arcs!

get status of prerequisites and disk capacity like this:

  $ get-status.sh crawldata_dir xfer_dir

some advice:

  1) if there are old draintasker procs, kill them.
  2) if files are in the way, investigate and move them aside,
     eg mv LAUNCH.open LAUNCH.1; mv ERROR ERROR.1
        (it is good to number each failure/error file)
  3) check the status of your disks
     ./get-status.sh
  4) (optional) test petabox-to-thumper path on single series
     ./launch-transfers.sh 
  5) log into home and open a screen session
  6) in screen, ssh crawler, cd /path/draintasker/, svn up
  7) run dtmon.py to continuously drain each job+disk
     [screen]
       cd /path/draintasker
       ./dtmon.py /path/disk1.yml
     [screen]
       cd /path/draintasker
       ./dtmon.py /path/disk3.yml

CONFIGURATION

directory structure

  crawldata     /{1,3}/crawling
  rsync_path    /{1,3}/incoming
  job_dir       /{crawldata}/{job_name}
  xfer_job_dir  /{rsync_path}/{job_name}
  warc_series   {xfer_job_dir}/{warc_series}

depending on config, your warcs might be written in e.g.

  /1/crawling/{crawljob}/warcs
  /3/crawling/{crawljob}/warcs

and be "packed" into 

  /1/incoming/{crawljob}/{warc_series}/MANIFEST
  /3/incoming/{crawljob}/{warc_series}/MANIFEST
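Given the directory structure above, the mapping from a crawl job to
its packing locations can be sketched like this (an illustration of
the layout, not draintasker's actual code; the function name is
hypothetical):

```python
import os

def xfer_paths(rsync_path, crawldata, job_name, warc_series):
    """Map one crawl job onto the source/packing layout shown above."""
    job_dir = os.path.join(crawldata, job_name)           # source warcs
    xfer_job_dir = os.path.join(rsync_path, job_name)     # packing/rsync area
    series_dir = os.path.join(xfer_job_dir, warc_series)  # one packed series
    return job_dir, xfer_job_dir, series_dir
```

On disk 1, for example, a job named {crawljob} would pack series into
/1/incoming/{crawljob}/{warc_series}/ as shown above.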
    
DEPENDENCIES

  dtmon.py (IAS3-to-petabox)
    + HOME/.ias3cfg (when using dtmon.py)
    + add [incoming_x] stanzas to /etc/rsyncd.conf (see wiki)
  th-dtmon.sh (catalog-direct-to-thumper)
    + ~/.wgetrc with your archive.org user cookies (see wiki)
    + ensure petabox user exists: /home-local/petabox
    + PETABOX_HOME=/home/user/petabox (codebase from svn)
    + get petabox authorized_keys from "draintasking" crawler
      @crawling08:~$ scp /home-local/petabox/.ssh/authorized_keys \
      root@ia400131:/home-local/petabox/.ssh/authorized_keys
    + add [incoming_x] stanzas to /etc/rsyncd.conf (see wiki)

PREREQUISITES

  DRAINME       {job_dir}/DRAINME
  FINISH_DRAIN  {job_dir}/FINISH_DRAIN
  PACKED        {warc_series}/PACKED
  MANIFEST      {warc_series}/MANIFEST
  LAUNCH        {warc_series}/LAUNCH
  TASK          {warc_series}/TASK
  TOMBSTONE     {warc_series}/TOMBSTONE
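The per-series prerequisite files above act as a simple state machine
on disk; a status report along the lines of what get-status.sh prints
can be sketched like this (hypothetical helper, for illustration):

```python
import os

# per-series prerequisite files, in processing order
SERIES_PREREQS = ["PACKED", "MANIFEST", "LAUNCH", "TASK", "TOMBSTONE"]

def series_status(warc_series_dir):
    """Report which per-series prerequisite files are present."""
    return {name: os.path.exists(os.path.join(warc_series_dir, name))
            for name in SERIES_PREREQS}
```

A series with PACKED and MANIFEST but no TOMBSTONE, for instance, has
been packed and checksummed but not yet verified at the far end.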

if you see a RETRY file, eg RETRY.1284217446, the suffix is the epoch
time when a non-blocking retry was scheduled; if this file exists,
the retry was attempted at some time after that. you can get the
human-readable form of that time with the date cmd, like so:

  date -d @1284217446
  Sat Sep 11 15:04:06 UTC 2010
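The same conversion in Python (UTC, matching the date output above):

```python
from datetime import datetime, timezone

def retry_time(suffix):
    """Convert a RETRY.<epoch> suffix to a human-readable UTC timestamp."""
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)

print(retry_time("1284217446"))  # 2010-09-11 15:04:06+00:00
```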

DRAIN DAEMON

  dtmon.py      run s3-drain-job periodically
  th-dtmon.sh   run drain-job periodically
  drain-job.sh  run draintasker processes in single mode

DRAIN PROCESSING

  delete-verified-warcs.sh  delete original (verified) w/arcs from each series 
  get-remote-warc-urls.sh   report remote md5 and url for all filesxml in series 
  item-submit-task.sh       submit catalog task for series
  item-verify-download.sh   wget remote w/arc and verify checksum for series 
  item-verify-size.sh       verify remote size of w/arc series
  launch-transfers.sh       submit transfer tasks for series
  make-manifests.sh         compute md5s into series MANIFEST
  pack-warcs.sh             create warc series when available
  s3-launch-transfers.sh    invoke curl for series
  task-check-success.sh     check and report task success by task_id
  verify-transfers.sh       run task-check-success and item-verify for series 

UTILS

  get-status.sh              report dtmons, prerequisites and disk usage 

  addup-warcs.sh             report count and total size of warcs
  bundle-crawl-artifacts.sh  make tarball of crawldata for permastorage 
  check-crawldata-staged.sh  report staged crawldata file count+size
  check-crawldata.sh         report source crawldata file count+size
  copy-crawldata.sh          copy all crawldata preserving dir structure 
  make-and-store-bundle.sh   make bundles and scp to staging

----
siznax 2010

draintasker's People

Contributors: cclauss, corentinb, kngenie

draintasker's Issues

Undefined name `p` in admin.py

Undefined names have the potential to raise NameError at runtime.

% python2 -m flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

./admin.py:63:37: F821 undefined name 'p'
            self.write(dict(ok=0, s=p.s, error=str(ex)))
                                    ^
./admin.py:84:38: F821 undefined name 'p'
        self.write(dict(pj=pj.id, sw=p.sw, ok=1))
                                     ^
2     F821 undefined name 'p'
2

S3 failure when item not yet created

@kngenie: Draintasker is trying to upload the second file to the item just created, but the item hadn't been created yet.

ERROR: S3 PUT failed with response_code: 404 at 2020-03-26T22:48:22PDT
BLOCK: sleep for 120 seconds...

@kngenie: thinking now, draintasker could just keep uploading files with "create item" option, rather than being cautious

Scripts cannot be run on vanilla BusyBox distros such as Alpine

Description of the issue

pack-warcs.sh depends on the GNU findutils package and thus cannot be run on systems or Docker containers without findutils, because its find regex is not correctly interpreted by BusyBox find.

Troubleshoot and findings

Troubleshoot and findings

  1. Running an Alpine Docker container with a correct dtmon.cfg configuration and running draintasker/pack-warcs.sh /crawls/dtmon.cfg 1 single with the BusyBox find binary gives this output:
  job_dir          = /crawls/warcs
  xfer_home        = /crawls/sink
  warc_naming      = {prefix}-{timestamp}-{serial}-{host}
  item_naming      = {prefix}-{timestamp14}{suffix}-{shost}
  max_series_size  = 10737418240 (10GB)
  total_num_warcs  = 0
  total_size_warcs = 0
  FINISH_DRAIN     = /crawls/warcs/FINISH_DRAIN
  OPEN             = /crawls/warcs/PACKED.open
  mode             = single
  compactify       = 1
  2. Troubleshooting it with set -xe shows that the computed find command at pack-warcs.sh:L184 gives no output:
[...]
++ find /crawls/warcs -maxdepth 1 -regex '.*/\(.*\)-\(.*\)-\(.*\)-\(.*\)\.w?arc\(\.gz\)?$'
+ echo '  job_dir          = /crawls/warcs'
[...]
  3. Running the same experiment in the same Docker container with findutils installed (apk add findutils) gives output clearly indicating that the find command at pack-warcs.sh:L184 now works as intended:
[...]
++ find /crawls/warcs -maxdepth 1 -regex '.*/\(.*\)-\(.*\)-\(.*\)-\(.*\)\.w?arc\(\.gz\)?$'
+ for w in $(find $job_dir -maxdepth 1 -regex "${WARC_NAME_RE_FIND}")
+ (( total_num_warcs++ ))
[...]

Potential solutions

I propose one of the following:

  • rewrite parts of pack-warcs.sh and other impacted scripts (to be defined) to make draintasker POSIX-compliant
  • indicate in the README.md that draintasker is intended to be run on GNU systems only
  • check for the presence of GNU findutils at the start of dtmon.py and exit with an error message if it's not present
  • continue troubleshooting of the different scripts and check which parts are failing on Alpine before taking a decision
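Along the lines of the first option, one portable workaround would be
to do the warc-name matching in Python (which dtmon.py already
requires) instead of relying on GNU find's -regex. This is a sketch,
not existing draintasker code; WARC_NAME_RE here is a hypothetical
Python counterpart of the script's WARC_NAME_RE_FIND, adapted from
the find regex quoted above:

```python
import os
import re

# Python re equivalent of the GNU find regex shown in the trace above:
# {prefix}-{timestamp}-{serial}-{host}.warc(.gz) or .arc(.gz)
WARC_NAME_RE = re.compile(r"(.*)-(.*)-(.*)-(.*)\.w?arc(\.gz)?$")

def find_warcs(job_dir):
    """List warcs directly in job_dir (depth 1), portable across find variants."""
    return sorted(name for name in os.listdir(job_dir)
                  if WARC_NAME_RE.match(name))
```

Because the matching happens in Python, the result is the same on
GNU, BusyBox, and BSD systems.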

allow for specifying arbitrary item metadata

currently, item metadata configurable in YAML is limited to a small set of commonly used fields. there's demand for configuring other metadata, most notably "noindex". instead of adding explicit support for each new field, allow users to specify arbitrary metadata under a "metadata" property of the YAML config.
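For example, the proposed config might look like this (the key layout
is a suggestion for the feature, not syntax draintasker currently
supports):

```yaml
# proposed: arbitrary item metadata under a "metadata" property
metadata:
  noindex: true
  sponsor: Internet Archive
  subject: webcrawl
```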
