Archive-It APIs Scripts

Overview

These scripts use the Archive-It web archiving service APIs (Partner API and WASAPI) to generate reports. They are used to prepare for quarterly downloads from Archive-It for preservation and to review and update metadata.

All reports are CSVs. The report scripts in this repository are collection_metadata_report.py, preservation_download_tracker.py, seed_metadata_report.py, and warc_csv.py; their arguments are described below.

Getting Started

Dependencies

  • pandas: edit and summarize API output
  • requests: download content from the APIs

Installation

Prior to using any of these scripts, create a file named configuration.py, modeled after configuration_template.py, and save it to your local copy of this repository. This defines a place for script output to be saved and includes your Archive-It login credentials.
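A minimal configuration.py might look like the sketch below. The variable names here are assumptions for illustration; copy the actual names from configuration_template.py in this repository.

```python
# configuration.py -- a sketch; take the real variable names from
# configuration_template.py in this repository.

# Folder where script output (the CSV reports) is saved.
script_output = "C:/archive-it/reports"

# Archive-It login credentials used for API calls.
username = "your-archive-it-username"
password = "your-archive-it-password"
```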

Script Arguments

collection_metadata_report.py

  • required (optional): add "required" to limit the report to UGA's required collection metadata fields. Otherwise, all fields are included.

preservation_download_tracker.py

  • warc_metadata_path (required): the location of the WARC metadata report, created using warc_metadata_report.py.

seed_metadata_report.py

  • required (optional): add "required" to limit the report to UGA's required seed metadata fields. Otherwise, all fields are included.

warc_csv.py

  • Both date arguments are formatted YYYY-MM-DD and define the date range of WARCs to include.
  • start_date (required): first store date of WARCs to include.
  • end_date (required): first store date of WARCs NOT to include (last date included is the day before end_date).
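The date arguments define a half-open range: start_date is included, end_date is not. A minimal sketch of that check (function names here are illustrative, not the script's actual code):

```python
from datetime import datetime

def parse_arg_date(value):
    """Validate that a script argument is formatted YYYY-MM-DD."""
    return datetime.strptime(value, "%Y-%m-%d")

def in_range(store_date, start_date, end_date):
    """True if a WARC's store date falls within [start_date, end_date)."""
    return parse_arg_date(start_date) <= store_date < parse_arg_date(end_date)
```

For example, with start_date 2023-01-01 and end_date 2023-04-01, a WARC stored on 2023-03-31 is included and one stored on 2023-04-01 is not.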

Testing

Each script has unit tests for every function and for the script as a whole, except for check_config() (Issue 21) and the API error in get_metadata() (Issue 22). The tests for functions that call the API, and for the scripts themselves, rely on UGA Archive-It data, so the expected results may need occasional updates to stay in sync with our edits. To use these tests with another account, edit all expected results to use data from that account.

Workflow

These scripts are used for two different workflows at UGA:

The reports may also be created and used individually.

Author

Adriane Hanson, Head of Digital Stewardship, University of Georgia


Issues

Add argument error handling function

Now that there are two arguments, both of which can have multiple errors (missing or incorrectly formatted), make a function that tests the arguments for all errors before exiting the script. Also add a check that the first date is earlier than the second (error if argv[1] > argv[2]).
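One way such a function could look (a sketch; the function name, error messages, and return shape are assumptions, not the repository's code):

```python
from datetime import datetime

def check_arguments(argv):
    """Collect every argument error before exiting the script.

    Returns (start_date, end_date, errors), where errors is a list of
    human-readable problems found with argv[1] and argv[2].
    """
    errors = []
    dates = []
    for position, name in ((1, "start_date"), (2, "end_date")):
        if len(argv) <= position:
            errors.append(f"Missing required argument: {name}")
            dates.append(None)
            continue
        try:
            dates.append(datetime.strptime(argv[position], "%Y-%m-%d"))
        except ValueError:
            errors.append(f"{name} is not formatted YYYY-MM-DD: {argv[position]}")
            dates.append(None)
    start, end = dates
    # New check from this issue: the range must run forward in time.
    if start and end and start >= end:
        errors.append("start_date must be earlier than end_date")
    return start, end, errors
```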

Include more WASAPI fields

Reviewed what is in WASAPI. Right now, we have everything in the report that we find useful, but there are other things that could be good to have just in case Archive-It ever ceases to exist and we lose access to their copy of the data.

WASAPI fields currently not included in the report:

  • filetype: Add to have something to compare FITS to, if needed. So far, always warc.
  • sha1 checksum: Add for future ability to do extra verification
  • account: No. Always same (UGA's Archive-It number).
  • crawl-time: Add. Indicates when the website was live (when the WARC was done).
  • crawl-start: Add. Indicates when the website was live, since store date can be months later.
  • locations: No. A list of two download URLs per WARC; only needed if we start using this report for downloading.
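A sketch of pulling the extra fields out of a WASAPI file record, plus the request that would fetch records (the helper names are assumptions; the endpoint and parameters follow the public WASAPI spec):

```python
def summarize_wasapi_file(record):
    """Pick the extra fields discussed above out of one WASAPI file record."""
    return {
        "filetype": record.get("filetype"),
        "sha1": record.get("checksums", {}).get("sha1"),
        "crawl_time": record.get("crawl-time"),
        "crawl_start": record.get("crawl-start"),
    }

def get_warc_records(username, password, page_size=10):
    """Download one page of WASAPI file records (untested network sketch)."""
    import requests  # already a dependency of this repository
    response = requests.get(
        "https://warcs.archive-it.org/wasapi/v1/webdata",
        params={"page_size": page_size},
        auth=(username, password),
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["files"]
```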

Add example reports

It would help others decide whether they want to use the scripts if they could see example outputs.

Always use script_output from config for saving

Currently, metadata_check_combined.py requires an output directory as an argument and metadata_check_department.py has an optional argument to supply a different output directory. Simpler and more consistent to always use the location in the configuration file. That file is required because it also has the Archive-It credentials, so it will be there.

Include end date

warc_csv.py currently has an earliest date for WARCs to be included but not an end date. In practice, we have wanted WARCs from the earliest date to the present, but it would be more flexible to be able to indicate an end date.

Expand WASAPI Limits

WASAPI calls currently have a limit of 1,000 WARCs, and we now have more than that. Test whether -1 retrieves everything (it works that way with the Partner API), or else increase the current limit.
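An alternative to raising the limit is following WASAPI's "next" links until every page is collected. A sketch, with the page fetcher injected so the paging logic is testable (function names are assumptions):

```python
BASE_URL = "https://warcs.archive-it.org/wasapi/v1/webdata"

def get_all_files(fetch_page, page_size=1000):
    """Collect every WARC record by following WASAPI's "next" links.

    fetch_page(url) returns the decoded JSON for one page; in the real
    script it would wrap requests.get(url, auth=...).json().
    """
    url = f"{BASE_URL}?page_size={page_size}"
    files = []
    while url:
        page = fetch_page(url)
        files.extend(page["files"])
        url = page.get("next")  # null/None on the last page
    return files
```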

Test configuration file

Make a shared function to verify the configuration file is present and has the correct variables. Where possible, test the values of the variables are valid, such as a file path exists.

Add data limits

One common use of the seed report is to look for missing metadata for the quarterly preservation download. There are some seeds which will never have complete metadata and always show up in the report. They were just for testing, are departments that don't use the preservation workflow, or were tried but never successfully crawled. Being able to filter these seeds by crawl dates or some other data limit would reduce this noise.

Only include active collections and seeds

Inactive collections and seeds are likely from testing and do not have complete metadata. Because they are not active, they don't need correction and can be ignored.

Unit testing for API error with get_metadata()

Is there a way to make the API error happen? Right now, the test does not call the function; instead, it duplicates the portion of the code that assigns and checks status_code so that an incorrect status can be injected. Not a priority, since this is very simple code.

Request a specific department

metadata_check_departments.py produces a lot of spreadsheets. It would be more convenient to be able to specify which department to run the report for, both for the admin monitoring the whole account and so users from a department can run a report for themselves.

Before deleting metadata_check_departments.py, review whether there are improvements that should be incorporated into metadata_check_combined.py.

Add more seed information

For researching missing metadata prior to a preservation download, it would be helpful to have the UGA department (based on AIT collection), date of last crawl, and the status of the crawl (test saved, test unsaved, etc.).

Use CSV for web aip script input

Now that these downloads are bigger, it doesn't work as well for the AIP script to make everything in a single batch. It takes a long time and is prone to crashing. If a CSV was the starting point, it would be easier to split the quarterly download into batches. Right now, the script can restart based on the CSV it generates as the first step, but to split a download into batches from the start I have to manually comment out most of the script to create the CSV.

Make tracking spreadsheet automatically

I convert the WARC CSV into a seed-based spreadsheet to use for tracking the preservation process through ARCHive ingest. This could be created with a script.

Manual steps:

  • Dedup with columns AIT Collection, Seed, Job, Crawl Def, AIP Title
  • Check if there are any duplicate seeds due to job or crawl def repeating, and if so merge those values into one row.
  • Do a subtotal by seed for the size in GB.
  • Do a subtotal by seed for the number of WARCs.
  • Add columns for the workflow steps: AIP Errors, QC Part 1, Upload to Ingest, Ingest into ARCHive, QC Part 2, Complete.
  • Select 3-10 for QC, getting a mix of collections, sizes, and 1 vs multiple WARCs. Put n/a in the QC columns for the rest. For part 2, pick 1 of a small size for copying back and otherwise put metadata only.
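The dedup and subtotal steps above could be sketched with pandas, which is already a dependency (the column names and size unit are assumptions based on the list above; merged job/crawl-def values are joined into one cell):

```python
import pandas as pd

WORKFLOW_STEPS = ["AIP Errors", "QC Part 1", "Upload to Ingest",
                  "Ingest into ARCHive", "QC Part 2", "Complete"]

def make_tracker(warc_report):
    """Collapse a WARC-level report into one row per seed.

    Merges repeating job/crawl-def values, subtotals size and WARC count
    by seed, and adds empty columns for the workflow steps.
    """
    merge_values = lambda s: "; ".join(sorted(set(s.astype(str))))
    tracker = warc_report.groupby("Seed").agg(
        collection=("AIT Collection", "first"),
        title=("AIP Title", "first"),
        jobs=("Job", merge_values),
        crawl_defs=("Crawl Def", merge_values),
        size_gb=("Size (GB)", "sum"),
        warcs=("Size (GB)", "size"),
    ).reset_index()
    for step in WORKFLOW_STEPS:
        tracker[step] = ""
    return tracker
```

Selecting the 3-10 seeds for QC would remain a manual judgment call, since it requires a mix of collections and sizes.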

Add why title missing in get_seed_metadata()

The title could be missing because the seed isn't in Archive-It or because the seed doesn't have a value in the title metadata field. Right now, either case is given the default value "No title in Archive-It". Look at the error types more closely and give different default values depending on the error.

Low priority because a seed being missing from Archive-It is unlikely to happen.
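The two cases could be told apart roughly like this (a sketch only; the Partner API response shape and the default strings are assumptions):

```python
def default_title(seed_json):
    """Choose a default title that says why the real title is missing.

    seed_json is the decoded Partner API response for one seed: an empty
    list if the seed is not in Archive-It, otherwise a record whose
    metadata dictionary may or may not have a Title field.
    """
    if not seed_json:
        return "Seed not in Archive-It"
    metadata = seed_json[0].get("metadata", {})
    title = metadata.get("Title")
    if not title:
        return "No title in Archive-It"
    return title[0]["value"]
```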

Improve metadata validation

A common use for the report is checking for metadata that is missing or incorrect. Highlight blanks in required fields and confirm that values meet requirements, such as how rights statements and dates are formatted.

Unit test for check_config()

I don't know how to test variations of the configuration file, since the data is read on import. Might need to run the script using subprocess.run() from within each test.

Output Excel instead of CSV

To store and edit the preservation download tracker in Teams, it needs to be Excel. Have the script produce Excel instead of converting once in Teams. This is just a temporary file, so it doesn't need to be the CSV preservation format.
