Archive-It APIs Scripts

Overview

These scripts use the Archive-It web archiving service APIs (Partner API and WASAPI) to generate reports. They are used to prepare for quarterly downloads from Archive-It for preservation and to review and update metadata.

All reports are CSVs. The report scripts in this repository are collection_metadata_report.py, preservation_download_tracker.py, seed_metadata_report.py, and warc_csv.py; their arguments are described below.

Getting Started

Dependencies

  • pandas: edit and summarize API output
  • requests: download content from the APIs

Installation

Prior to using any of these scripts, create a file named configuration.py, modeled after configuration_template.py, and save it to your local copy of this repository. This defines a place for script output to be saved and includes your Archive-It login credentials.
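A minimal configuration.py might look like the sketch below. The variable names here are assumptions for illustration; copy the actual names from configuration_template.py in this repository.

```python
# configuration.py -- a sketch; take the real variable names from
# configuration_template.py in this repository.

# Folder where script output (the CSV reports) is saved.
script_output = "C:/archive-it/reports"

# Archive-It login credentials used for API calls.
username = "your-archive-it-username"
password = "your-archive-it-password"
```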

Script Arguments

collection_metadata_report.py

  • required (optional): add "required" to limit the report to UGA's required collection metadata fields. Otherwise, all fields are included.

preservation_download_tracker.py

  • warc_metadata_path (required): the location of the WARC metadata report, created using warc_metadata_report.py.

seed_metadata_report.py

  • required (optional): add "required" to limit the report to UGA's required seed metadata fields. Otherwise, all fields are included.

warc_csv.py

  • Both date arguments are formatted YYYY-MM-DD and define the date range of WARCs to include.
  • start_date (required): first store date of WARCs to include.
  • end_date (required): first store date of WARCs NOT to include (last date included is the day before end_date).
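The date arguments define a half-open range: start_date is included, end_date is not. A minimal sketch of that check (function names here are illustrative, not the script's actual code):

```python
from datetime import datetime

def parse_arg_date(value):
    """Validate that a script argument is formatted YYYY-MM-DD."""
    return datetime.strptime(value, "%Y-%m-%d")

def in_range(store_date, start_date, end_date):
    """True if a WARC's store date falls within [start_date, end_date)."""
    return parse_arg_date(start_date) <= store_date < parse_arg_date(end_date)
```

For example, with start_date 2023-01-01 and end_date 2023-04-01, a WARC stored on 2023-03-31 is included and one stored on 2023-04-01 is not.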

Testing

Each script has unit tests for every function and for the script as a whole, except for check_config() (Issue 21) and the API error in get_metadata() (Issue 22). The tests for functions that call the API, and for the scripts themselves, rely on UGA Archive-It data, so the expected results may need occasional updates to stay in sync with our edits. To use these tests with another account, edit all expected results to use data from that account.

Workflow

These scripts are used for two different workflows at UGA:

The reports may also be created and used individually.

Author

Adriane Hanson, Head of Digital Stewardship, University of Georgia


Issues

Add argument error handling function

Now that there are two arguments, both of which can have multiple errors (missing or incorrectly formatted), make a function that tests the arguments for all errors before exiting the script. Also add a check that the first date is earlier than the second (error if argv[1] > argv[2]).
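One way such a function could look (a sketch; the function name, error messages, and return shape are assumptions, not the repository's code):

```python
from datetime import datetime

def check_arguments(argv):
    """Collect every argument error before exiting the script.

    Returns (start_date, end_date, errors), where errors is a list of
    human-readable problems found with argv[1] and argv[2].
    """
    errors = []
    dates = []
    for position, name in ((1, "start_date"), (2, "end_date")):
        if len(argv) <= position:
            errors.append(f"Missing required argument: {name}")
            dates.append(None)
            continue
        try:
            dates.append(datetime.strptime(argv[position], "%Y-%m-%d"))
        except ValueError:
            errors.append(f"{name} is not formatted YYYY-MM-DD: {argv[position]}")
            dates.append(None)
    start, end = dates
    # New check from this issue: the range must run forward in time.
    if start and end and start >= end:
        errors.append("start_date must be earlier than end_date")
    return start, end, errors
```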

Include more WASAPI fields

Reviewed what is in WASAPI. Right now, we have everything in the report that we find useful, but there are other things that could be good to have just in case Archive-It ever ceases to exist and we lose access to their copy of the data.

WASAPI fields currently not included in the report:

  • filetype: Add to have something to compare FITS to, if needed. So far, always warc.
  • sha1 checksum: Add for future ability to do extra verification
  • account: No. Always same (UGA's Archive-It number).
  • crawl-time: Add. Indicates when the website was live (when the WARC was done).
  • crawl-start: Add. Indicates when the website was live, since store date can be months later.
  • locations: No. A list of two download URLs per WARC; only needed if we start using this report for downloading.
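A sketch of pulling the extra fields out of a WASAPI file record, plus the request that would fetch records (the helper names are assumptions; the endpoint and parameters follow the public WASAPI spec):

```python
def summarize_wasapi_file(record):
    """Pick the extra fields discussed above out of one WASAPI file record."""
    return {
        "filetype": record.get("filetype"),
        "sha1": record.get("checksums", {}).get("sha1"),
        "crawl_time": record.get("crawl-time"),
        "crawl_start": record.get("crawl-start"),
    }

def get_warc_records(username, password, page_size=10):
    """Download one page of WASAPI file records (untested network sketch)."""
    import requests  # already a dependency of this repository
    response = requests.get(
        "https://warcs.archive-it.org/wasapi/v1/webdata",
        params={"page_size": page_size},
        auth=(username, password),
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["files"]
```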

Add example reports

It would help others decide whether they want to use the scripts if they could see example outputs.

Always use script_output from config for saving

Currently, metadata_check_combined.py requires an output directory as an argument and metadata_check_department.py has an optional argument to supply a different output directory. Simpler and more consistent to always use the location in the configuration file. That file is required because it also has the Archive-It credentials, so it will be there.

Include end date

warc_csv.py currently has an earliest date for WARCs to be included but not an end date. In practice, we have wanted WARCs from the earliest date to the present, but it would be more flexible to be able to indicate an end date.

Expand WASAPI Limits

WASAPI calls currently have a limit of 1,000 WARCs, and we now have more than that. Test whether -1 retrieves everything (it works that way with the Partner API), or else increase the current limit.
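An alternative to raising the limit is following WASAPI's "next" links until every page is collected. A sketch, with the page fetcher injected so the paging logic is testable (function names are assumptions):

```python
BASE_URL = "https://warcs.archive-it.org/wasapi/v1/webdata"

def get_all_files(fetch_page, page_size=1000):
    """Collect every WARC record by following WASAPI's "next" links.

    fetch_page(url) returns the decoded JSON for one page; in the real
    script it would wrap requests.get(url, auth=...).json().
    """
    url = f"{BASE_URL}?page_size={page_size}"
    files = []
    while url:
        page = fetch_page(url)
        files.extend(page["files"])
        url = page.get("next")  # null/None on the last page
    return files
```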

Test configuration file

Make a shared function to verify the configuration file is present and has the correct variables. Where possible, test the values of the variables are valid, such as a file path exists.

Add data limits

One common use of the seed report is to look for missing metadata for the quarterly preservation download. There are some seeds which will never have complete metadata and always show up in the report. They were just for testing, are departments that don't use the preservation workflow, or were tried but never successfully crawled. Being able to filter these seeds by crawl dates or some other data limit would reduce this noise.

Only include active collections and seeds

Inactive collections and seeds are likely from testing and do not have complete metadata. Because they are not active, they don't need correction and can be ignored.

Unit testing for API error with get_metadata()

Is there a way to make the API error happen? Right now, the test does not call the function; instead, it duplicates the portion of the code that assigns and checks status_code so that an incorrect status can be injected. Not a priority, since this is very simple code.

Request a specific department

metadata_check_departments.py produces a lot of spreadsheets. It would be more convenient to be able to specify which department to run the report for, both for the admin monitoring the whole account and so users from a department can run a report for themselves.

Before deleting metadata_check_departments.py, review whether there are improvements that should be incorporated into metadata_check_combined.py.

Add more seed information

For researching missing metadata prior to a preservation download, it would be helpful to have the UGA department (based on AIT collection), date of last crawl, and the status of the crawl (test saved, test unsaved, etc.).

Use CSV for web aip script input

Now that these downloads are bigger, it doesn't work as well for the AIP script to make everything in a single batch. It takes a long time and is prone to crashing. If a CSV was the starting point, it would be easier to split the quarterly download into batches. Right now, the script can restart based on the CSV it generates as the first step, but to split a download into batches from the start I have to manually comment out most of the script to create the CSV.

Make tracking spreadsheet automatically

I convert the WARC CSV into a seed-based spreadsheet to use for tracking the preservation process through ARCHive ingest. This could be created with a script.

Manual steps:

  • Dedup with columns AIT Collection, Seed, Job, Crawl Def, AIP Title
  • Check if there are any duplicate seeds due to job or crawl def repeating, and if so merge those values into one row.
  • Do a subtotal by seed for the size in GB.
  • Do a subtotal by seed for the number of WARCs.
  • Add columns for the workflow steps: AIP Errors, QC Part 1, Upload to Ingest, Ingest into ARCHive, QC Part 2, Complete.
  • Select 3-10 for QC, getting a mix of collections, sizes, and 1 vs multiple WARCs. Put n/a in the QC columns for the rest. For part 2, pick 1 of a small size for copying back and otherwise put metadata only.
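The dedup and subtotal steps above could be sketched with pandas, which is already a dependency (the column names and size unit are assumptions based on the list above; merged job/crawl-def values are joined into one cell):

```python
import pandas as pd

WORKFLOW_STEPS = ["AIP Errors", "QC Part 1", "Upload to Ingest",
                  "Ingest into ARCHive", "QC Part 2", "Complete"]

def make_tracker(warc_report):
    """Collapse a WARC-level report into one row per seed.

    Merges repeating job/crawl-def values, subtotals size and WARC count
    by seed, and adds empty columns for the workflow steps.
    """
    merge_values = lambda s: "; ".join(sorted(set(s.astype(str))))
    tracker = warc_report.groupby("Seed").agg(
        collection=("AIT Collection", "first"),
        title=("AIP Title", "first"),
        jobs=("Job", merge_values),
        crawl_defs=("Crawl Def", merge_values),
        size_gb=("Size (GB)", "sum"),
        warcs=("Size (GB)", "size"),
    ).reset_index()
    for step in WORKFLOW_STEPS:
        tracker[step] = ""
    return tracker
```

Selecting the 3-10 seeds for QC would remain a manual judgment call, since it requires a mix of collections and sizes.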

Add why title missing in get_seed_metadata()

The title could be missing because the seed isn't in Archive-It or because the seed doesn't have a value in the title metadata field. Right now, either case is given the default value "No title in Archive-It". Look at the error types more closely and give different default values depending on the error.

Low priority because a seed being missing from Archive-It is unlikely to happen.
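The two cases could be told apart roughly like this (a sketch only; the Partner API response shape and the default strings are assumptions):

```python
def default_title(seed_json):
    """Choose a default title that says why the real title is missing.

    seed_json is the decoded Partner API response for one seed: an empty
    list if the seed is not in Archive-It, otherwise a record whose
    metadata dictionary may or may not have a Title field.
    """
    if not seed_json:
        return "Seed not in Archive-It"
    metadata = seed_json[0].get("metadata", {})
    title = metadata.get("Title")
    if not title:
        return "No title in Archive-It"
    return title[0]["value"]
```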

Improve metadata validation

A common use for the report is checking for metadata that is missing or incorrect. Highlight blanks in required fields and confirm that values meet requirements, such as how rights statements and dates are formatted.

Unit test for check_config()

I don't know how to test variations of the configuration file, since the data is read on import. Might need to run the script using subprocess.run() from within each test.

Output Excel instead of CSV

To store and edit the preservation download tracker in Teams, it needs to be Excel. Have the script produce Excel instead of converting once in Teams. This is just a temporary file, so it doesn't need to be the CSV preservation format.
