uga-libraries / web-aip Goto Github PK

View Code? Open in Web Editor NEW

3.0 3.0 1.0 1.02 MB

Downloads web content captured using Archive-It and gets it ready to be transformed into AIPs for preservation.

License: Creative Commons Attribution Share Alike 4.0 International

Python 100.00%

web-aip's People

Contributors

Stargazers

Watchers

Forkers

hamzahassan66

web-aip's Issues

Future Warning: incompatible dtype

Getting a Future Warning for an incompatible dtype (string) when adding log text to seed_df or fill blanks with "" in unit tests. The columns have the dtype float64, which is the type assigned to the log columns when seed_df is created in seed_data() and the columns are blank. These columns are intended for strings.

Convert all columns to string: https://stackoverflow.com/questions/22005911/convert-columns-to-string-in-pandas. But before doing that, need to review the code for anything to convert back to a number for math operations. If it causes too many problems, may need to change each of the log columns to a string separately.

Check WARC name in check_aip?

Check for WARC names instead of WARC count? This would catch an error where there is an equal number of missing and extra WARCs, but that is unlikely to happen so not a priority right now.

Use script output folder for unit test outputs

Currently, unit tests are saving to the tests folder in the script repo, which is the current working directory (at least in PyCharm). In the real script, content is in the script output directory.

Could change the current directory to the script output directory in order to be closer to the real script function and to make deletion easier at the end (delete script output directory and all its contents instead of each seed_id folder, plus seeds_log.csv and completeness.csv). But the tests are also probably good enough as is.

Improve test for warc_download.py script

Add a test for restarting. Make the seed_df and preservation_download folder with some seed folders by hand in the test, to simulate it being in progress, and then run the script. I can’t break the script mid-progress, since all the test does is give it the initial arguments.

May also be able to add running general-aip.py with the warc_download.py script output, assuming that the general-aip repo is a sibling of the web-aip repo.

Memory error from download_warcs

With my work desktop machine, get a memory error after downloading a few warcs. This has not happened on my home laptop. Might need to clear the memory of the previous warc at the start of the loop before the API call of the next one. Look into garbage collect.

Switch collection used in test_get_report

Archive-It has added a few fields to the collection report which will change with every crawl: last_crawl_date, num_active_seeds, num_inactive-seeds, and total_warc_bytes. Change the test collection to one we're no longer adding to, so that the expected results are more stable.

Improve unit testing for seed_data()

Split into more functions and create tests for each function. Include tests for the CSV contents and not just the dataframe. Wait to do this until switch to CSV input for warc_download.py, since it will require a rework of this function.

Add API error test to unit tests for get_warc_info()

Giving the function a WARC name that is not in Archive-It cases an index error. I was not able to figure out how to cause an API error yet.

Use general-aip to make AIP?

Currently, web_aip_batch.py downloads all the content, runs the functions also used by general_aip.py to create AIPs, and checks the completeness. This requires manually copying the aip_functions.py file from the general-aip repository. It could be easier to keep in sync if this script just downloaded the content from Archive-It and organized it so it worked for the general_aip.py script. It would mean needing to run a second script after the first finished, but given how time consuming both are, it probably wouldn't be that much of a delay and would give a natural pausing spot if the computer was needed for something else.

Changes needed (off the top of my head, there may be more): web-aip should not make metadata and objects folders and needs to make metadata.csv used by general-aip. And general-aip would need to be able to sort web metadata and possible do a fixity check since the files could be sitting for a while before the general-aip script runs.

Unit tests for linux_unzip.py script

There are not any tests for the linux_unzip.py script. Will need to add functions first to support better testing.

Function for argument testing

There are multiple tests of the 2 arguments, so they are better suited to a function. Also makes testing easier.

Stop functions from printing during unit tests

Functions that print: check_aips(), check_config(), seed_data(), verify_warc_fixity() - from md5deep, warc_download.py

Description: During a unit test in PyCharm, anything the function prints displays red in the terminal, so it initially looks like the test may have failed even if it passed. Is there a way to keep a function from printing during a test?

Limiting departments that download

With the changes to when the AIP ID is calculated, it no longer checks for the department to skip the download of any which do not use this script. It doesn't happen often and will no longer be a problem once a CSV is used as input for warc_download.py, since the CSV can be reviewed and anything deleted that isn't wanted.

But if it takes longer than expected to implement the CSV input or if there are more cases of not wanting to download something, revisit the decision.

Use CSV for script input

This script is often run multiple times because of lost API connections. To restart more gracefully, if there is already a log csv, it uses that information. If it is the first time, it creates the CSV from the API and continues. It would be more flexible to always use a CSV so that the archivist can limit which seeds to include before running the script the first time. There are cases where we crawl something and decide not to preserve it.

It will still need to make aip_log.csv and the script output directories if they aren't already there.

Run on Linux?

The last few downloads, every WARC has had an unzip error from gzip, which requires running a separate script to unzip in Linux. Could the warc_download.py script be edited to work in a Linux environment, allowing the entire process to run seamlessly? If not, could remove unzipping from warc_download.py and always wait to do that in Linux.

The attempt to unzip is slow in Windows. And the longer it takes to run, the more likely it gets interrupted by UGA computer updates or other issues.

Upgrade numpy to 1.21

GitHub Dependabot alert.

Eliminate update_web_aip.py?

The purpose of this script is to finish an AIP when it didn't fully download completely. The archivist manually downloads remaining WARCs from the Archive-It interface and then creates AIPs from them. This is very close to what aip_from_download.py is doing, so it should be possible to update that a little and not need both scripts.

The main difference is that with update_web_aip.py, the AIP was already in a bag and had AIP metadata, so a cleanup function may be needed to get it ready for the usual AIP steps. There is a function reset_aip in web_functions.py, but its purpose is to delete the AIP directory and log information so that the AIP can be completely remade.

Simplify function arguments

Review all function arguments. Are there times things are in seed but also a separate argument? Are there ones I could eliminate? Would it help to have a seed class to store extra information that is always passed with seed? Is there any way to update the log without passing seed, row index, and seed_df every time?

Pandas future warning for making seed_df

In seed_data function, get this error:

FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

seed_df = seed_df.append(seed_info, ignore_index=True)

Avoid hard coding departments

Throughout the scripts, use specific UGA department names and codes to determine logic. Is there a way to use the configuration file for this information so it is easier to update, such as when a new department starts using the Archive-It subscription? It would need things like the regular expressions for their AIP ids, not just name and code.

Log successful seed report redaction

Right now, it just logs if the information was not there to redact. Other steps log success as well as errors.

Use different delimiter in seeds.csv

Currently use a comma when there are multiple data points, like multiple crawl definition ids. However, when the CSV is opened in Excel, it reformats these to a single number. For example, "171234, 174567" becomes 171,234,174,567. Use a semicolon or pipe so that Excel doesn't reformat.

Add logging to metadata_csv() function

It does put information into the metadata.csv itself about the errors encountered, but it would be helpful for the log to have a summary of error/no error to know if the metadata.csv needs review or not.

Dynamically Increase WASAPI Limits

WASAPI calls currently have a limit of 1000 WARCs and we have more than that now. Test if -1 works to get all (that works with the Partner API) or else increase the current limit.

Unit tests review

I made a lot of unit tests at once and then edited a lot of those functions. Do a review to make sure each test has all the necessary variations and are as close as possible to how the function works in production (input data types, where it saves, etc.). Also look for things that could more efficient, like using functions for repetitive code (e.g., read CSV into pandas and df to list) or make seed_df within setUp and use row index to access the right row for each test, instead of a separate seed_df for each test. It makes it easier to update later when there are changes to seed_df.

Reviewed all unit tests and made some changes when moved to Linux. Additional ideas are below or in separate issues.

Add department prefix to "no collection" default id

Hargrett and MAGIL both use 0000 to indicate that there is no related collection, which is used later to skip related collection when making the preservation.xml. However, when a CSV is opened in Excel, it converts that number to 0. Use harg-0000 and magil-0000 so that the numbers not changed.

Make unit tests for check_config() function

I haven't figured out how to set up tests for this, since the configuration file is ready when the script runs. Maybe I can run the script using subprocess.run() for each unit test, but if so how do I get the output of the result? The function is fairly simple code and I've tested it manually (change the configuration.py file and run the script again from the command line), so this is a lower priority.

Unit tests for check_seeds() function

Waiting to create these tests until that function is updated with the change to using a CSV as the warc_download.py script because this function will change radically then. It would be hard to test now because it would require finding date ranges with the correct variations but no so much content that it takes a very long time to load.

Variations needed:

Missing folder
None of the six metadata CSVs
All six metadata CSVs once
All six metadata CSVs, with more than one crawl definition and crawl job
WARC count doesn't match
Unexpected file type
Unexpected folder

Base AIP completeness check on CSV

Right now, the check_aips function downloads all WARC metadata from the API and then analyses it to determine what should have been in the download, which does not work when a quarterly download is split into batches or when not all seeds are included in a preservation download. Once a CSV is used from the start to define what should be in a batch, use that as the point of comparison.

Use existing WARCs when restart

If the API times out or the script breaks in the middle of creating an AIP, it currently has to be deleted before the script runs again in order for it to be correctly finished. For AIPs with a lot of WARCs, this can mean a lot of wasted time. Is there a way to have the script be smarter about a restart and use existing metadata and WARC files if they were logged as successful? Or is it safer to start over?

Add more unit tests for error handling to download_warcs()

The function catches errors raised by three other functions: API error for getting the WARC metadata, API error for downloading the WARC, and fixity change after download. 2/3 functions raise two different error types. They all use the same code pattern to handle the raised error.

Currently, the only error handling test is for API error for getting the WARC metadata (input is a WARC name that is not in Archive-It). Add tests for the rest if possible. I'm not sure how to cause them to happen mid-function though.

WARC-based logging

The current log is seed-based, and while that does work well for the AIP creation steps, it gets hard to review for seeds with many WARCs. Logging for WARC-specific steps has a list of the success or failure for each WARC and it is not immediately clear if there were errors. There could be a summary in the seed log of the success and a WARC-specific log with the steps that just impact them. Or the entire log could be WARC based, up through getting everything downloaded, and a separate log could be seed/AIP based and have the rest of the steps. The logs are currently merged at the end of the script, so it would be easy to keep them separate.

Check if multiple reports

In download_metadata(): could the API call in get_report() return more than one report, and if so is that an error or just another line in the CSV?

Replace md5deep with hashdeep

The hashdeep library is quicker for calculating MD5 than running md5deep from within the script. Encountered memory errors, so need to add the code which lets hashdeep work on pieces of a file at a time.

Improve format identification

Now that we're unzipping the downloaded WARCs, FITS can identify the WARC format. But we're also getting some false identifications: Hypertext Markup Language, Plain Text, and WARC version 1.0 with extra text after it. Clean up these identifications so they don't make problems for future preservation watch and format migration decisions.

Still discussing the best way to do that with Emmeline. Could involve limiting the tools in FITS that identify WARCs, adjusting the data afterwards, or a combination of both approaches.

Improve unzipping time

Currently using 7zip command line called from the Python script and it is very slow, much slower than when 7zip is used separately (30+ minutes vs. 1 minute). Try a Python library for unzipping.

Make test_download_warcs work without changing directory

Currently, test_script.py does not work when running all the tests in the tests folder because it relies on the current directory being the tests directory and test_download_warcs is changing the current directory in setUp and tearDown. I can't remember why changing it in setUp was necessary. That, in turned, made it necessary to change for tearDown to be able to delete the folder created during the test.

The current directory is also changed in test_unzip_warc, but since it runs after test_script.py, it isn't impacting anything. The rest of the tests are not dependent on what the current working directory is.

The work around until this is fixed is to run all the tests in the tests folder (ignoring the errors from test_script.py) and then run test_script.py separately.

Seed Report Redaction Issue

Partner API is inconsistent about if the login username and password fields are included for the same seed. Ask Archive-It about the behavior. It might be due to credentials expiring.

Move script output folders

Have the script output folders (aips-to-ingest, preservation-xml, etc.) be made in the AIPs directory and skipped when start iterating instead of having to move them at the end. That way, they are already in a the right place if the script gets interrupted.

Use MD5 in log instead of API call

The log now has the MD5 from Archive-It, so don't need to wait on an API call to verify the MD5 is unchanged prior to unzipping. Can just read the log.

Eliminate web_aip_single?

The web_aip_single.py script is out of sync with improvements to web_aip_batch.py and is no longer necessary if change web_aip_batch.py to take a CSV as the input instead of only using dates to limit what is downloaded. If only one AIP is needed, that can be all that is included in the CSV. Even with date bounds, in most cases giving a single day would not download that much extra data, but it could if someone had a heavy crawling day.

If it is kept, it needs to be updated.