
accessioning-scripts's Introduction

accessioning-scripts

Scripts used for accessioning born-digital archives at the UGA Special Collections Libraries.

The typical directory structure for accessions is a folder named with the collection id, which contains one folder per accession named with the accession id. Each accession folder contains one folder per transfer media named with the DMID (media id).

find-long-paths.py

Script usage: python /path/to/script /path/to/accession/directory

This script identifies and creates a CSV log of all the files in an accession with file paths that exceed the Windows maximum of 260 characters. These long file paths need to be identified and shortened prior to bagging the accession, otherwise they will raise permissions errors from bagit.py. The CSV can then be used as a change log to document the new shortened paths.

format-analysis.py

Script usage: python path/to/format-analysis.py path/to/accession_folder

Use an absolute path for the accession_folder. A relative path may prevent FITS XML from being generated.

Before running the script, download NARA's Preservation Action Plans CSV and create a file named configuration.py from the configuration_template.py in the accessioning-scripts repo.

This script extracts technical metadata from files in the accession folder, compares it to multiple risk criteria, and produces a summary report to use for appraisal and evaluating an accession's complexity.

The script is designed to run repeatedly as the archivist makes changes based on the information, such as deleting unwanted files or editing the risk assigned to each file. The script adjusts how it works based on the files in the parent directory of the accession folder, reusing information from previous script iterations where it will save time or allow manual updates.

  1. If there are no script-generated files present, the script creates FITS XML for every file in the accession folder, and the FITS summary spreadsheet, risk spreadsheet and analysis spreadsheet.

  2. If there is a folder of FITS XML, the script updates the FITS XML and the FITS summary spreadsheet to match the files in the accession folder, creates the risk spreadsheet (if one is not already present), and makes the analysis spreadsheet.

  3. If there is a risk spreadsheet, the script uses it to make the analysis spreadsheet. The risk spreadsheet is not automatically updated to match changes in the FITS summary spreadsheet, so if files in the accession folder have changed, delete the risk spreadsheet before running the script again and it will be regenerated with the current formats.

technical-appraisal-logs.py

Script usage: python /path/to/script /path/to/accession/directory [compare]

This script requires an installation of 'pandas' in your Python environment.

This script generates a CSV manifest of all the digital files received in an accession. It also identifies file paths that may break other scripts and saves those paths to a separate log for review.

Using the "compare" argument compares the initial manifest to the files left in the accession after technical appraisal and generates an additional CSV log of any files that were deleted in the process.


accessioning-scripts's Issues

Keep files with duplicate names when updating FITS CSV

Location: format_analysis_functions.py > update_fits()

Description: When new FITS XML is made for files added since the last script iteration, if a new file was added that has the same name as a file in another folder (different path), it will overwrite the old FITS XML. If FITS had been run on the whole accession folder, it would have added a number (name, name-1, name-2) to differentiate between files with the same name but a different path.

Priority: This does not impact the list of format identifications, only the number of files associated with each identification, and it is currently unlikely to occur.

Add other NARA low risk to other risk category

Location: format_analysis_functions.py > match_other_risk()

Description: Other risk currently includes NARA low risk ratings with a transform recommendation. Expand this to include every recommendation except “Retain”.

Change index for subtotals

Location: format_analysis_functions.py > subtotal()

Description: Use as_index=False with groupby so the columns used for subtotals stay as columns. This will make it easier to do a complete test of the function output.

Impact: Also set index=False when saving the subtotals to format-analysis.xlsx at the end of format_analysis.py
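For reference, the groupby keyword is `as_index=False` (while `index=False` is the keyword when saving with to_excel). A sketch of the proposed change, with hypothetical column names:

```python
import pandas as pd

def subtotal(df, criteria):
    """Subtotal file count and size; as_index=False keeps the grouping
    columns as ordinary columns instead of an index.
    Column names are assumptions for illustration."""
    return df.groupby(criteria, as_index=False).agg(
        File_Count=("Path", "count"),
        Size_GB=("Size_GB", "sum"),
    )
```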

Include FITS error testing location in configuration

Location: format_analysis_tests.py > test_fits_class_error()

Description: This test requires a fits.bat or fits.sh file saved to a directory on a different drive letter than the accession folder. Currently, this path must be updated within the function, and if it is incorrect the test will fail. All other variables that require setup for a local machine are in configuration.py.

Impact: Also update configuration_template.py and check_configuration() in format_analysis_functions.py. It could be an optional variable since only developers run tests.

Priority: This test will not be necessary if we add a test for the drive letter to check_configuration().

Split unit test for full script

Location: test_x_full_script.py

Description: Split into three test documents, one per iteration, and within each make specific unit tests for the status message and for each tab in the script output (the format_analysis.xlsx spreadsheet). Right now the output is only tested after iteration 3, so it is hard to tell where a problem originated. And because all the assertEqual checks are inside the same unit test, the first failure stops the rest from running, so other errors are not found until the first is fixed. If each test document only tests the output of one iteration, the script could set up that output once with setUpClass (the details of which still need to be worked out), since setup is slow, and then run all the tests against it.

Error handling for gaps in data during FITS CSV creation

Location: format_analysis_functions.py, make_fits_csv()

Description: format_analysis.py script stops and throws a TypeError when it hits a NoneType object while creating the FITS CSV. This has happened on two different accessions, both of which were unusually large (100s of GB).

Example of error message:

Traceback (most recent call last):
  File "F:\scripts\accessioning-scripts-main\format_analysis.py", line 72, in <module>
    make_fits_csv(fits_output, accession_folder, collection_folder, accession_number)
  File "F:\scripts\accessioning-scripts-main\format_analysis_functions.py", line 281, in make_fits_csv
    rows_list = fits_row(f"{accession_folder}_FITS/{fits_xml}")
  File "F:\scripts\accessioning-scripts-main\format_analysis_functions.py", line 174, in fits_row
    format_data = {"name": identity.get("format"), "version": get_text(identity, "version")}
  File "F:\scripts\accessioning-scripts-main\format_analysis_functions.py", line 140, in get_text
    value += item.text
TypeError: can only concatenate str (not "NoneType") to str

Ideas: It would be helpful to be able to see that this error has occurred and identify the FITS file that is triggering the TypeError, without stopping the script entirely. Maybe error handling could print a warning to the terminal (and log the NoneType issue in the FITS CSV as NULL data, or something else?) while allowing the script to continue.
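A tolerant version of get_text() along those lines might look like this. It is a sketch built around the traceback above, not the repo's actual function; the element lookup, namespaces, and the "NULL" placeholder are assumptions.

```python
import xml.etree.ElementTree as ET

def get_text(parent, element):
    """Concatenate the text of matching child elements, treating an empty
    element (text is None) as 'NULL' instead of raising a TypeError,
    and warning so the problem FITS file can be identified."""
    value = ""
    for item in parent.findall(element):
        if item.text is None:
            print(f"Warning: empty <{element}> element recorded as NULL")
            value += "NULL"
        else:
            value += item.text
    return value
```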

Make script mode explicit

Description: The script currently runs in a few different ways, based on what files and folders are already in the accession folder. If FITS files or the risk spreadsheet are present, it uses what is there instead of re-generating them, to support an iterative workflow. Use a script argument to make explicit how the script should run, which avoids errors caused by incorrect inferences about the state of the accession folder and makes it easier to learn how the script operates.

Impact: Throughout the script, use the mode argument to determine what to run instead of testing for if files exist.

Making strings from lists

Location: format_analysis_functions.py > fits_row()

Description: Use join to make semicolon-separated strings from lists instead of iterating over the lists. This occurs in two places in the function: in get_text() and when getting tool information.
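The suggested pattern, sketched for the tool information case (the "name version x" wording of the string is an assumption):

```python
def tools_to_string(tools):
    """Build the semicolon-separated tool string with join, replacing the
    pattern of appending in a loop and trimming a trailing separator."""
    return "; ".join(f"{name} version {version}" for name, version in tools)
```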

Make format_analysis_tests.py into Python unit tests

Location: format_analysis_tests.py

Description: Use the Python unittest framework for the tests instead of plain Python functions. Some of the current tests, such as ones that include multiple data variations, will need to be split into smaller tests. This will be easier to maintain and will make it easier to run tests in batches.

Impact: This may require splitting some functions into multiple functions for more precise testing.

Priority: It will aid in the next cycle of development (February 2023) to have unit tests ready.

Automatically update Risk CSV

Location: format_analysis.py, where the existing full_risk_data.csv is read into df_results

Description: When the full_risk_data.csv exists from a previous iteration of running the script, update it to remove deleted files and add new files, including calculating their risk information, similar to the update_fits() function. Right now, the archivist must delete the full_risk_data.csv for the data to be updated, which means losing any edits the archivist has already made.

Make unit test for ET Parse Error

Location: test_fits_row

Description: Could not figure out how to generate the ParseError needed to unit test the error handling.

Priority: This is an extremely rare error.

Add subtotals for technical appraisal and other risk categories

Location: format_analysis.py, where the subtotal() function is called

Description: Now that technical appraisal and other risk are sorted into categories, it could be helpful to have subtotals to get a quick understanding of the kinds of risk present.

Impact: Save the new subtotals to format-analysis.xlsx at the end of format_analysis.py.

Priority: Haven’t used the workflow enough since categories were added to know if this would be helpful.

Get updated NARA Preservation Action Plans CSV automatically

Location: format_analysis.py, before reading the CSVs with data into dataframes

Description: Check whether a new version of the NARA Preservation Action Plans CSV has been published in their repo and, if so, download it to a consistent location. This ensures the information is always up to date. Alternatively, if possible, read the data directly from GitHub.

Impact: To remove this path from configuration.py, update configuration_template.py and don’t test for this path in the check_configuration() function.

File Format Migration Report

Description: Evaluate preservability of files by comparing the formats to existing migration capabilities and categorizing the amount of work required for preservation. For example, is the format acceptable as is, do we have an automated or manual migration pathway already established, or would it require further research and testing.

Impact: requires documenting migration pathways, and formats that do not require migration, in a machine-readable way (spreadsheet would be sufficient).

Ideas: See "Do We Really Know Our Data?" from iPres 2022 proceedings.

NARA Risk Matching: Fix FITS data

Description: Improving our format identifications could simplify risk matching. Possible methods:

  • If the same format is identified with and without a PUID (by different tools), keep only the identification with the PUID.
  • Deal with names and file extensions that are consistent mismatches.
  • Narrow which tools are used with which formats, to eliminate ones that are always wrong.
  • Map acronyms for when FITS spells out a name and NARA doesn't, or vice versa.
  • Get version numbers for FITS identifications where the version is part of the name.
  • Split into multiple IDs if there is more than one version or PUID?

Replace spaces in NARA DF column names with underscores

Location: format_analysis_functions.py > csv_to_dataframe()

Description: The NARA CSV has spaces in multi-word column names, which can cause problems with pandas. It also differs from the FITS column names (which use underscores), which could increase typos during development.

Priority: Everything currently works. This change is only necessary if it starts causing problems.

Use configparser for configuration file

Location: replace configuration.py with configuration.ini and use parser for check_configuration function

Description: Use the Python configparser library so it is easier to parse and validate the contents of the file. The script would read the .ini file and save the information to variables instead of importing them as global variables in each file.

Impact: would need to add these variables as arguments to the functions.

Ideas: notes from trying this are in Digital Stewardship Teams: Programming/Accessioning Formats Script Development/configuration_experiment

Priority: The main motivation is simpler unit testing, but it works and can be tested as is.
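A sketch of what the change could look like. The section and option names in the .ini text are invented for illustration, not taken from the repo:

```python
import configparser

# Hypothetical configuration.ini contents; real option names may differ.
INI_TEXT = """
[paths]
fits = C:/fits/fits.bat
nara_csv = C:/data/NARA_PreservationActionPlan.csv
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)  # a real script would use config.read("configuration.ini")
fits_path = config.get("paths", "fits")
nara_csv = config.get("paths", "nara_csv")
```

Because the values arrive as ordinary variables, they can be passed into functions as arguments, which is what makes the unit testing simpler.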

Add temporary files to technical appraisal

Location: format_analysis_functions.py > match_technical_appraisal()

Description: Add temporary files as another category to technical appraisal. Temporary files include file names that start with either “~” or “.”, which are often but not always identified by FITS as unknown binary. The temporary files category has precedence over the format category but not the trash category.

Impact: If this is implemented, it should be removed from the technical-appraisal-logs.py script.

Regex FutureWarning

Location: get_text(), df_fits["fits_name_version"] = df_fits["fits_name_version"].str.replace("\snan$", "")

Description: PyCharm gives a FutureWarning: the default value of regex will change from True to False in a future version. Currently using Python 3.8. After upgrading Python, verify that this still works. The expression removes " nan" from the end of the combined FITS format name and version in those instances where there is no version.
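Passing `regex=True` explicitly silences the FutureWarning and preserves the current behaviour regardless of which pandas default is in effect:

```python
import pandas as pd

df_fits = pd.DataFrame({"fits_name_version": ["PDF 1.7", "Unknown Binary nan"]})

# regex=True makes the intent explicit and is safe across pandas versions.
df_fits["fits_name_version"] = df_fits["fits_name_version"].str.replace(
    r"\snan$", "", regex=True)
```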

Combine methods of matching FITS and NARA

Location: format_analysis_functions.py > match_nara_risk()

Description: Currently, matching methods are applied in order of accuracy and no other is tried once a match is made. If this results in enough occasions where a later match method would have been more accurate, we could combine different methods to highlight potential matching errors.

Priority: This problem hasn’t been observed yet.

Improve accuracy of tests with fits csv

Description: Several unit tests create a dataframe of FITS data with pandas, whereas in production that data is read from a CSV. That can cause differences in datatypes. Instead, save FITS CSV files in the repo to use for testing, or create the CSV within the test and delete it afterwards.

Alternative to manually changing dates in unit test after branching

Location: test_x_full_script.py, in self.ex06, self.ex09, and self.ex11 variables in setUp

Description: Every time a new branch is made of the repo, it changes the date created of the files used for the full script unit test, which have to be manually updated in the expected results for the test. Is there a way to avoid that?

Test FITS drive letter in configuration.py

Location: format_analysis_functions.py > check_configuration()

Description: Check that the drive letter (C:, D:, etc.) of the FITS path matches that of the accession folder (the script argument), since a mismatch causes an error. This error is currently caught when FITS tries to run for the first time; catching it earlier will save time.

Impact: Remove the error handling for “can’t find or load main class” from format_analysis.py after generating new FITS format identification information and from the update_fits() function.
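A check like this could live in check_configuration(). This sketch uses ntpath so the drive-letter comparison is testable on any platform, not just Windows:

```python
import ntpath

def same_drive(fits_path, accession_folder):
    """Return True if two Windows-style paths share a drive letter."""
    fits_drive = ntpath.splitdrive(fits_path)[0].lower()
    accession_drive = ntpath.splitdrive(accession_folder)[0].lower()
    return fits_drive == accession_drive
```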

Add date to full risk data csv

Description: For the Hub Monitoring scripts, we are making an updated version of the full risk data csv every few years to look for changes in risk, which include the date in the filename. Include the date when first making the full risk data csv as well, so it is clearer which file is the most recent. The new naming convention would be accession_number_full_risk_data_YYYY-MM-DD.csv.

Additional organization for data in FITS CSV

Location: format_analysis_functions.py > make_fits_csv()

Description: The fits.csv spreadsheet is organized with one line per format identification (potentially many rows for a single file), which is required for the analysis. It may be useful to have a second version of this spreadsheet that is organized with one line per file. It might have a separate set of columns for each format identification (format_ID_1, format_ID_2, etc.), although that could lead to a lot of columns.

Priority: Need to use the workflow more to know if this would be helpful.

NARA Risk Matching: require PUID match

Location: match_nara_risk()

Description: If both FITS and NARA have a PUID, do not allow a match unless the PUIDs are the same. This will reduce the number of false positives. The code already splits out identifications with a PUID for PUID matching, so they could be treated as two separate data sets during matching and combined at the end.

NARA Risk Matching: extracting version

Location: match_nara_risk()

Description: Currently, anything after the last space in the NARA Format Name is used as the version. There are additional formats with versions that do not match this pattern. Two patterns that also put the version after the last space, but with extra characters to remove, and so might be easy to implement, are "name (version)" and "name v.version".

Priority: waiting to see how often formats with this pattern are in our accessions.
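A sketch of how the two extra patterns could be folded into the version extraction. This regex is an illustration, not the repo's matching code, and the format names in the examples are invented:

```python
import re

def split_nara_name(format_name):
    """Split a NARA Format Name into (name, version), covering the base
    'name version' pattern plus 'name (version)' and 'name v.version'.
    Returns an empty version when no trailing version is found."""
    match = re.match(r"^(.*?)\s+(?:v\.|\()?([\d.]+)\)?$", format_name)
    if match:
        return match.group(1), match.group(2)
    return format_name, ""
```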

Improved matching formats to NARA risk

Location: format_analysis_functions.py > match_nara_risk()

Description: Currently, if a PRONOM id doesn’t match, most matches are by file extension, and these are not precise because it matches each version and sometimes other formats as well. Improve matching with name and version. The challenge is FITS format identifications separate name and version, while NARA combines them in one column in different patterns.

Ideas: Extract any number from NARA format identifications to use for version matching with a name or extension. Define common mapping between FITS and NARA names. Fuzzy matching of names and versions. We did try fuzzywuzzy but couldn’t get an accurate enough match.

Priority: Cleaning up these matches is manual and time-consuming to do and to document. PDF is a common one with many versions that have different risk values.

Don't print function message in unit test

Location: test_argument, test_csv_to_dataframe

Description: Don't let the function print a message during testing. It gets mixed with the unit test output and that can be confusing.

Ideas: I tried a solution that used the io library to change sys.stdout, but it stopped working. Could try this StackOverflow solution.
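The standard-library contextlib.redirect_stdout may be more reliable than reassigning sys.stdout directly. A sketch of a test helper; the printed message in the example is invented:

```python
import contextlib
import io

def run_silently(func, *args, **kwargs):
    """Call func while capturing anything it prints; return the function's
    result and the captured text so a test can still assert on the message."""
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        result = func(*args, **kwargs)
    return result, captured.getvalue()
```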

Calculate paths for ITA and Risk CSVs

Location: format_analysis.py, before reading the CSVs with data into dataframes

Description: Have the script construct the paths to ITAfileformats.csv and Riskfileformats.csv, which are in the same repo as this script. The path to the script repo is in sys.argv[0].

Impact: To remove these paths from configuration.py, update configuration_template.py and don’t test for these paths in the check_configuration() function.
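The path construction could be as simple as the sketch below, using sys.argv[0] as the issue suggests (the function name is invented for illustration):

```python
import os
import sys

def repo_csv_path(filename):
    """Build the absolute path to a CSV stored alongside the running
    script, deriving the repo directory from sys.argv[0]."""
    repo_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    return os.path.join(repo_dir, filename)
```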

Make a function for subsets

Description: Subsets are calculated with pandas filters in the main body of the script. Making them a function would automatically keep unit tests in sync.

Ideas: might need to filter the dataframe before passing it to the function. Or could store which filters to use and/or columns to drop within the function based on a parameter that indicated the subset type.

Calculate MD5 during error tests

Location: format_analysis_tests.py, throughout

Description: Calculate the MD5 instead of using a default value for Excel and zip test files that change each time they are made.

Priority: The MD5 is copied from FITS with no transformation and there are other fields that are also copied which are tested, so this is unlikely to find something new.

Document changes to full_risk_data.csv

The archivist manually makes changes to the risk spreadsheet to clean up the automatic matches between FITS format identifications and the NARA preservation plans. Is there a way to document which matches are from the script automation and which are decisions made by the archivist?

Expand media subtotals

Location: format_analysis_functions.py > media_subtotal()

Description: The media subtotal is for the first level of folders within the accession folder, which is enough information while most accessions are on many pieces of media. When more accessions are transfers from large hard drives or the cloud, an additional level or two of folders may need to be added to this analysis.

Priority: The need for this level of information hasn’t come up yet.

Filter dataframes with multiple lines of code

Location: format_analysis_functions.py > match_technical_appraisal() and match_other_risk()

Description: There are some long lines of code that combine multiple filter conditions for a dataframe. Use the strategy of saving each filter condition to a variable and then combining those filters into a single line. This improves code understandability by using variable names to describe each filter’s purpose and making shorter lines.
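The suggested strategy, sketched with hypothetical columns and conditions (the real filters in match_technical_appraisal() and match_other_risk() differ):

```python
import pandas as pd

df = pd.DataFrame({"Format": ["empty", "Plain text", "Unknown Binary"],
                   "Size_KB": [0, 12, 0]})

# Each condition gets a descriptive variable name, then the combination
# fits on one short, readable line.
is_empty_format = df["Format"].str.lower() == "empty"
is_zero_bytes = df["Size_KB"] == 0
candidates = df[is_empty_format | is_zero_bytes]
```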

NARA Risk Matching: FITS no version

Location: match_nara_risk()

Description: When matching FITS to NARA, NARA sometimes uses "unspecified version" when there is no version, while FITS leaves the version blank. Having these match will reduce the number of times that multiple NARA format rows are matched to a single FITS format row, if we have enough files in formats where NARA includes unspecified version.

Impact: if FITS blank versions are replaced with unspecified version, it would prevent matches for formats where there is no version.

Priority: waiting to see how often these formats are in our accessions to decide if this is worth it.

Change dictionary key

Location: format_analysis_functions.py > fits_row()

Description: The dictionary with format information has a key that combines the format name and version, which are also part of the dictionary value. We can instead use a tuple of (name, version) as the key and not duplicate the information in the value.
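A sketch of the proposed key change, with invented sample data. The tuple key carries the name and version, so the value no longer repeats them:

```python
# Group tool names under a (name, version) tuple key.
format_data = {}
for name, version, tool in [("PDF", "1.7", "Droid"), ("PDF", "1.7", "Jhove"),
                            ("XML", "1.0", "Jhove")]:
    format_data.setdefault((name, version), {"tools": []})["tools"].append(tool)
```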

Sync match to NARA with Hub Monitoring script

Hub monitoring script risk_update.py is also aligning format identifications with the NARA preservation spreadsheet. Discovered that Technique 2 is not using the case-insensitive columns and that Technique 4 needed to remove \s and $ from the regex to be able to remove versions of NO VALUE.

Update Fits_Multiple_IDs column in Full Risk CSV during iteration

Location: format_analysis.py, where the existing full_risk_data.csv is read into df_results

Description: If the archivist edited full_risk_data.csv to select the single correct identification, the value of the FITS_Multiple_IDs column becomes incorrect. The script could detect if the column has TRUE but there is only one row with that file path in the spreadsheet and update the value to “FALSE - format ID selected by archivist”.

Priority: We are not relying on the value of this column for any of our analyses.
