
uga-libraries / general-aip

This is the general workflow to make archival information packages (AIPs) that are ready for ingest into the UGA Libraries' digital preservation system (ARCHive). The workflow organizes files, extracts and formats metadata, and packages the files. It may be used for any combination of file formats.

License: Creative Commons Attribution Share Alike 4.0 International

XSLT 19.13% Python 70.19% HTML 10.68%

general-aip's People

Contributors

amhanson9, emkaser

general-aip's Issues

md5deep error on 64-bit machine

md5deep raises an error if it is used instead of md5deep64 on a 64-bit machine. Should the validation of configuration.py also check whether md5deep64 is required? For now, just updating the template to indicate it needs to be md5deep64.

Check WARC fixity for web AIPs?

WARCs are downloaded and unzipped from Archive-It prior to running this script to create AIPs. The fixity of the zipped WARC is verified after download. The fixity of the unzipped WARC is not calculated until this script makes the bag. I think it happens fast enough that there isn't a need to calculate fixity at the time of unzip and verify it against the bag here, but this needs some more thought.

Add to test_validate_bag

The code writes the entire message to the log if there is no information in error.details, but I don't know how to force error.details to be empty in order to test this. I am not sure this could ever happen.

Log Review

I've split some steps into multiple functions since designing the log. Make sure all steps are being logged properly.

  • Making bag and not just bag validation
  • Structure directory: log objects folder made correctly even if metadata isn't
  • May be others

Package function: tar and zip with Python

Currently, packaging uses 7zip on Windows and the command line on Mac. To be operating system agnostic and to speed up packaging on Windows, switch to using a Python library to tar and zip.
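
A minimal sketch of what packaging with the standard library could look like. The function name package_aip, the bzip2 compression, and the .tar.bz2 naming are assumptions for illustration, not the script's current behavior.

```python
import tarfile
from pathlib import Path


def package_aip(aip_folder, output_dir):
    """Tar and compress one AIP folder using only the standard library.

    Returns the path to the new archive. The bzip2 compression and the
    file naming are assumptions made for this sketch.
    """
    aip_folder = Path(aip_folder)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    tar_path = output_dir / f"{aip_folder.name}.tar.bz2"

    # "w:bz2" writes the tar and compresses it in a single pass.
    with tarfile.open(tar_path, "w:bz2") as tar:
        # arcname stores only the folder name in the archive, not the full path.
        tar.add(aip_folder, arcname=aip_folder.name)
    return tar_path
```

Because tarfile behaves the same on Windows and Mac, this would remove the operating system branch. Whether it is faster than 7zip on Windows would need to be measured.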

Permissions error when moving AIP to error folder

Location: aip_functions.py, line 31, os.replace() in move_error() function

Description: Attempting to move an AIP with an error to the corresponding error folder can raise the following kind of permissions error: PermissionError: [WinError 5] Access is denied: [aip-dir] -> [error-dir]. This crashes the script and prevents any other AIPs from being created.

Impact: The move_error() function is referenced several times in aip_functions.py: lines 100, 106, 161, 206, 212, 242. It's also used in general_aip.py in line 106.

Ideas: The script should log this permission error and restart the AIP creation process with the next AIP folder in the directory. It may be helpful to print a message to the terminal and somehow label the AIP as having an error since it will have failed to move to the correct error folder.
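
One way the idea above could look, as a sketch only. The function name move_error_safely, the logger name, and the printed message are hypothetical. The real move_error() may take different arguments.

```python
import logging
import os

logger = logging.getLogger("general_aip")  # hypothetical logger name


def move_error_safely(aip_dir, error_dir):
    """Try to move an AIP into an error folder; return True if it moved.

    On PermissionError the problem is logged and False is returned, so
    the caller can go on to the next AIP instead of crashing.
    """
    os.makedirs(error_dir, exist_ok=True)
    destination = os.path.join(error_dir, os.path.basename(aip_dir))
    try:
        os.replace(aip_dir, destination)
        return True
    except PermissionError as error:
        logger.error("Could not move %s to %s: %s", aip_dir, error_dir, error)
        print(f"Unable to move {aip_dir} to the error folder; it was left in place.")
        return False
```

The main loop could check the return value and skip to the next AIP folder when it is False.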

Make delete temp log as files are found?

Location: delete_temp()

Description: Currently, deleted files aren't saved to a log until all files have been deleted. The information is stored in a list until it can be saved. This means if the script is interrupted, there is no record of what was deleted. Could have it save one row at a time but would need a way to add the header row the first time something was deleted.
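
A sketch of saving one row at a time, where the header is written only when the log does not exist yet or is empty. The function name log_deleted_file and the column names are hypothetical.

```python
import csv
import os

# Hypothetical column names, for illustration only.
HEADER = ["Path", "File Name", "Size (Bytes)"]


def log_deleted_file(log_path, row):
    """Append one deleted-file row to the log, adding the header if the log is new."""
    is_new = not os.path.exists(log_path) or os.path.getsize(log_path) == 0
    # Append mode keeps earlier rows if the script is interrupted later.
    with open(log_path, "a", newline="", encoding="utf-8") as log:
        writer = csv.writer(log)
        if is_new:
            writer.writerow(HEADER)
        writer.writerow(row)
```

The file is opened and closed for every row, which is slower than one write at the end but leaves a record of each deletion as soon as it happens.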

Improve XML Comparisons

Tests: test_combine_metadata, test_make_cleaned_fits_xml

Description: The FITS XML does not have a consistent order for the attributes within an element. The current method of comparison reads the document as one string, which requires them to be in the same order for the test to pass. Is there a way to read and compare the documents which is aware of the XML structure and does not see an error if the attributes are present and just in a different order?
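
One possible approach, sketched here, is to canonicalize both documents before comparing them. xml.etree.ElementTree.canonicalize (Python 3.8 or later) writes attributes in a fixed order, so the attribute order in the source no longer affects the comparison. The helper name xml_equal is hypothetical.

```python
from xml.etree import ElementTree as ET


def xml_equal(xml_a, xml_b):
    """Compare two XML strings after canonicalization (C14N 2.0).

    Canonical XML serializes attributes in a fixed order, so input
    attribute order does not matter. strip_text=True removes whitespace
    before and after text content, so indentation does not matter either.
    """
    canonical_a = ET.canonicalize(xml_a, strip_text=True)
    canonical_b = ET.canonicalize(xml_b, strip_text=True)
    return canonical_a == canonical_b
```

This still treats a different order of child elements as a difference. To compare files rather than strings, canonicalize also accepts a from_file argument.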

Print errors from within validation functions

To have less code in the main body, print errors within check_arguments(), check_configuration(), and check_metadata_csv() before raising the error that quits the script. Still quit from the main body so that it is clear at a glance what would cause the script to exit early.

Work with Archive-It content

Currently, the web-aip script uses some of the functions from this script, but that causes syncing issues. Instead, rework this script to take input from web-aip. Main changes are to sort the metadata files into the metadata folder and update the stylesheets, if needed.

Simplify date variation in preservation.xml

May end up with repeated dates of creation because different tools have different time stamps, even though the day (the only part included) is the same. Detect and remove these near duplicates.
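
A sketch of the detection logic in Python, assuming each time stamp starts with the day as YYYY-MM-DD. Time stamps from some tools may use other layouts and would need to be normalized first. The function name unique_days is hypothetical, and the current workflow handles dates in XSLT rather than Python.

```python
def unique_days(timestamps):
    """Return the distinct days from a list of time stamps, in first-seen order.

    Assumes every time stamp begins with the day as YYYY-MM-DD.
    """
    days = []
    for stamp in timestamps:
        day = stamp.strip()[:10]  # keep only the YYYY-MM-DD part
        if day not in days:
            days.append(day)
    return days
```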

Do we need FITS XML in the AIP?

This is a workflow issue first, but do we actually need and use all the FITS XML being saved in the AIP? Or is the information in preservation.xml enough? Scaling issue for ARCHive application.

Add error handling if tar is not made

package()

Have had cases where the tar is not made. Capture results from subprocess and test for errors. Could not use move_error due to Permissions problem. I think 7zip must not let go of the file but there doesn't seem to be a way to close it (should be automatic). I tried absolute and relative paths and making sure the paths exist, but no luck. Instead, I'm having manifest() check if an AIP exists and if not it stops. If so, it finishes the log as complete. Don't need to do this for zip because if the tar can run, it should be able to zip.

For testing, can reproduce the error by changing the source path in the tar command.
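
A sketch of capturing the results from subprocess and testing for errors. The helper name run_and_check is hypothetical, and the command passed to it would be whatever tar command the script builds.

```python
import subprocess


def run_and_check(command):
    """Run a command given as a list; return (succeeded, message) without raising."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError as error:
        # Raised when the program itself cannot be started (for example, not found).
        return False, str(error)
    if result.returncode != 0:
        # stderr usually carries the tool's explanation; fall back to stdout.
        return False, (result.stderr or result.stdout).strip()
    return True, result.stdout.strip()
```

A nonzero return code would then be logged as an error for that AIP. Checking that the tar file exists afterwards, as manifest() does now, remains a useful second check.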

Format identification if PUID

When there is a format identification with a PUID, do not include another format identification that has the same name and version without the PUID. I gave it a try in August 2022 but wasn’t successful. I could figure out the xpaths (below) but not how to combine them into choose/when logic. Might need to nest the tests: first split on whether or not there is a PUID, and then do the comparison with version.

Find elements without a PUID:
not(externalIdentifier[@type='puid'])

Find elements with a PUID:
/externalIdentifier/@type='puid'

Find elements before or after the element (directly or with something between) with the same format.
@Format = following-sibling::identity/@Format or @Format = preceding-sibling::identity/@Format

Find elements before or after the element (directly or with something between) with the same format and version. Error: this will incorrectly match if one element has the same name and a different element has the same version. This could easily happen since some version numbers are very common.
@Format = preceding-sibling::identity/@Format and version = preceding-sibling::identity[@Format = @Format]/version
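
For comparison, the same rule can be sketched in Python, where name and version are compared as a pair, which avoids the false match described above. This is not the XSLT solution. It assumes a simplified structure without namespaces, where each identity element has a format attribute, a version child, and an optional externalIdentifier child with type "puid". The function name is hypothetical.

```python
from xml.etree import ElementTree as ET


def drop_redundant_identities(parent):
    """Remove identity elements without a PUID when a sibling with the
    same format and version has one. Edits the parent element in place."""

    def key(identity):
        # Format name and version are compared together, as one pair.
        return identity.get("format"), identity.findtext("version")

    def has_puid(identity):
        return identity.find("externalIdentifier[@type='puid']") is not None

    identities = parent.findall("identity")
    with_puid = {key(identity) for identity in identities if has_puid(identity)}
    for identity in identities:
        if not has_puid(identity) and key(identity) in with_puid:
            parent.remove(identity)
```

An identity with the same name but a different version, or the same version but a different name, is kept.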

Make test for AIP class instance

There is no test for creating an instance of an AIP class, but it is also very simple (assigning parameters to variables and default values for variables).

Maintain original folder names

Currently, the top-level folder of each AIP is renamed to the AIP ID, but sometimes that is a meaningful title supplied by the creator. The digital archivist is moving that top-level folder into another folder which can be named with the AIP ID. Have the script do this automatically instead? Or do we need flexibility to not always retain the name of the top folder?
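
If the script did this automatically, the move could look like the sketch below. The function name wrap_in_aip_folder is hypothetical, and whether the behavior should be optional is still the open question above.

```python
from pathlib import Path


def wrap_in_aip_folder(original_folder, aip_id):
    """Move a creator-named folder inside a new sibling folder named with the AIP ID.

    Returns the new location of the original folder.
    """
    original_folder = Path(original_folder)
    aip_folder = original_folder.parent / aip_id
    # mkdir raises FileExistsError rather than merging into an existing folder.
    aip_folder.mkdir()
    new_location = aip_folder / original_folder.name
    original_folder.rename(new_location)
    return new_location
```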

Improve unit tests with XML comparisons

Current method, using open() to read the file and compare to another file in the repo with the expected values, can cause errors due to differences in the order of element attributes. Is there a way to check the contents of the XML are the expected values that would be aware of the different components and not care about the order?

Simplify preservation.xml creation

Is it possible to use Python instead of XSLT for making and validating the preservation.xml? Python is much more familiar to current developers than XSLT, and it would eliminate two dependencies (saxon and xmllint, which is the only reason we still install Strawberry Perl). It could also cut out intermediate steps for making combined and cleaned FITS XML before getting to preservation.xml.

Would it be possible to use JSON instead of XML?

Make test for check_configuration function

I haven't figured out how to test this yet. The configuration file is read when the script runs, so I can't just pass incorrect values to the function.

The code is simple (test if variable exists and test if path exists) and I have tested it manually with configuration files that have errors, so it is a lower priority to figure this out.
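
One possible way to test it is to temporarily replace values on the already-imported configuration module with unittest.mock.patch.object. The sketch below uses a stand-in module and a simplified stand-in function so that it runs on its own. The variable name FITS and the returned list of messages are assumptions, not the real check_configuration().

```python
import os
import types
from unittest import mock

# Stand-in for the real configuration module; the variable name is hypothetical.
configuration = types.ModuleType("configuration")
configuration.FITS = os.getcwd()


def check_configuration():
    """Simplified stand-in: return a list of problems found in the configuration."""
    errors = []
    if not hasattr(configuration, "FITS"):
        errors.append("FITS variable is missing")
    elif not os.path.exists(configuration.FITS):
        errors.append(f"FITS path does not exist: {configuration.FITS}")
    return errors


def test_bad_path():
    # patch.object swaps the value only for the duration of the with block.
    with mock.patch.object(configuration, "FITS", "path/that/does/not/exist"):
        assert check_configuration() == [
            "FITS path does not exist: path/that/does/not/exist"
        ]
```

In the real tests, the configuration module imported by the script would be patched in the same way, so no separate configuration file with errors is needed.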
