nasa-pds / deep-archive

PDS Open Archival Information System (OAIS) utilities, including Submission Information Package (SIP) and Archive Information Package (AIP) generators

Home Page: https://nasa-pds.github.io/deep-archive/

License: Other

Languages: Python 24.20%, HTML 75.80%
Topics: nssdca, nasa, oais, pds4, nasa-pds, pds

deep-archive's People

Contributors

jimmie, jordanpadams, nutjob4life, pdsen-ci, tloubrieu-jpl


deep-archive's Issues

Develop initial AIP generation component

L5.NSS.2 – The tool shall be capable of generating a valid Archive Information Package transfer manifest (Product_AIP) and PDS4 XML label in accordance with the PDS4 Information Model.

This task includes:

  • creation of the new AIP product
  • updating the SIP to include the checksum of the newly created AIP (this value is currently all zeros)

Overview

Here is a template AIP to work from: template_Product_AIP_v1D00.xml.txt

You will also need to create two other files, named as described below (the checksum manifest could eventually be an input file; there is a TBD PR out there to update this):

File Naming

Similar to the SIP file naming, let's use the bundle name with some add-ons (correct me if this is not what we did for the SIP); a sketch follows the list:

  • <bundle_lid>_aip_v1.0.xml
  • <bundle_lid>_checksum_manifest_v1.0.tab
  • <bundle_lid>_transfer_manifest_v1.0.tab
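
As a point of reference, here is a minimal sketch of deriving those three filenames from the bundle LID; the helper name and the fixed "v1.0" label version are assumptions, not settled conventions.

    def aip_filenames(bundle_lid, label_version="1.0"):
        """Derive the three AIP output filenames from a bundle logical identifier.

        Only the last component of the LID is used, e.g. "insight_cameras" from
        "urn:nasa:pds:insight_cameras".
        """
        base = bundle_lid.split(":")[-1]
        return {
            "label": f"{base}_aip_v{label_version}.xml",
            "checksum_manifest": f"{base}_checksum_manifest_v{label_version}.tab",
            "transfer_manifest": f"{base}_transfer_manifest_v{label_version}.tab",
        }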

Things to be aware of

  • From the looks of it, we probably should have just created the AIP first and then tackled the SIP.
  • It appears we basically need to output some other intermediate files that are just different permutations of the information needed for the SIP manifest.
  • Per note above, the checksum manifest will eventually be an optional input to the software, so keep that in mind during implementation.

Applicable requirements
Primary - 🦄 #45
Secondary - 🦄 #48 🦄 #50 🦄 #52

Add READMEs to AIPs and SIPs

In the initial design and implementation of AIPs and SIPs, READMEs were not taken into account. See requirement #50.1.1 for more details.

Applicable requirements
🦄 #50

Add timestamp to SIP / AIP product LIDs and filenames

Is your feature request related to a problem? Please describe.
Since people may be generating these products for bundles that reference collections by LID only, we need to add some other information to the product LIDs to ensure uniqueness. We could version the SIPs, but I think the best option is to add a timestamp (YYYYMMDD) to the logical_identifier and filename of the SIP / AIP products so that they are unique. Otherwise, the software will produce identically named files when run for each data release.

e.g.:

insight_cameras_v1.0_20200601_aip_v1.0.xml

with a LID of:

urn:nasa:pds:system_bundle:product_aip:insight_cameras_v1.0_20200601

Open to other ideas?
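
A minimal sketch of the proposal, assuming a UTC YYYYMMDD stamp and that the trailing "_v1.0" in the filename is the label version (the helper name is hypothetical):

    from datetime import datetime, timezone

    def timestamped_names(bundle_lid, bundle_vid, kind="aip"):
        """Build a timestamped filename and LID for a SIP/AIP product."""
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d")  # YYYYMMDD as proposed
        base = f"{bundle_lid.split(':')[-1]}_v{bundle_vid}_{stamp}"
        filename = f"{base}_{kind}_v1.0.xml"
        lid = f"urn:nasa:pds:system_bundle:product_{kind}:{base}"
        return filename, lid

    # On 2020-06-01, timestamped_names("urn:nasa:pds:insight_cameras", "1.0") would give
    # ("insight_cameras_v1.0_20200601_aip_v1.0.xml",
    #  "urn:nasa:pds:system_bundle:product_aip:insight_cameras_v1.0_20200601")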

Refs #24

AIP Generator not outputting correctly for spice example bundle

Describe the bug

Run pds-deep-archive or aipgen (either command) as follows:

$ bin/pds-deep-archive --site PDS_NAI --offline --bundle-base-url https://naif.jpl.nasa.gov/pub/naif/pds/pds4/insight/ insight_spice/bundle_insight_spice_v001.xml 
$ bin/aipgen insight_spice/bundle_insight_spice_v001.xml

This results in the checksum manifest and transfer manifest containing all versions of the collections, but they should only contain those referenced from bundle_insight_spice_v001.xml.

Test Data / Additional context

See #39 for related issues with the spice example bundle.

SIP and Transfer Manifest duplicate files and LIDVIDs

Describe the bug
Running against the LADEE test bundle produces duplicates of the applicable files in the SIP and transfer manifest.

Other thoughts
This bug fix should include a regression test, run as part of CI/CD, that executes pds-deep-archive against the LADEE test bundle and verifies that the output matches what we have in test/data/ladee_test/valid/. A sketch follows.
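
A rough sketch of such a regression test, assuming the generated files land in the working directory under the same names as the files in test/data/ladee_test/valid/ and that the outputs are deterministic (timestamp-bearing fields may need normalization before a byte-for-byte comparison):

    import filecmp
    import pathlib
    import subprocess
    import unittest

    class LadeeRegressionTest(unittest.TestCase):
        """Regenerate the LADEE SIP/AIP and compare against the known-good files."""

        def test_outputs_match_valid(self):
            bundle = "test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml"
            subprocess.check_call([
                "pds-deep-archive", "--offline", "-s", "PDS_ATM",
                "-b", "https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/", bundle,
            ])
            for expected in pathlib.Path("test/data/ladee_test/valid").iterdir():
                with self.subTest(expected=expected.name):
                    # The freshly generated file is expected in the working directory.
                    self.assertTrue(filecmp.cmp(expected, expected.name, shallow=False))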

Implement SIP generation offline capabilities

Formal requirements to come (#3), but for now, we need to get something together to accomplish this.

Informal requirements:

  1. Software shall produce a SIP Manifest table (and eventually an AIP manifest (#2)) with an accompanying label
  2. The SIP Manifest should be a tab-separated, 4-column manifest table with columns checksum, checksum_type, file_specification_url, and LIDVID (see example ladee_mission_bundle_SIP_4col_manifest_v1.0.xml and ladee_mission_bundle_SIP_4col_manifest_v1.0.tab in the attached .zip); a sketch of writing this table follows the list
  3. A user will input a bundle directory, and the software will read the bundle.xml at the top level of the bundle. The software then proceeds to query the registry for all primary products associated with that bundle (using the bundle lidvid) (not sure if this is possible; if not, we may need to parse all the collection lidvids/lids from the bundle.xml and use those to query the registry).
  4. From the registry output, we should parse the necessary fields to populate the SIP Manifest.
  5. More to come...
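
A minimal sketch of writing that 4-column, tab-separated table (the function name and one-row-per-product layout are illustrative only):

    def write_sip_manifest(rows, path):
        """Write the tab-separated SIP manifest.

        ``rows`` yields (checksum, checksum_type, file_specification_url, lidvid)
        tuples, e.g. assembled from registry query results.
        """
        with open(path, "w") as out:
            for checksum, checksum_type, url, lidvid in rows:
                out.write(f"{checksum}\t{checksum_type}\t{url}\t{lidvid}\n")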

Use https://pds-dev-el7.jpl.nasa.gov/services/registry/pds/select?q=*%3A* for testing.

Example Python implementation, SIP/AIP products, templates, etc: aaaZip_NSSDC_20190820.zip
Test bundle: https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/mission_bundle/ (fixed URL)

Applicable requirements
Primary - 🦄 #48
Secondary - 🦄 #49 🦄 #45 🦄 #46

Improve CI/CD to only build off master and PRs and use datetime for PyPi deployment

Requested update
This update includes two proposed mods:

  • Only build and deploy to PyPI on PRs and on merges to master (this may already be the case, but we want to make sure we don't discourage people from pushing often to branches)
  • Per #26 (comment), it would be great to update the Test PyPI push to use a datetime-based version to avoid these spurious errors

Rationale
Eventually we plan to use the CI/CD from this repo as a template for other repos (and add it to https://github.com/nasa-pds/pds-template-repo-python), so when we start to have repos with multiple people creating PRs and branches, it would be great to avoid these false failures and the confusion they cause. Otherwise we will become numb to failures.

Update software to handle Document_File.directory_path_name

Is your feature request related to a problem? Please describe.
Another curveball example of how to specify file paths for documents.

            <Document_File>
                <file_name>image007.png</file_name>
                <directory_path_name>InSight_EastWestMGA_files</directory_path_name>
                <document_standard_id>PNG</document_standard_id>
            </Document_File>

The files can then be within that subdirectory, relative to the current XML file's location. For instance, this label is at:

urn-nasa-pds-insight_documents/document_rise/InSight_EastWestMGA.xml

so the PNG above would be at:

urn-nasa-pds-insight_documents/document_rise/InSight_EastWestMGA_files/image007.png

Example bundle at https://pds-geosciences.wustl.edu/insight/urn-nasa-pds-insight_documents/
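
A sketch of how the software could resolve these paths, assuming the standard PDS4 namespace URI and that paths are relative to the label's directory (per the example above):

    import pathlib
    import xml.etree.ElementTree as ET

    PDS = "{http://pds.nasa.gov/pds4/pds/v1}"

    def document_file_paths(label_path):
        """Yield the on-disk path of each Document_File in a label, honoring the
        optional <directory_path_name> element."""
        label_dir = pathlib.Path(label_path).parent
        for doc_file in ET.parse(label_path).iter(PDS + "Document_File"):
            name = doc_file.findtext(PDS + "file_name")
            subdir = doc_file.findtext(PDS + "directory_path_name", default="")
            yield label_dir / subdir / name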

Applicable Requirements:
🦄 #50

Software does not exit after running against large data set

Describe the bug
When running against the large insight bundle on pds-dev-el7, the software just hangs after it completes the document generation:

[jpadams@pds-dev-el7 ~]$ pds-deep-archive -s PDS_IMG -b https://pds-imaging.jpl.nasa.gov/data/nsyt/insight_cameras/ /data/home/pds4/insight_cameras/bundle.xml
INFO 👟 PDS Deep Archive, version 0.1.0
INFO 🏃‍♀️ Starting AIP generation for /data/home/pds4/insight_cameras/bundle.xml
WARNING 📯 XML file /data/home/pds4/insight_cameras/catalog/catalog.xml lacks a logical_identifier; skipping it
INFO 🎉 Success! AIP done, files generated:
INFO • Checksum manifest: insight_cameras_v1.0_checksum_manifest_v1.0.tab
INFO • Transfer manifest: insight_cameras_v1.0_transfer_manifest_v1.0.tab
INFO • XML label for them both: insight_cameras_v1.0_aip_v1.0.xml
WARNING 🤚 Notice! --include-all-collections isn't yet supported; will assume it's enabled for now
INFO 🏃‍♀️ Starting SIP generation for /data/home/pds4/insight_cameras/bundle.xml

(hangs here)

I am guessing the database or something else is being left open, which is causing this bug?
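
The cause above is only a guess. If a temporary database (or similar resource) really is being left open, wrapping it in context managers along these lines would guarantee cleanup; this is purely illustrative and the names are hypothetical.

    import contextlib
    import sqlite3
    import tempfile

    def with_temp_database(work):
        """Run ``work(connection)`` against a temporary sqlite3 database, closing
        the connection and removing the file even if ``work`` raises."""
        with tempfile.NamedTemporaryFile(suffix=".sqlite3") as tmp:
            with contextlib.closing(sqlite3.connect(tmp.name)) as connection:
                return work(connection)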

Test Case 1: Bundle that references non-accumulating collections by LIDVID

Implement initial capability to accept pre-generated checksum manifest file

  • Create flag for accepting 1 or more input checksum manifest files - these files can be used in lieu of generating a checksum

  • Create flag for manifest base directory (base_dir) - the directory where the paths in the checksum manifest begin; e.g., for such a manifest the user would need to specify a base path so we would be able to track down:
    $base_dir/./data_hkl1/orbit_c/20190805T133001S610_ocm_hkL1.csv

  • Create flag for specifying manifest checksum type - MD5 (default), SHA-1, SHA-256

  • Read the manifest file into memory (should we put a cap on file size to make sure we don't crash the system?)

  • Perform the exact same functionality as "offline" mode, except when a file matches a path identified in the checksum manifest, do not generate a checksum; use the value from the manifest instead

  • Manifest(s) will look something like this or this

  • Manifest format: assume the format is MD5Deep 4.n (https://pds.nasa.gov/datastandards/documents/im/current/index_1D00.html#value_pds_checksum_manifest_pds_parsing_standard_id_md5deep%204.n), which basically points to http://md5deep.sourceforge.net/; a parsing sketch follows this list
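
A sketch of reading such a manifest, assuming each non-blank line is a hex digest followed by whitespace and a path relative to base_dir (the MD5Deep / md5sum style of output):

    import pathlib

    def load_checksum_manifest(manifest_path, base_dir):
        """Map resolved file paths to the hex digests listed in the manifest."""
        checksums = {}
        base = pathlib.Path(base_dir)
        with open(manifest_path) as manifest:
            for line in manifest:
                line = line.strip()
                if not line:
                    continue
                digest, _, rel_path = line.partition(" ")
                checksums[(base / rel_path.strip()).resolve()] = digest
        return checksums

A lookup in this dict during the "offline" walk would then decide whether to reuse the listed digest or compute one.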

See internal Github issues for additional comments.

Update docs to include release instructions

Not sure how to tag / build the software beyond its current test deployment.

Additionally, we want to make sure all the CLI tool versions sync up with the overall package versioning (it looks like they are hard-coded right now).

Further AIP and SIP generation speed improvements

The Issue

Issue #13 demonstrated how certain data (like the 1.3 TiB insight_cameras) basically caused sipgen to never terminate. We've addressed that by using better algorithms and adding caching, but we can go a few steps further.

For example, sipgen still does some redundant XML parsing, and aipgen does some single-threaded hash generation that hits the Python GIL. In issue #13 we architected things to include a temporary sqlite3 database that can be shared by numerous processes (using the multiprocessing module, for example), which is ripe for further optimization.

Some Ideas

  • Additional use of sqlite3 in sipgen: process XML files just once and store the useful information in multiple tables
  • Multiprocessing: in sipgen, use parallel processes and the sqlite3 database to accelerate processing
  • Producer-consumer: make multiprocessing workers consume XML and hash computations as they are done; in aipgen, for example, make one worker walk the directory tree and pass files into a queue while multiple other workers snag files for MD5 digests (a sketch follows this list)
  • Streaming: provide information as it becomes available so users have feedback that things are getting done instead of wondering whether things are just hanging
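
A sketch of the producer-consumer idea for aipgen's MD5 work, using a multiprocessing pool so digests stream back as they complete (the function names are illustrative, not the existing aipgen API):

    import hashlib
    import os
    from multiprocessing import Pool

    def _md5(path):
        """Compute the MD5 digest of one file; runs in a worker process."""
        digest = hashlib.md5()
        with open(path, "rb") as stream:
            for block in iter(lambda: stream.read(1024 * 1024), b""):
                digest.update(block)
        return path, digest.hexdigest()

    def parallel_md5(root, processes=None):
        """Walk ``root`` in the parent process (producer) while pool workers
        (consumers) hash files, yielding results as soon as they finish."""
        paths = (os.path.join(dirpath, name)
                 for dirpath, _, names in os.walk(root) for name in names)
        with Pool(processes) as pool:
            yield from pool.imap_unordered(_md5, paths)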

Context

See issue #13 and the commits made against it.

Improved user operation and installation guides with github pages

This task is to generate GitHub Pages for the user guide using our favorite Python docs generator.

  • Deploy pds-deep-archive to pypi for easy install
  • Dependencies/Assumptions - e.g. python3 must be installed
  • Installation Guide - let's start with buildout instructions, then have alternative instructions using virtualenv, since most of our users will not be able to use buildout
  • Operation Guide - very basic "this is how you run it" with an output of -h

SIP Generator not outputting correctly for spice example bundle

Describe the bug
When I run the latest pds-deep-archive software, there are a few issues with the output:

bin/pds-deep-archive -b https://naif.jpl.nasa.gov/pub/naif/pds/pds4/insight/ -s PDS_NAI insight_spice/bundle_insight_spice_v001.xml -d

Note: I pointed the software at v1 of the bundle (bundle_insight_spice_v001.xml), so it should take that version and the versions it references into account.

  1. The SIP only contains 1 product (should contain many more)
  2. The checksum manifest and transfer manifest both include all versions of the collections, but these should only contain the versions referenced from bundle_insight_spice_v001.xml. See #41

Implement PDS Deep Archive as-a-service capability

Develop SIP Gen functionality to be stood up as a set of micro-services for generating SIPs alongside the data.

Each node has a SIP Gen service running; we can ping the service to generate a SIP/AIP, and it will return the necessary files.
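
One hypothetical shape for such an endpoint, sketched with Flask purely for illustration; the route, payload fields, and the generate_sip_and_aip() helper are all assumptions, not an agreed design.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def generate_sip_and_aip(bundle_path, site_id):
        """Placeholder standing in for the existing generation code; it would
        return the locations of the generated SIP/AIP files."""
        return []

    @app.route("/deep-archive/generate", methods=["POST"])
    def generate():
        """Accept a bundle location and site ID, run the generators, and return
        where the resulting files can be fetched."""
        payload = request.get_json()
        files = generate_sip_and_aip(payload["bundle"], payload["site_id"])
        return jsonify({"files": files})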

SIP label contains invalid manifest URL using file path

Describe the bug

<manifest_checksum>ee7049f6cc8688ac75e93527fe293541</manifest_checksum>
<checksum_type>MD5</checksum_type>
<manifest_url>file:/home/jpadams/insight.spice_v4.0_sip_v1.0.tab</manifest_url> << ERROR <<
<aip_lidvid>urn:nasa:pds:system_bundle:product_aip:insight.spice_v4.0::1.0</aip_lidvid>
<aip_label_checksum>bb5f119d287d01823a3d84f72608c23d</aip_label_checksum>

The value flagged as ERROR above needs to be a URL and should never be a file path.

NOTE: This should really be validated in the Schematron / Schema formation rules, but the field is indicated as a plain String.
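
One possible guard, assuming http(s) URLs (built from --bundle-base-url) are the only acceptable values for <manifest_url>:

    from urllib.parse import urlparse

    def validate_manifest_url(url):
        """Reject file paths and file: URLs before they reach the SIP label."""
        if urlparse(url).scheme.lower() not in ("http", "https"):
            raise ValueError(f"manifest_url must be an http(s) URL, got {url!r}")
        return url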

Improve SIP Gen performance for very large data sets

Is your feature request related to a problem? Please describe.
When we get a bundle with hundreds of thousands of products, we need the ability to multi-thread this processing. I could see us needing to do this both locally on a server, within Python or using GNU parallel, and in the future on something like Lambda.

Not sure exactly where the components should be broken out, but one possibility would be to separate the crawling from the per-file processing (checking for LIDs, generating checksums, etc.); a rough sketch follows below.

NOTE: This may not be a problem once we get registry-based generation up and running; we may just need to update the registry instead to handle queries based on a manifest, which would then return a checksum manifest from the registry.
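
A rough sketch of that split, using a process pool so the crawl stays in one place while the per-file work fans out (the names and the MD5-only body are placeholders for the real LID/checksum processing):

    import hashlib
    import os
    from concurrent.futures import ProcessPoolExecutor

    def _process_file(path):
        """Per-file work: here just an MD5 digest; LID checks would go here too."""
        digest = hashlib.md5()
        with open(path, "rb") as stream:
            for block in iter(lambda: stream.read(1024 * 1024), b""):
                digest.update(block)
        return path, digest.hexdigest()

    def crawl_and_process(bundle_dir, workers=None):
        """Crawl the bundle in the parent process and farm per-file work out to
        a pool, keeping the walk and the processing decoupled."""
        paths = [os.path.join(dirpath, name)
                 for dirpath, _, names in os.walk(bundle_dir) for name in names]
        with ProcessPoolExecutor(max_workers=workers) as executor:
            return dict(executor.map(_process_file, paths))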

Update software to only include latest collection when bundle references LIDs

Is your feature request related to a problem? Please describe.
Currently, when a bundle only references LIDs, the software looks for all matches for a LID in collection products. We should only grab the latest version.

NOTE 💥: There should be a flag to ignore this so we can use this software on previous releases of PDS4 data. Something like:

--include-all-collections    For bundles that reference collections by LID, this flag
                             will include ALL versions of collections in the bundle. By
                             default, the software only includes the latest version of
                             the collection.
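
A minimal sketch of the version selection, assuming the candidate matches are available as LIDVID strings:

    def latest_by_lid(collection_lidvids):
        """Keep only the highest version of each collection LID, given LIDVIDs
        such as "urn:nasa:pds:insight_cameras:data::2.0"."""
        latest = {}
        for lidvid in collection_lidvids:
            lid, _, vid = lidvid.partition("::")
            key = tuple(int(part) for part in vid.split("."))  # "2.1" -> (2, 1)
            if lid not in latest or key > latest[lid][0]:
                latest[lid] = (key, lidvid)
        return [lidvid for _, lidvid in latest.values()]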

Applicable requirements
Primary - 🦄 #50 (see Assumption 3)

The tool shall include products in the manifests based upon the relationships described in the PDS4 bundle and collection metadata

Requirement

The tool shall include products in the manifests based upon the following criteria (see Details below for more detailed information):

  1. Bundle (B) specified as input to the tool and any associated readme (R)
  2. Primary Collections (C1, C2, C3) associated with that Bundle (B)
  3. Primary products associated with each of those Collections (C1, C2, C3)

Assumptions

  1. Products include both the label (.xml) and the files referenced from it.
  2. Any file_name referenced in a label can be assumed to be in the same directory as the parent product.
  3. For any collections referenced by LID, only the latest version of the associated collections will be included in SIPs and AIPs (with a flag to disable this and include all collections for backwards compatibility).

Details

1. Bundle (B) specified as input to the tool and any associated readme (R)

  1. Bundle XML (included as input to the tool)
  2. Readmes referenced by the bundle (//File_Area_Text/File/file_name) (see Assumption 2 above for where to look)

2. Primary Collections (C1, C2, C3) associated with that Bundle (B)

To identify the primary collections of a bundle, get the LIDs/LIDVIDs per the following (a parsing sketch follows the Details section):

  1. All //Bundle_Member_Entry/lidvid_reference/ + //Bundle_Member_Entry/member_status/value() == Primary
  2. All //Bundle_Member_Entry/lid_reference/ + //Bundle_Member_Entry/member_status/value() == Primary (see Assumption 3 above)

To find those products, assume any collections referenced by the bundle will be in the same directory or in any sub-directory of the input bundle.

3. Primary products associated with each of those Collections (C1, C2, C3)

From collection labels C1, C2, C3, here is how we can get the product LID/LIDVIDs:

  1. Parse out //File_Area_Inventory/File/file_name/ and //Inventory/field_delimiter/ to prepare to parse the inventory.
  2. Based upon Assumption 2 and the information from step 1, find the file and parse all primary products (lines starting with a P, not an S (secondary))
  3. Within the collection directory or all sub-directories, include all products based upon the LIDVIDs specified in step 2.
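
A sketch of steps 2 and 3 above, assuming the standard PDS4 namespace URI and a plain delimited inventory table; the function names are illustrative.

    import xml.etree.ElementTree as ET

    PDS = "{http://pds.nasa.gov/pds4/pds/v1}"

    def primary_member_references(bundle_xml):
        """Return the lid_reference/lidvid_reference of every Primary
        Bundle_Member_Entry in a bundle label."""
        refs = []
        for entry in ET.parse(bundle_xml).iter(PDS + "Bundle_Member_Entry"):
            if entry.findtext(PDS + "member_status", default="").strip() != "Primary":
                continue
            ref = (entry.findtext(PDS + "lidvid_reference")
                   or entry.findtext(PDS + "lid_reference"))
            if ref:
                refs.append(ref.strip())
        return refs

    def primary_inventory_lidvids(inventory_tab, delimiter=","):
        """Parse a collection inventory, keeping only Primary members
        (lines whose first field is P, not S)."""
        lidvids = []
        with open(inventory_tab) as inventory:
            for line in inventory:
                fields = [field.strip() for field in line.split(delimiter)]
                if len(fields) >= 2 and fields[0] == "P":
                    lidvids.append(fields[1])
        return lidvids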

Update SIP LID to include bundle version id

Currently the SIP LID uses the following naming scheme:

urn:nasa:pds:system_bundle:product_sip_deep_archive:${BUNDLE_LID}

However, this will not work for accumulating bundles, so we need to include the bundle version ID as well:

urn:nasa:pds:system_bundle:product_sip_deep_archive:${BUNDLE_LID}.${BUNDLE_VID}
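
A one-function sketch, assuming ${BUNDLE_LID} above stands for the final component of the bundle's logical identifier:

    def sip_lid(bundle_lid, bundle_vid):
        """Build the SIP LID with the bundle version ID appended, so each release
        of an accumulating bundle gets a distinct identifier."""
        bundle_id = bundle_lid.split(":")[-1]  # e.g. "insight_cameras"
        return f"urn:nasa:pds:system_bundle:product_sip_deep_archive:{bundle_id}.{bundle_vid}"

    # sip_lid("urn:nasa:pds:insight_cameras", "2.0")
    # -> "urn:nasa:pds:system_bundle:product_sip_deep_archive:insight_cameras.2.0"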

Update CI/CD to fix PyPi deployment failure and improve documentation deployment

Tasks

Update Usage instructions to more clearly define command-line arguments

Below, I replaced the prompt "[pds-deep-archive] /Users/rchen>" with "%" for clarity. I'm running tcsh.

/Users/rchen> source $HOME/.virtualenvs/pds-deep-archive/bin/activate.csh
% pds-deep-archive --version
pds-deep-archive 0.1.0
% pds-deep-archive -s PDS_ATM -b https://atmos.nmsu.edu/PDS/data/PDS4/LADEE test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml
usage: pds-deep-archive [-h] [--version] -s
{PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
[-n] -b BUNDLE_BASE_URL [-i] [-d | -q]
IN-BUNDLE.XML
pds-deep-archive: error: argument IN-BUNDLE.XML: can't open 'test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml': [Errno 2] No such file or directory: 'test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml'

[ATMOS seemingly reorganized their directory, but what I think is correct doesn't work either]

% curl https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/mission_bundle/LADEE_Bundle_1101.xml
[I'm manually replacing lessthans with the ampersand equivalent]
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://pds.nasa.gov/pds4/schema/released/pds/v1/PDS4_PDS_1101.sch" ?>
[snip...]
% pds-deep-archive -s PDS_ATM -b https://atmos.nmsu.edu/PDS/data/PDS4/LADEE mission_bundle/LADEE_Bundle_1101.xml
usage: pds-deep-archive [-h] [--version] -s
{PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
[-n] -b BUNDLE_BASE_URL [-i] [-d | -q]
IN-BUNDLE.XML
pds-deep-archive: error: argument IN-BUNDLE.XML: can't open 'mission_bundle/LADEE_Bundle_1101.xml': [Errno 2] No such file or directory: 'mission_bundle/LADEE_Bundle_1101.xml'

Throw fatal error upon encountering an invalid logical identifier

L5.NSS.11 - The tool shall require that a Submission Information Package manifest only contain products with valid logical identifiers according to the PDS4 Standards Reference.

According to SR 6D.2, logical identifiers must follow this formation rule:

urn:nasa:pds:<bundle_id>:<collection_id>:<product_id>

where <bundle_id> and <collection_id> are parsed from their identifiers.

If the SIP Gen software encounters invalid LIDs in bundle references or collection inventories, it should display a critical error for each failed LID and then exit (stopping before going to the product level and generating the manifest).

At the end, display some notice like the following:

SIP Generation Failed: NSSDCA Submission Information Package cannot be generated for bundles that contain malformed logical identifiers.

Please contact [email protected] for a workaround method for submission.
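
A sketch of the check, using a rough approximation of the SR 6D.2 formation rule (the exact character set and limits should come from the Standards Reference; the exit message mirrors the notice above):

    import re
    import sys

    # Rough approximation of the formation rule: urn:nasa:pds: plus 1-3 components.
    LID_PATTERN = re.compile(r"^urn:nasa:pds(:[a-z0-9][a-z0-9._-]*){1,3}$")

    def check_lids_or_exit(lids):
        """Report every malformed LID, then exit before any manifests are written."""
        bad = [lid for lid in lids if not LID_PATTERN.match(lid)]
        if bad:
            for lid in bad:
                print(f"CRITICAL: malformed logical_identifier: {lid}", file=sys.stderr)
            sys.exit("SIP Generation Failed: NSSDCA Submission Information Package "
                     "cannot be generated for bundles that contain malformed "
                     "logical identifiers.")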

Applicable requirements
Primary - 🦄 #53
