nasa-pds / pds-registry-app

(deprecated) See https://github.com/NASA-PDS/registry for new and improved capability.

Home Page: https://nasa-pds.github.io/registry/

License: Other

Language: Groovy 100.00%
Topics: nasa-pds, elastic-search, nasa, pds-api, pds

pds-registry-app's Introduction

pds-registry-app

DOI 🤪 Unstable integration & delivery 😌 Stable integration & delivery

This application enables a Planetary Data System node to register all its data products for long-term preservation and sharing with the rest of the PDS system.

It is composed of 2 services: registry and harvest.

The purpose of this repository is to integrate these services and package them conveniently for users.

👥 Contributing

See the pds-engineering documentation produced by mvn site.

The latest stable version is published at: https://nasa-pds.github.io/pds-registry-app/install/index.html

Within the NASA Planetary Data System, we value the health of our community as much as the code. Towards that end, we ask that you read and practice what's described in these documents:

  • Our contributor's guide delineates the kinds of contributions we accept.
  • Our code of conduct outlines the standards of behavior we practice and expect of everyone who participates with our software.

Installation

See the pds-engineering documentation produced by mvn site (to be published).

Updating and Releasing

The PDS uses GitHub Actions and the Roundup to automate updating and releasing. The instructions that follow are kept for posterity.

Update the pom.xml with the version of the package (e.g. 1.0.0) and the versions of the services (registry, harvest) that were validated.
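
If the versions-maven-plugin is available, the package version can be set from the command line instead of editing pom.xml by hand (a minimal sketch; the service versions still need to be updated manually):

mvn versions:set -DnewVersion=1.0.0  ## rewrite the version in pom.xml
mvn versions:commit                  ## accept the change and remove the backup POM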

Create a tag and publish it on GitHub:

git tag v1.0.0
git push origin --tags

Prepare the package:

mvn compile  ## get the sub-packages
mvn linkcheck:linkcheck  ## check the links in the maven site and create a report; this requires JDK 1.8, later versions fail
mvn site     ## prepare the documentation
mvn package  ## generate the package
mvn github-release:github-release  ## publish the package on github  # TO BE DONE REPLACE WITH CONMAN or GITHUB ACTION

Docker

Information about containerizing the PDS Registry App with Docker follows.

Build

docker image build --build-arg version_reg_app=$(git rev-parse HEAD) \
             --file Dockerfile.local \
             --tag pds_registry_app:$(git rev-parse HEAD) \
             .
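
To confirm the image was built and tagged with the current commit hash (docker image ls filters on the repository name):

docker image ls pds_registry_app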

Run

Running consists of setting up Elasticsearch, loading test data, and verifying that things work.

Elasticsearch

  1. prepare: mkdir /tmp/es /tmp/output
    These two directories store the data between steps. If there is an existing Elasticsearch database or harvest working area, this step is not needed.
  2. fetch: docker pull elasticsearch:7.10.1
    The image only needs to be fetched once.
  3. prepare a network with docker network create pds, if not done previously
  4. run
    docker container run --detach \
               --env "discovery.type=single-node" \
               --name es \
               --network pds \
               --publish 9200:9200 \
               --publish 9300:9300 \
               --rm \
               --user $UID \
               --volume /tmp/es:/usr/share/elasticsearch/data \
               elasticsearch:7.10.1
    
    This starts the Elasticsearch engine and ensures that any data it creates is owned by you, so that you can clean it up. If you are not using the /tmp/es directory created above, change the --volume argument as needed so that Elasticsearch has access to your data directory. Likewise, change the --user argument so that Elasticsearch has full privileges to read, write, and modify the data. A quick liveness check is shown below.
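
A minimal liveness check, assuming the default port mapping above (Elasticsearch may take a few seconds before it accepts connections):

curl 'http://localhost:9200/_cluster/health?pretty'

A "yellow" cluster status is normal for a single-node setup, since replica shards cannot be allocated.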

Install Test Data

This section uses the small test data set contained within the Docker image. The steps can serve as a template for installing your own, more complicated, data sets as well.

First, install the registry schema for Elasticsearch. Note that the URL host name is the --name argument value used when starting Elasticsearch.

docker container run --network pds \
           --rm \
           --user $UID \
           pds_registry_app:$(git rev-parse HEAD) \
              registry-manager create-registry -es http://es:9200 \
                                               -schema /var/local/registry/elastic/registry.json 

Second, we need to harvest the test data, which is currently set up in /var/local/harvest/archive. To use your own archive of data, simply add --volume /path/to/archive:/var/local/harvest/archive to the docker run portion of this command (a sketch of this variant follows the example below). Because we cannot easily run two commands in one docker run invocation, it is necessary to use an external directory for /var/local/harvest/output: the ephemeral nature of a container would erase anything written to /var/local/harvest/output between the two docker run invocations.

docker container run --network pds \
           --rm \
           --user $UID \
           --volume /tmp/output:/var/local/harvest/output \
           pds_registry_app:$(git rev-parse HEAD) harvest \
              -c /var/local/harvest/conf/examples/bundles.xml \
              -o /var/local/harvest/output
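
For reference, a sketch of the same invocation against your own archive. The host path /path/to/archive is illustrative, and you would likely also point -c at your own harvest configuration rather than the bundled example:

docker container run --network pds \
           --rm \
           --user $UID \
           --volume /path/to/archive:/var/local/harvest/archive \
           --volume /tmp/output:/var/local/harvest/output \
           pds_registry_app:$(git rev-parse HEAD) harvest \
              -c /var/local/harvest/conf/examples/bundles.xml \
              -o /var/local/harvest/output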

Third, and last, we need to ingest the harvested data from above into Elasticsearch. Things to note: one, the URL host name is the --name argument value used when starting Elasticsearch; two, the output volume must be the same mapping as in the previous step.

docker container run --network pds \
           --rm \
           --user $UID \
           --volume /tmp/output:/var/local/harvest/output \
           pds_registry_app:$(git rev-parse HEAD) \
           registry-manager load-data \
              -es http://es:9200 \
              -dir /var/local/harvest/output \
              -updateSchema n
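
Before moving on to verification, a quick sanity check on the number of ingested documents (the bundled test data should yield a count of 17):

curl 'http://localhost:9200/registry/_count?pretty'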

Verification

The verification script is written in Python 3 with the requests package, only because that is what the person doing the work knows best. A simple curl, then lots of scrolling and eyeballing, can do the same thing (see the end of this section).

$ python3 <<EOF
import requests

# Query the registry index for all documents
response = requests.get('http://localhost:9200/registry/_search?q=*')
if response.status_code == 200:
    print('successfully found elastic search')
    data = response.json()

    if not data['timed_out']:
        print('successfully searched for query')

        # The bundled test data set contains 17 products
        if data['hits']['total']['value'] == 17:
            print('successfully found all expected data')
        else:
            print('FAILED to find all expected data')
    else:
        print('FAILED query search')
else:
    print('FAILED to find elastic search')
EOF

Output should look like this if everything is working:

successfully found elastic search
successfully searched for query
successfully found all expected data
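
For reference, the curl equivalent promised above, assuming the default port mapping:

curl 'http://localhost:9200/registry/_search?q=*&pretty'

Check that hits.total.value in the response is 17. When finished, docker container stop es shuts down Elasticsearch (the --rm flag removes the container automatically) and docker network rm pds cleans up the network.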

pds-registry-app's People

Contributors

al-niessner, jordanpadams, nutjob4life, pdsen-ci, tdddblog, testpersonal, tloubrieu-jpl


Forkers

al-niessner

pds-registry-app's Issues

The service shall accept metadata for a registered artifact in a defined format

To minimize the effort for developing custom interfaces, the format of the metadata to be submitted to the Registry should follow standards defined by the technology chosen. For instance, if Apache Solr is the chosen underlying technology, the Registry should use the Apache Solr standard ingestion data formats.

L4 Requirement:

  • 🦄 L4.REG.3

registry installation's documentation may be insufficient

I'm running from https://nasa-pds-incubator.github.io/pds-app-registry/install/solr.html. I've just rebooted and want to install the "Single Node Cluster with Embedded ZooKeeper". I already had java 12 installed. I installed solr at /opt/solr. Then

% cd $SOLR_HOME/bin
% solr version
8.4.1
% java --version
java 12.0.1 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)
% solr start -cloud
*** [WARN] *** Your open file limit is currently 256.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
*** [WARN] *** Your Max Processes Limit is currently 2837.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
Waiting up to 180 seconds to see Solr running on port 8983 []
Started Solr server on port 8983 (pid=2034). Happy searching!
%

But when I set my browser to http://localhost:8983/solr, I got 404 (attached).
[Screen shot attached: Screen Shot 2020-04-04 at 23 49 45]

Manage duplicate products

Is your feature request related to a problem? Please describe.
A known concern regarding the registry is that Solr bulk upload overwrites any products with matching LIDVIDs. This ticket will look into preventing overwrites on upload and/or throwing a WARNING when a known doc exists.

Applicable requirements
🦄 TBD

The service shall require a subset of file system metadata in order to support data metrics generation

The system shall at minimum require (#52) the following file system metadata to enable the generation of system-wide metrics:

  • Base file path for product
  • File/Object Checksum (refs NASA-PDS/pds-registry-mgr-solr#60)
  • Logical identifier and version (refs NASA-PDS/pds-registry-mgr-solr#54)
  • File/Object Size
  • File/Object Timestamp (from file system) - TBD if we need this or not
  • File/Object MIME Type

Additionally, some basic PDS4-metadata should be included where applicable:

  • Label file name
  • Associated product file names (//file_name)
  • Product Class (i.e. root element of the XML)
  • Instrument Reference (LID / LIDVID)
  • Investigation Reference (LID / LIDVID)
  • Spacecraft Reference (LID / LIDVID)
  • Target Reference(s) (LID / LIDVID)

pds-app-registry doesn't build on Windows

Describe the bug
Could not build pds-app-registry on Windows

To Reproduce
Run "mvn package"
See the error
[WARNING] Cannot get the branch information from the git repository: Error while executing command.
[INFO] Executing: cmd.exe /X /C "git rev-parse --verify HEAD"
[INFO] Working directory: C:\ws\github\pds-app-registry
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.177 s
[INFO] Finished at: 2020-03-30T10:40:58-07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:buildnumber-maven-plugin:1.4:create (default) on project pds-app-registry: Cannot get the branch information from the scm repository :
[ERROR] Exception while executing SCM command.: Error while executing command. Error while executing process. Cannot run program "git" (in directory "C:\ws\github\pds-app-registry"): CreateProcess error=2, The system cannot find the file specified

Expected behavior
Project should build

Version of Software Used
Windows 10
Master branch

Enable search for Laboratories

Ticket created by @richardchenca :

It looks like search-core does nothing with laboratory context products. If deliberate, that's fine. If overlooked, I suggest we make those searchable, but feel free to talk me out of it.

Proof: I just ingested, with other context products, into pds-gamma:
urn:nasa:pds:context:facility:laboratory.sbu_cpex
urn:nasa:pds:context:instrument:sbu_cpex.ftir
urn:nasa:pds:context:instrument:sbu_cpex.raman
then ran search-core. I can find the two instruments, but I can't find the laboratory. Also, the data collections that lid_reference those three return links to the instrument but not the laboratory, e.g.
https://pds-gamma.jpl.nasa.gov/ds-view/pds/viewCollection.jsp?identifier=urn%3Anasa%3Apds%3Alab_shocked_feldspars%3Adocument&version=1.0

This may require an update to both Harvest and Registry. Not sure where the issue lies.

Update Schema Generator for handling special cases where ancestor classes are needed

pds:Internal_Reference/pds:lid_reference and pds:Internal_Reference/pds:reference_type are completely useless without the context of the ancestor class. We need to build in a mechanism to enable these "special cases" where we need additional ancestor classes to be included.

Observing System component is another special case that is pretty useless without some massaging of the data.

Acceptance criteria:
When a PDS4 label is loaded in the registry
Then the different references are stored in the document in attributes like "ref_lid_", for example ref_lid_instrument, ref_lid_investigation...

Manage PDS4 product relationships

Problem

Currently, the Registry does not "smartly" handle relationships according to the PDS4 standard. It does what it was originally intended to do, which is to be a general knowledge base of products in the system, with no real understanding of how all those products relate to one another. This Epic will add in some PDS4 domain knowledge to better enable search/browsing based on those relationships.

  1. explicit relationships between products - described in the XML label

    1. bundles referencing collections
    2. data products referencing other products
    3. context product referencing other context products (e.g. investigation -> instrument_host)
  2. implicit relationships - described in the product files themselves, or through some understanding of the domain

    1. collections referencing products - we need to pull open the .tab and create these relationships
    2. enable context product reference consistency and hierarchy - e.g. investigation -> instrument_host -> instrument -> target means investigation -> target
    3. product reference hierarchy - bundle -> collection -> product means that bundle -> product

Design Decisions

  • Should this be 1 or more new Solr collections?
  • Should this be something that runs as a cron as part of the registry / harvest installation?
  • Is this something we run as a cron and update on other DNs registries?
  • Are these collections something we just maintain at EN (e.g. no one else cares about context products, and most nodes don't care about collections, they just care about products)?

Applicable Requirements:

  • TBD new requirement

Refactor Registry Manager and Harvest to use ElasticSearch tech stack

Perform a trade study and port the Registry from Solr to ES.

Registry Manager commands:

  • create-registry
  • delete-registry
  • create-schema
  • update-schema
  • load-data
  • delete-data
  • export-data
  • set-archive-status

Harvest:

  • Use JSON as default output format
  • When generating JSON, replace '.' with '/' in field names

Design and Implement capability to handle archive_status

This may belong as part of the Tracking Service, but we definitely need to figure out this component for managing the lifecycle of products ingested.

Supported archive status values:
ARCHIVED - Passed peer review with all liens resolved. Available through the PDS4 search and retrieval services.
ARCHIVED_ACCUMULATING - Some parts of the product are ARCHIVED, but other parts are in earlier stages of the archiving process and/or have not yet been delivered to PDS; use with caution.
IN_LIEN_RESOLUTION - Peer review completed; liens are in the process of being resolved; use with caution.
IN_LIEN_RESOLUTION_ACCUMULATING - Some parts of the product are IN LIEN RESOLUTION, but other parts are in earlier stages of the archiving process and/or have not yet been delivered to PDS; use with caution.
IN_PEER_REVIEW - Under peer review at the curation node but evaluation is not complete; use with caution.
IN_PEER_REVIEW_ACCUMULATING - Some parts of the product are IN PEER REVIEW, but other parts are in earlier stages of the archiving process and/or have not yet been delivered to PDS; use with caution.
IN_QUEUE - Received at the curation node but no action has been taken by the curation node; use with caution.
IN_QUEUE_ACCUMULATING - Some parts of the product are IN QUEUE, but other parts have not yet been delivered to PDS; use with caution.
LOCALLY_ARCHIVED - Passed peer review with all liens resolved; considered archived by the curation node but awaiting completion of the standard archiving process; possible TBD items include the arrival of the archive volume at NSSDC and ingestion of context information.
LOCALLY_ARCHIVED_ACCUMULATING - Some parts of the product are LOCALLY ARCHIVED, but other parts are in earlier stages of the archiving process and/or have not yet been delivered to PDS; use with caution.
PRE_PEER_REVIEW - Being prepared for peer review under the direction of the curation node; use with caution.
PRE_PEER_REVIEW_ACCUMULATING - Some parts of the product are in PRE PEER REVIEW, but other parts are IN QUEUE and/or have not yet been delivered to PDS; use with caution.
SAFED - The product has been received by the PDS with no evaluation; product will not be formally archived.
SUPERSEDED - This product has been replaced by a newer version, implying that the product is not to be used unless the requester has specific reasons; when a product has been superseded, the Engineering Node will notify NSSDC that their databases need to be updated to advise users of the new status and the location of the replacement product.

Unable to build application due to file size limit

Describe the bug
I get this error when I try to build the software with Maven:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:3.1.1:single 
(bin-release) on project pds-app-registry: Execution bin-release of goal
org.apache.maven.plugins:maven-assembly-plugin:3.1.1:single failed: group id '703763885' 
is too big ( > 2097151 ). Use STAR or POSIX extensions to overcome this limit -> [Help 1]

Develop Registry reporting / visualization component

Provide something like Banana (for Solr) or some other visualization component that can sit on top of the Registry and Search Services to enable visualization of / access to metadata out of the box.

This can be very simplistic, but just something that comes built-in so users can easily see things like:

  • archive metrics based on registry data
  • set of attributes that exist in the indices
  • attribute value sets

Actors: PDS Node Personnel, PDS EN Personnel (not especially tech savvy)
Use Cases:

  • As a PDS Node, I want to be able to show a pie chart of all the different types of data products ingested into the PDS (Product Collection vs. Bundle vs. Products).
  • As a PDS Node, I want to see some simple visualization of raw vs. derived vs. partially processed data.
  • As a PDS Node, I need to be able to easily answer questions like:
    • How many MER products are in the archive? How much total data size?
    • How many MER products by instrument are in the archive? How much total data size?

Develop ability to deploy Test vs Operations Registry environments

Operational deployment documentation should identify both test and operations environments.

Some stories to consider:

  • As a DN User, I want to be able to use a test environment to play around with registry data
  • As a DN User, I want to be able to swap all test data into operations.
  • As a DN User, I want to be able to swap some subset of my test data into operations.
  • As a DN User, I want to be able to roll back changes made in operations to a previous version.

Update POM to use PDSEN Parent POM

See validate or other component POM. Once complete, a lot of these details can be removed.

  <parent>
    <groupId>gov.nasa</groupId>
    <artifactId>pds</artifactId>
    <version>1.3.0</version>
  </parent>

Harvest does not work with registry manager

Describe the bug
When I follow the test procedure for the registry app, the record produced by harvest cannot be saved in the registry with the registry-manager command.

To Reproduce
Steps to reproduce the behavior:

  1. clone pds-app-registry
  2. generate the site with mvn site
  3. follow instructions in install/test
  4. See error:
    $ registry-manager load-data -filePath ~/tmp/harvest_result00/
    Loading file: /Users/loubrieu/tmp/harvest_result00/solr-docs.xml
    ERROR: Error from server at http://localhost:8983/solr: ERROR: [doc=::] Error adding field 'vid'='' msg=empty String

Expected behavior
No error; the record should be inserted in the registry.

Version of Software Used
harvest 3.2.0
registry 3.2.0
pds-app-registry 0.1.0
jdk 11

Desktop (please complete the following information):

  • OS: macOS

Beta test operational deployment

Some possible discussion points / documentation updates:

  • Per operate/reg-custom.html, "When you load data, unknown (undefined) fields are ignored." Does Solr still throw some sort of warning message so the user knows there was an issue with the metadata?
  • Use of /tmp: most PDS node users will literally use exactly the path we say. /tmp is dangerous because it can get wiped out and often does not have enough disk space to handle large data ingestions. We should say from the outset "put all your data in some data directory you know is stable and has X storage capabilities", create that directory (do we want to create an environment variable or just let the nodes figure it out?), and then use it throughout the docs.
  • Generating Solr fields from PDS4 data dictionaries and Updating Solr schema from PDS4 data dictionaries: these look excellent. I think we need to clarify the docs elsewhere: if you try to ingest data and it fails because fields do not exist, it may mean the registry schema has not been updated with the data dictionaries for that data.
  • From prod/zookeeper-vm.html, for Install Java, let's just link to the Java Installation page we already have.
  • Production Deployment: the details here are fantastic, and I think totally necessary, but we should think about how to simplify this in some way. Could we bootstrap it? Or, because of the differing OSes, maybe we just have 2 top-level procedures, 1 for single-node deployment and another for multi-node, each with a setup procedure, or just link to the docs that already call out these examples? It would be great if we could direct users to one place for a baseline deployment and then link off to all the config details.

Improve CI/CD to deploy Github Pages and enable conman to work with the software deployment

When I try to run conman, it complains about file size:

[INFO] Building tar: /private/tmp/pds-app-registry/target/pds-app-registry-0.3.0-bin.tar.gz
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:12 min
[INFO] Finished at: 2020-04-11T11:22:25-07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:3.1.1:single (bin-release) on project pds-app-registry: Execution bin-release of goal org.apache.maven.plugins:maven-assembly-plugin:3.1.1:single failed: user id '408967207' is too big ( > 2097151 ). -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
ERROR: Maven build failed. Check output
ERROR: /Users/jpadams/Documents/proj/pds/pdsen/workspace/conman/scripts/build-java.sh did not complete as expected. check output.

This is probably fine because the only thing it really needs to do from build-java.sh is the POM versioning.

In addition to that issue, we need to update our CI/CD to include GitHub Pages.

Develop PDS4 Attribute Search capability

  • EN-specific Search Service capability
    • Dynamic schema / field generation from Dictionaries
      • Tool can be something EN runs to consistently output schemas
      • But also needs to be something a node can run dynamically
      • Solr Schema
        • Parse out XPath:value as JSON
          • /path/to/namespace:Class/namespace:attribute
        • Convert to dot notation based upon dictionary JSON output
          • /path/to/Class/attribute => namespace.Class.namespace.attribute
        • dynamic cart.Bounding_Coordinates.cart.west_bounding_coordinate
      • Class.attribute
      • Filter.filter_name

Design and document Registry cluster expansion approach


Applicable requirements
🦄 #TBD
