ble-lter / metaegress

R package to create Ecological Metadata Language documents from an instance of LTER-core-metabase database schema

Home Page: https://BLE-LTER.github.io/MetaEgress/

R 53.33% PLpgSQL 46.62% Rez 0.05%
r metadata lter postgresql r-package xml eml-metadata eml eml-files

metaegress's Issues

make taxonomy processing more robust, set when to rely on EML and when not to

The taxonomic capabilities of metabase and MetaEgress have been untested so far AFAIK. They can be reviewed and revised for robustness in anticipation of BLE LTER needing them. In addition, the EML R package moved away from using taxize for its set_taxonomicCoverage function, instead using taxadb, which supports fewer taxonomic providers and notably doesn't support WoRMS.

MetaEgress needs to decide:

  • how to support providers not supported by taxadb
  • how to support multiple providers per set of supplied taxonomies

Ideas:

  • use EDI's taxonomyCleanr
  • develop different handling for different scenarios

fix userid

To stay consistent with core-mb:

  • stop concatenating "orcid.org" onto user IDs; paste in the whole ID as stored
  • use userid_directory in view

replace use of subset() throughout package with standard subsetting functions

According to subset() documentation and general R programming resources, subset() is better used interactively than in programs. This is due to its non-standard evaluation (need to understand more what it means).

Replace use of subset() in the package with [ and [[. $ is also not ideal, as it allows partial matching.
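
A minimal sketch of what the replacement could look like, using a made-up data frame (`entities` and its column names are illustrative and not taken from the package):

```r
# Hypothetical data frame standing in for a metadata query result.
entities <- data.frame(
  datasetid = c(1, 1, 2),
  entityname = c("a.csv", "b.csv", "c.csv"),
  stringsAsFactors = FALSE
)

# Interactive style, relying on non-standard evaluation:
# subset(entities, datasetid == 1, select = entityname)

# Programmatic equivalent with standard subsetting.
# which() drops NA comparisons, as subset() does.
rows <- which(entities[["datasetid"]] == 1)
picked <- entities[rows, "entityname", drop = FALSE]

# [[ needs the exact column name, whereas $ would accept a partial match.
names_only <- entities[rows, , drop = FALSE][["entityname"]]
```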

add option to create EML from big entities quickly

create_entity() takes a bit of time to process larger files. this is due to several tasks that parse the files themselves:

  • calculate checksums and file sizes
  • checks for attribute congruence (e.g. has to parse data table to see if codes defined in metadata are present in data and vice versa)

we need an option to bypass these if user wants to spit out EML quickly
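
One way such a bypass might look, as a rough sketch only. The helper `describe_file()` and its `skip_checks` argument are hypothetical names, not existing MetaEgress code:

```r
# Hypothetical helper: returns file size and checksum unless told to skip.
describe_file <- function(path, skip_checks = FALSE) {
  if (skip_checks) {
    # Skip anything that has to read the file itself.
    return(list(size = NULL, checksum = NULL))
  }
  list(
    size = unname(file.size(path)),
    checksum = unname(tools::md5sum(path))
  )
}

# Usage with a small temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), tmp)
quick <- describe_file(tmp, skip_checks = TRUE)
full <- describe_file(tmp)
```

The same flag could guard the attribute congruence checks, since those also parse the data table.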

handle missing VIEWs

users might have a different set of views than get_meta expects. we need the function to fail more gracefully than it does now
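
A possible shape for a more graceful failure, assuming the function can get the names of the views that actually exist before querying. `check_views()` and both of its arguments are illustrative names, and `vw_eml_example` is a made-up view name:

```r
# Warn about expected views that are absent and return only the usable ones.
check_views <- function(views_expected, views_available) {
  missing_views <- setdiff(views_expected, views_available)
  if (length(missing_views) > 0) {
    warning(
      "Skipping views not found in the database: ",
      paste(missing_views, collapse = ", "),
      call. = FALSE
    )
  }
  intersect(views_expected, views_available)
}

usable <- suppressWarnings(
  check_views(
    views_expected = c("vw_eml_creator", "vw_eml_example"),
    views_available = c("vw_eml_creator")
  )
)
```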

better locating of flat files

right now flat files need to be placed in same directory as the current R working directory.

add an argument to let user specify a directory in which to find files (abstract + method documents, data files). Default to current working directory.
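
A sketch of how the argument could behave. `locate_file()` and `file_dir` are hypothetical names chosen for illustration:

```r
# Resolve a file name against a user-supplied directory,
# defaulting to the current working directory.
locate_file <- function(filename, file_dir = getwd()) {
  path <- file.path(file_dir, filename)
  if (!file.exists(path)) {
    stop("File not found: ", path, call. = FALSE)
  }
  path
}

# Usage with a temporary directory standing in for the data folder.
data_dir <- tempfile("flatfiles")
dir.create(data_dir)
writeLines("placeholder abstract", file.path(data_dir, "abstract.txt"))
found <- locate_file("abstract.txt", file_dir = data_dir)
```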

improve message when codes in data and metadata don't match

Example: add extra codes in metadata that aren't in data, such as having type.NA in both missing values and enumerations tables. The message is currently "mismatched codes," but instead we should list the codes that weren't in metadata, and that weren't in data.
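
A small sketch of a more informative message, built with `setdiff()` in both directions. The function name `describe_code_mismatch()` is an assumption for illustration:

```r
# Build a message listing which codes are missing on each side,
# or return NULL when the two sets agree.
describe_code_mismatch <- function(data_codes, metadata_codes) {
  data_codes <- unique(as.character(data_codes))
  metadata_codes <- unique(as.character(metadata_codes))
  only_data <- setdiff(data_codes, metadata_codes)
  only_meta <- setdiff(metadata_codes, data_codes)
  if (length(only_data) == 0 && length(only_meta) == 0) {
    return(NULL)
  }
  paste0(
    "Mismatched codes. In data but not metadata: ",
    if (length(only_data)) paste(only_data, collapse = ", ") else "(none)",
    ". In metadata but not data: ",
    if (length(only_meta)) paste(only_meta, collapse = ", ") else "(none)",
    "."
  )
}

msg <- describe_code_mismatch(c("A", "B", "D"), c("A", "B", "C"))
```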

need option of textFormat

@gremau

On Fri, Apr 2, 2021, 10:58 Gregory E. Maurer wrote:
Hi An,

I'm not sure if this is a metabase or a MetaEgress issue - perhaps it's a limitation of MetaEgress. The issue I was having occurred when I used MetaEgress to make EML for a data package that includes an otherEntity that is a text file. When I did this with MetaEgress, the EML wouldn't validate, giving this error:

[1] FALSE
attr(,"errors")
[1] "Element 'dataFormat': Missing child element(s). Expected is one of ( textFormat, externallyDefinedFormat, binaryRasterFormat )."

I was able to work around this by putting 'textFormat', or any other string, into the output of create_entity_all (tables_pkg), like this:

tables_pkg$other_entities[[1]]$physical$dataFormat$externallyDefinedFormat$formatName <- 'textFormat'

Or by adding 'textFormat' to JRN Metabase in EMLFileTypes.externallyDefinedFormat_formatName.

In any case - doing one of these lets me make valid EML - but read on for more thoughts about whether MetaEgress needs something.

I guess the root of the problem was that when you have an otherEntity that has a dataFormat of "externallyDefinedFormat", EML validation expects a value for "formatName". However, the otherEntity I am adding is just a free text file, and I have described this type of entity in the EMLFileTypes table without any externallyDefinedFormat_formatName because I didn't think it really was an externallyDefinedFormat. MetaEgress seems to classify all otherEntities as externallyDefinedFormat instead of the other options of textFormat or binaryRasterFormat (if I'm reading your code right, and reading the EML schema right). So I needed to manually assign EMLFileTypes.externallyDefinedFormat_formatName to my custom textFormat EML file type in my metabase, or insert that value into the list manually before making EML.

Not sure if this will make sense without looking over things - happy to meet with you, and if you think this means there is an enhancement needed in MetaEgress I could help with that. I could also be misreading the code and the EML schema a bit - let me know if you think that is the case. It's kind of tricky to sort out the mapping between the EML schema, a metabase, and the EML datatypes in MetaEgress.

Greg

hyperlink in word document

If a Word document (e.g. abstract or methods) contains a hyperlink, the resulting EML will be invalid. The code could either strip the hyperlink from the Word document content or convert it into a format that EML accepts.

handle multiple user IDs

current vw_eml_creator will generate multiple rows per creator if more than one userId is present. Current MetaEgress handling will generate more than one tag per person differing only in userIds if this happens.

Proposed handling: add nameID to view, loop through each unique instance of nameID and assign multiple userIds.
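
The proposed loop might look roughly like this. The data frame below is invented, its column names (`nameid`, `surname`, `userid`) are assumptions about what the view would return, and the list layout is only a stand-in for whatever structure the EML package expects:

```r
# Hypothetical view output: one row per creator per userId.
creators <- data.frame(
  nameid = c("jdoe", "jdoe", "asmith"),
  surname = c("Doe", "Doe", "Smith"),
  userid = c("placeholder-id-1", "placeholder-id-2", "placeholder-id-3"),
  stringsAsFactors = FALSE
)

# One list entry per unique nameid, carrying all of that person's userIds.
creator_list <- lapply(unique(creators[["nameid"]]), function(id) {
  rows <- creators[creators[["nameid"]] == id, , drop = FALSE]
  list(
    individualName = list(surName = rows[["surname"]][1]),
    userId = as.list(rows[["userid"]])
  )
})
```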

create semantic annotation

Now that core-mb has SA tables:

steps to generate <annotations> elements

  • assign IDs to all attributes and datasets
  • inject annotation to attributeList after EML::set_attributes

add EML generation info to additional metadata?

some items we might need from old EML docs:

  • datetime of query from MB
  • version of MB queried
  • version of MetaEgress used to generate EML
  • version of EML R pkg used
  • datetime EML was generated
  • datetime EML doc was serialized (different from above because you can totally generate the list structure in R way before writing it to .xml file)
  • datetime EML was last modified. listing here for completeness, in practice no way in ME you can do this, and you'd look at the file system date modified anyway
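
A sketch of collecting some of these items at generation time. The list name and field names are made up for illustration, and the database-side items (query time, metabase version) are left out because they depend on the connection:

```r
# Return an installed package's version as text, or NA if it is not installed.
pkg_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

generation_info <- list(
  eml_generated = format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z"),
  metaegress_version = pkg_version("MetaEgress"),
  eml_pkg_version = pkg_version("EML")
)
```

The serialization datetime would need to be captured separately, at the point the list is written to .xml.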

in get_meta.r, consider separating configurable sections from main function

An - this is great! and so easy to read! I have a suggestion:
some of the components in the main function seem like they could be set outside of the function and then passed in. In particular:
con <-dbConnect( ...
views_expected <- c(...
names_short <- c( ...

I think they would be easier to keep up to date that way.

apostrophe messes up row count when computing number of records

An apostrophe in a CSV cell value in a data table, e.g., "I've had it up to here with bugs", messes up the row count when computing number of records in create_entity. I found that it added an extra record in my case. Perhaps rewrite the function so it has a more robust row counting scheme. RStudio shows the correct number of rows in the data frame, so I know it's possible.
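
How create_entity currently counts rows is not shown here, so the following is only one possible approach: parse the file with the double quote as the sole quoting character and count the rows of the resulting data frame, so that apostrophes are treated as ordinary text.

```r
# Write a small CSV containing an apostrophe inside a quoted cell.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,comment",
    "1,\"I've had it up to here with bugs\"",
    "2,fine"
  ),
  csv_path
)

# Only the double quote is a quoting character, so the apostrophe is plain text.
parsed <- utils::read.csv(csv_path, quote = "\"", stringsAsFactors = FALSE)
n_records <- nrow(parsed)
```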

Customized unit

The current code doesn't work for custom units. For example, if you put in a unit such as "milligramPerGramPerHourPerPhotonFlux", the create_entity function fails.

I run the code on Mac.

handle altitudes

current VIEWs read in altitudes but do not insert them into EML

look into using dplyr to query directly from database (skip VIEWs)

dplyr can connect to PostgreSQL directly and query from a database. in other words, there's no need for the mb2eml_r schema because we can construct dataframes equivalent to mb2eml_r VIEWs directly from within R. This is different than our current approach.

Potential gains:

  • less infrastructure to maintain in metabase
  • more mobile: right now if we want to change the query for a VIEW or add one, metabase needs to be updated and all our users need to update their core-mb installation. Not ideal once core-mb has been distributed.

Might be some upfront work, but there is no need to revamp anything once metadata has been queried in. We can reuse the structure we have now.

@lkuiucsb what do you think?

additional metadata

If a data package has only "otherEntity" entities, there won't be any "additional metadata". The current R code then generates an EML node with "NULL" in the additional metadata. The code should remove the node when there is no additional metadata.
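
A minimal sketch of dropping empty top-level elements before the list is turned into EML. `drop_empty()` is a hypothetical helper and `eml_like` is a toy list, not the real structure MetaEgress builds:

```r
# Remove elements that are NULL or have length zero.
drop_empty <- function(x) {
  is_empty <- vapply(x, function(el) is.null(el) || length(el) == 0, logical(1))
  x[!is_empty]
}

eml_like <- list(
  dataset = list(title = "Example"),
  additionalMetadata = NULL
)
cleaned <- drop_empty(eml_like)
```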

EDIutils 0.0.0.9000

Hi @atn38!

EDIutils has undergone a major refactor for submission to rOpenSci and CRAN. This new and improved version covers the full data repository REST API, handles authentication more securely, better matches API call and result syntax, improves documentation, and opens the door for development of wrapper functions to support common data management tasks. In the process of this refactor the function names and call patterns have changed and several functions supporting other EDI R packages have been removed, thereby creating back compatibility breaking changes with the previous major release (version 1.6.1). The previous version will be available until 2022-06-01 on the deprecated branch. Install the previous version with:

remotes::install_github("EDIorg/EDIutils", ref = "deprecated")

EDIutils functions used in your code and suggested replacements

  • Replace api_get_provenance_metadata() with get_provenance_metadata().

Please let me know if you have any questions,
@clnsmth

handle empty query results

EML R package will not create elements if it is supplied NAs or NULL elements, but will fail silently with an entirely empty data frame.
possible solution: insert fake rows with user-specified datasetid and NAs in all other columns if query to metabase returns empty.
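
The possible solution could be sketched as follows. `pad_empty_result()` is a hypothetical name, and the sketch assumes the query result has at least one column and that the dataset ID column is called `datasetid`:

```r
# If a query result has no rows, return a single placeholder row with the
# requested datasetid and NA in every other column.
pad_empty_result <- function(df, datasetid) {
  if (nrow(df) > 0) {
    return(df)
  }
  # Indexing with NA yields a typed NA for each column.
  placeholder <- as.data.frame(
    lapply(df, function(col) col[NA_integer_]),
    stringsAsFactors = FALSE
  )
  placeholder[["datasetid"]] <- datasetid
  placeholder
}

empty_result <- data.frame(
  datasetid = integer(0),
  title = character(0),
  stringsAsFactors = FALSE
)
padded <- pad_empty_result(empty_result, datasetid = 99)
```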

software and instrument

For a data package that includes instrument or software information, the current code is not able to generate the corresponding XML nodes.

Unexpected results and invalid EML with "expand_taxa=FALSE" in create_EML

I recently updated MetaEgress and updated my workflow to use the "expand_taxa" and "skip_taxa" arguments to the create_EML function. When I set "expand_taxa=TRUE" my taxa are expanded into a nice tree in the resulting EML. When I set "expand_taxa=FALSE" an invalid EML document is produced. No taxa expansion happens (as expected), but there are some elements in the resulting <taxonomicClassification> element that won't validate (I think <commonname> is the problem but not sure). One good thing about "expand_taxa=FALSE" is that there is a <taxonId> element with the provider="https://itis.gov" attribute. This element does not appear with "expand_taxa=TRUE" as I was originally expecting.

It seems that "expand_taxa=FALSE" should still give valid EML with a taxonomicCoverage element, but I'm not sure where things are going wrong. Let me know if anyone has thoughts on how to correct this. 2 EML documents are attached (=FALSE and =TRUE cases)

knb-lter-jrn.210121001.62_expandfalse.xml.txt

knb-lter-jrn.210121001.62_expandtrue.xml.txt

check for edge NA cases thoroughly

what this looks like is a bunch of if/else statements to check for NAs in query results and assign NULLs to EML list items.

consider making an util function to call
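
Such a utility could be as small as the sketch below; `null_if_na()` is a suggested name, not an existing function:

```r
# Return NULL for NULL, zero-length, or all-NA input; otherwise return the value.
null_if_na <- function(x) {
  if (is.null(x) || length(x) == 0 || all(is.na(x))) {
    NULL
  } else {
    x
  }
}

# Usage: replaces a repeated if/else when filling list items.
title_value <- null_if_na("Example title")
missing_value <- null_if_na(NA_character_)
```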

handle check_attributes_congruence length mismatch error

In the block below, we should see a message if the metadata doesn't match the data. However, if the number of unique values in data does not match the number of enumeration codes, we get an unhandled error instead.

      if (!all(unique(entity_df[[i]]) %in% c(cats, codes) | c(cats, codes) %in% unique(entity_df[[i]]))) {
        msg <- paste(
          "Enumeration in attribute",
          i,
          "in metadata not matching that in data for entity",
          entity_name
        )
        output_msgs <- c(output_msgs, msg)
      }

The error reads:

  longer object length is not a multiple of shorter object length

To handle the error, we should first check the lengths. Whether there's a length mismatch or some other mismatch, we should also at least report which attribute and table we were checking.
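
A length-independent version of the check might look like this, assuming the intent is that the set of unique values in the data should equal the set of categories plus codes. `check_enumeration()` is a hypothetical wrapper; `cats`, `codes`, and `entity_name` follow the names in the block above:

```r
# Compare as sets so differing lengths never trigger vector recycling,
# and always name the attribute and entity in the message.
check_enumeration <- function(data_values, cats, codes, attribute, entity_name) {
  observed <- unique(data_values)
  declared <- unique(c(cats, codes))
  if (setequal(observed, declared)) {
    return(NULL)
  }
  paste0(
    "Enumeration in attribute ", attribute,
    " in metadata not matching that in data for entity ", entity_name,
    " (", length(observed), " unique values in data, ",
    length(declared), " codes in metadata)"
  )
}

ok <- check_enumeration(c("a", "b", "a"), cats = "a", codes = "b", "site", "table1")
bad <- check_enumeration(c("a", "b", "c"), cats = "a", codes = "b", "site", "table1")
```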

refactor create_entity() to smartly handle entity types

create_entity currently handles only two entity types: dataTable and otherEntity. Users also have to note which entity numbers correspond to which type and pass them in as arguments accordingly.

A better way to handle this would be to make create_entity smarter:

  • remove entity numbers as arguments
  • detect entityType from within the function
  • return a list of lists, grouped by entityType

create_EML would also need to be smarter in conjunction:

  • only take one entities argument instead of users having to specify separate objects for each type
  • smartly detect which entityTypes are present in single list output from create_entity and insert into appropriate EML slots

optional features:

  • include exclude_entity argument in create_entity to allow exclusion of entities if desired
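
The grouping step could be sketched like this. The data frame and its column names (`entity_position`, `entitytype`) are assumptions about the queried metadata, and `group_entities()` is a hypothetical stand-in for the relevant part of create_entity:

```r
# Group per-entity lists by entity type, optionally excluding some entities.
group_entities <- function(entities, exclude_entity = integer(0)) {
  keep <- !entities[["entity_position"]] %in% exclude_entity
  entities <- entities[keep, , drop = FALSE]
  entity_list <- lapply(seq_len(nrow(entities)), function(i) {
    as.list(entities[i, , drop = FALSE])
  })
  # split() works on lists and returns one sub-list per entity type.
  split(entity_list, entities[["entitytype"]])
}

entities <- data.frame(
  entity_position = 1:3,
  entitytype = c("dataTable", "otherEntity", "dataTable"),
  stringsAsFactors = FALSE
)
grouped <- group_entities(entities)
without_first <- group_entities(entities, exclude_entity = 1)
```

create_EML could then look at `names(grouped)` to decide which EML slots to fill.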

Invalid unmatched <additionalMetadata> produced at end of some EML documents

I find this: <additionalMetadata/>

At the end of some of the EML documents I create using MetaEgress, but not with all. This tag is not matched by any other additionalMetadata elements in the document and is invalid. The pattern I see is that it only happens when I make EML that has all otherEntities, and no dataTables. Can't explain why yet but will investigate.
