ble-lter / metaegress Goto Github PK

R package to create Ecological Metadata Language documents from an instance of LTER-core-metabase database schema

Home Page: https://BLE-LTER.github.io/MetaEgress/

R 53.33% PLpgSQL 46.62% Rez 0.05%

r metadata lter postgresql r-package xml eml-metadata eml eml-files

metaegress's Issues

make taxonomy processing more robust, set when to rely on EML and when not to

The taxonomic capabilities of metabase and MetaEgress have been untested so far AFAIK. They can be reviewed and revised for robustness in anticipation of BLE LTER needing them. In addition, EML moved away from using taxize for its set_taxonomicCoverage function, instead using taxadb, which supports fewer taxonomic providers and notably doesn't support WoRMS.

MetaEgress needs to decide:

how to support providers not supported by taxadb
how to support multiple providers per set of supplied taxonomies

Ideas:

use EDI's taxonomyCleanr
develop different handling for different scenarios

fix userid

to be even with core-mb:

remove concatenation of "orcid.org" to user IDs. paste in the whole thing
use userid_directory in view

replace use of subset() throughout package with standard subsetting functions

According to subset() documentation and general R programming resources, subset() is better used interactively than in programs. This is due to its non-standard evaluation (need to understand more what it means).

Replace use of subset() in the package with [ and [[. $ is also not ideal, as it allows partial matching.
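A minimal sketch of what the replacement could look like, using a toy data frame with made-up column names (not the actual metabase views). It also shows the partial matching behaviour of $ mentioned above.

```r
# Toy data frame standing in for a metadata table (hypothetical columns).
attrs <- data.frame(
  datasetid = c(1, 1, 2),
  attribute_name = c("site", "depth", "site"),
  stringsAsFactors = FALSE
)

# Interactive style, relies on non-standard evaluation:
# subset(attrs, datasetid == 1, select = attribute_name)

# Programmatic equivalent with [ and [[; column names are plain strings.
# which() drops NA comparisons, as subset() does.
keep <- which(attrs[["datasetid"]] == 1)
attrs_1 <- attrs[keep, "attribute_name", drop = FALSE]

# $ partially matches column names, [[ requires an exact name.
partial <- attrs$attribute      # silently returns the attribute_name column
exact <- attrs[["attribute"]]   # returns NULL
```

Because the column names are ordinary strings here, they can also be passed in as function arguments, which is the main reason [ and [[ are safer inside package code.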

add taxonomic coverage

add maintenance

part of #21

add option to create EML from big entities quickly

create_entity() takes a bit of time to process larger files. this is due to several tasks that parse the files themselves:

calculate checksums and file sizes
checks for attribute congruence (e.g. has to parse data table to see if codes defined in metadata are present in data and vice versa)

we need an option to bypass these if user wants to spit out EML quickly
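One possible shape for such an option, as a sketch only. The function name describe_file and the quick argument are hypothetical, and MD5 via tools::md5sum is an assumption about which checksum is wanted.

```r
# Hypothetical helper: returns file size and checksum, or NA placeholders
# when the caller asks for a quick run that never opens the file.
describe_file <- function(path, quick = FALSE) {
  if (quick) {
    return(list(size = NA_real_, checksum = NA_character_))
  }
  list(
    size = file.size(path),                 # size in bytes
    checksum = unname(tools::md5sum(path))  # MD5 hash as a hex string
  )
}

# Usage with a small temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), tmp)
full <- describe_file(tmp)
fast <- describe_file(tmp, quick = TRUE)
```

The same flag could guard the attribute congruence checks, since they are the other step that has to parse the data file.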

Bring this into BLE's GitHub organization

...please?

🙏 😄

handle missing VIEWs

users might have a different set of views than get_meta expects. we need the function to fail more gracefully than it does now

allow mysql views as input

see lter/LTER-core-metabase#46

should be easy enough modification to get_meta

better locating of flat files

right now flat files need to be placed in same directory as the current R working directory.

add an argument to let user specify a directory in which to find files (abstract + method documents, data files). Default to current working directory.
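A sketch of how the lookup could work. locate_file and file_dir are hypothetical names, not existing MetaEgress arguments.

```r
# Hypothetical helper: resolve a file name against a user-supplied directory,
# defaulting to the current working directory.
locate_file <- function(filename, file_dir = getwd()) {
  path <- file.path(file_dir, filename)
  if (!file.exists(path)) {
    stop("File not found: ", path, call. = FALSE)
  }
  path
}

# Usage: a data file placed in a temporary directory.
data_dir <- tempdir()
writeLines("a,b", file.path(data_dir, "example_table.csv"))
found <- locate_file("example_table.csv", file_dir = data_dir)
```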

rewrite methods section according to new core-mb design

see lter/LTER-core-metabase#60

write one-stop-shop function

The one function to do them all. About a dozen arguments.
The component functions should still be exposed.

improve message when codes in data and metadata don't match

Example: add extra codes in metadata that aren't in data, such as having type.NA in both missing values and enumerations tables. The message is currently "mismatched codes," but instead we should list the codes that weren't in metadata, and that weren't in data.
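A sketch of a more informative message built with setdiff(). The function name and arguments are hypothetical.

```r
# Hypothetical helper: report which codes are missing on each side.
describe_code_mismatch <- function(data_codes, meta_codes, attribute, entity) {
  not_in_meta <- setdiff(data_codes, meta_codes)
  not_in_data <- setdiff(meta_codes, data_codes)
  if (length(not_in_meta) == 0 && length(not_in_data) == 0) {
    return(NULL)  # nothing to report
  }
  show <- function(x) if (length(x) > 0) toString(x) else "none"
  paste0(
    "Mismatched codes for attribute '", attribute,
    "' in entity '", entity, "'. ",
    "In data but not in metadata: ", show(not_in_meta), ". ",
    "In metadata but not in data: ", show(not_in_data), "."
  )
}

msg <- describe_code_mismatch(
  data_codes = c("A", "B", "C"),
  meta_codes = c("A", "B", "NA"),
  attribute = "type",
  entity = "example_table"
)
```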

bring to even with latest lter-core-mb functionalities

implement congruency checks between data & metadata

the ones not yet in ECC

attribute order match between data header and EML
attribute names match between data header and EML
attribute missing codes and enumeration match between data and EML

need option of textFormat

@gremau

On Fri, Apr 2, 2021, 10:58 Gregory E. Maurer wrote:
Hi An,

I'm not sure if this is a metabase or a MetaEgress issue - perhaps it's a limitation of MetaEgress. The issue I was having occurred when I used MetaEgress to make EML for a data package that includes an otherEntity that is a text file. When I did this with MetaEgress, the EML wouldn't validate, giving this error:

[1] FALSE
attr(,"errors")
[1] "Element 'dataFormat': Missing child element(s). Expected is one of ( textFormat, externallyDefinedFormat, binaryRasterFormat )."

I was able to work around this by putting 'textFormat', or any other string, into the output of create_entity_all (tables_pkg), like this:

tables_pkg$other_entities[[1]]$physical$dataFormat$externallyDefinedFormat$formatName <- 'textFormat'

Or by adding 'textFormat' to JRN Metabase in EMLFileTypes.externallyDefinedFormat_formatName.

In any case - doing one of these lets me make valid EML* - but read on for more thoughts about whether MetaEgress needs something.

I guess the root of the problem was that when you have an otherEntity that has a dataFormat of "externallyDefinedFormat" EML validation expects a value for "formatName". However, otherEntity I am adding is just a free text file, and I have described this type of entity in the EMLFileType table without any externallyDefinedFormat_formatName because I didn't think it really was an externallyDefinedFormat. MetaEgress seems to classify all otherEntities as externallyDefinedFormat instead of other options of textFormat or binaryRaster (if I'm reading your code right, and reading the EML schema right). So I needed to manually assign EMLFileTypes.externallyDefinedFormat_formatName to my custom textFormat EML FileType in my metabase, or insert that value into the list manually before making EML.

Not sure if this will make sense without looking over things - happy to meet with you, and if you think this means there is an enhancement needed in MetaEgress I could help with that. I could also be misreading code and the EML schema a bit - let me know if you think that is the case. It's kind of tricky to sort out the mapping between the EML Schema, a metabase, and the EML datatypes in MetaEgress.

Greg

write unit tests

via testthat package

hyperlink in word document

If the Word document (e.g. abstract or methods) has a hyperlink, the EML will be invalid. The code could either remove the hyperlink from the Word document or convert it into another format that EML accepts.

multiple missing/unexpected views generate multiple warning messages

condense into one message with comma-separated names of missing/unexpected views
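A sketch of the condensed check. check_views is a hypothetical name, and apart from vw_eml_creator the view names in the usage are placeholders.

```r
# Hypothetical helper: compare the views found in the database with the
# expected ones and raise a single warning covering both kinds of mismatch.
check_views <- function(views_found, views_expected) {
  missing_views <- setdiff(views_expected, views_found)
  unexpected_views <- setdiff(views_found, views_expected)
  parts <- character(0)
  if (length(missing_views) > 0) {
    parts <- c(parts, paste("missing views:", toString(missing_views)))
  }
  if (length(unexpected_views) > 0) {
    parts <- c(parts, paste("unexpected views:", toString(unexpected_views)))
  }
  if (length(parts) > 0) {
    warning(paste(parts, collapse = "; "), call. = FALSE)
  }
  invisible(list(missing = missing_views, unexpected = unexpected_views))
}

# Usage: capture the single warning message instead of printing it.
warning_text <- tryCatch(
  check_views(
    views_found = c("vw_eml_creator", "vw_extra_a", "vw_extra_b"),
    views_expected = c("vw_eml_creator", "vw_needed_a", "vw_needed_b")
  ),
  warning = function(w) conditionMessage(w)
)
```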

handle multiple user IDs

current vw_eml_creator will generate multiple rows per creator if more than one userId is present. Current MetaEgress handling will generate more than one tag per person differing only in userIds if this happens.

Proposed handling: add nameID to view, loop through each unique instance of nameID and assign multiple userIds.
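The proposed handling could be sketched as follows. The column names and ID values are placeholders, and the output list is a simplified stand-in for the real EML creator structure.

```r
# Toy query result: one row per (person, userId) pair, as the view would
# return once nameID is added. All values are placeholders.
creators <- data.frame(
  nameID = c("person_a", "person_a", "person_b"),
  surname = c("Alpha", "Alpha", "Beta"),
  userid = c("id-1", "id-2", "id-3"),
  stringsAsFactors = FALSE
)

# Group rows by nameID, then build one entry per person that carries
# every userId belonging to that person.
rows_by_person <- split(creators, creators[["nameID"]])
people <- lapply(rows_by_person, function(rows) {
  list(
    surname = rows[["surname"]][1],
    userId = as.list(unique(rows[["userid"]]))
  )
})
```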

create semantic annotation

Now that core-mb has SA tables:

steps to generate <annotations> elements

assign IDs to all attributes and datasets
inject annotation to attributeList after EML::set_attributes

allow creation of EML only

additional argument in entity creation to not require the flat files

allow option to pass in filesize and cksum from metabase

instead of reading in from file

this by a roundabout way is related to #14

add EML generation info to additional metadata?

some items we might need from old EML docs:

datetime of query from MB
version of MB queried
version of MetaEgress used to generate EML
version of EML R pkg used
datetime EML was generated
datetime EML doc was serialized (different from above because you can totally generate the list structure in R way before writing it to .xml file)
datetime EML was last modified. listing here for completeness, in practice no way in ME you can do this, and you'd look at the file system date modified anyway
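A sketch of how some of these items could be collected in R. The function names and the shape of the returned list are hypothetical; the metabase version would still have to come from the database itself.

```r
# Hypothetical helper: installed version of a package, or NA if not installed.
pkg_version <- function(pkg) {
  tryCatch(
    as.character(utils::packageVersion(pkg)),
    error = function(e) NA_character_
  )
}

# Hypothetical helper: gather generation info into a plain list.
generation_info <- function(queried_at, pkgs = c("MetaEgress", "EML")) {
  stamp <- function(t) format(t, "%Y-%m-%dT%H:%M:%S%z")
  list(
    metabase_queried = stamp(queried_at),   # recorded when the query ran
    eml_generated = stamp(Sys.time()),      # when the list structure was built
    package_versions = vapply(pkgs, pkg_version, character(1))
  )
}

info <- generation_info(queried_at = Sys.time())
```

The serialization datetime would need a second timestamp taken at the point where the .xml file is actually written.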

in get_meta.r, consider separating configurable sections from main function

An - this is great! and so easy to read! I have a suggestion:
some of the components in the main function seem like they could be set outside of the function and then passed in. In particular:
con <-dbConnect( ...
views_expected <- c(...
names_short <- c( ...

I think they would be easier to keep up to date that way.

apostrophe messes up row count when computing number of records

An apostrophe in a CSV cell value in a data table, e.g., "I've had it up to here with bugs", messes up the row count when computing number of records in create_entity. I found that it added an extra record in my case. Perhaps rewrite the function so it has a more robust row counting scheme. RStudio shows the correct number of rows in the data frame, so I know it's possible.
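One way to count records that is not thrown off by apostrophes, as a sketch. It assumes a comma-separated file with a single header row, and it is not necessarily how create_entity counts rows today. The key point is restricting the quote character to the double quote, since several base R readers treat both ' and " as quote characters by default.

```r
# Small CSV with an apostrophe inside a quoted cell.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,comment",
  "1,\"I've had it up to here with bugs\"",
  "2,fine",
  "3,\"also fine\""
), tmp)

# Count fields per line, treating only the double quote as a quote character
# and disabling comment handling. Lines inside multi-line fields come back
# as NA, so they are excluded before subtracting the header row.
fields <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")
n_records <- sum(!is.na(fields)) - 1L

# Cross-check against read.csv, which also uses the double quote only.
n_check <- nrow(read.csv(tmp))
```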

add documentation via roxygen2

make call to PASTA/EDI API to generate provenance from packageId

Customized unit

The current code doesn't work for customized units. For example, if you put in a unit such as "milligramPerGramPerHourPerPhotonFlux", the create_entity function fails.

I run the code on Mac.

handle altitudes

current VIEWs read in altitudes but do not insert them into EML

check for expected boilerplate items

might be built into create_EML or might be its own separate function. need to see how boilerplate items are handled in metabase.

implement the `SubtractFromID` idea

see lter/LTER-core-metabase#84

look into using dplyr to query directly from database (skip VIEWs)

dplyr can connect to PostgreSQL directly and query from a database. in other words, there's no need for the mb2eml_r schema because we can construct dataframes equivalent to mb2eml_r VIEWs directly from within R. This is different than our current approach.

Potential gains:

less infrastructure to maintain in metabase
more mobile: right now if we want to change the query for a VIEW or add one, metabase needs to be updated and all our users need to update their core-mb installation. Not ideal once core-mb has been distributed.

Might be some upfront work, but there is no need to revamp anything once metadata has been queried in. We can reuse the structure we have now.

@lkuiucsb what do you think?

additional metadata

If a data package has only "otherEntity", there won't be "additional metadata". The current R code would generate an EML node with "NULL" in the additional metadata. The code should be able to remove the node when there is no additional metadata.
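A sketch of dropping empty top-level elements before serializing. The list below is a simplified stand-in for the real EML list structure, and drop_empty is a hypothetical helper.

```r
# Hypothetical helper: remove NULL or zero-length elements from a list.
drop_empty <- function(x) {
  is_empty <- vapply(
    x,
    function(el) is.null(el) || length(el) == 0,
    logical(1)
  )
  x[!is_empty]
}

# Simplified stand-in for an EML list with no additional metadata.
eml_list <- list(
  dataset = list(title = "Example title"),
  additionalMetadata = NULL
)
cleaned <- drop_empty(eml_list)
```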

EDIutils 0.0.0.9000

Hi @atn38!

EDIutils has undergone a major refactor for submission to rOpenSci and CRAN. This new and improved version covers the full data repository REST API, handles authentication more securely, better matches API call and result syntax, improves documentation, and opens the door for development of wrapper functions to support common data management tasks. In the process of this refactor the function names and call patterns have changed and several functions supporting other EDI R packages have been removed, thereby creating back compatibility breaking changes with the previous major release (version 1.6.1). The previous version will be available until 2022-06-01 on the deprecated branch. Install the previous version with:

remotes::install_github("EDIorg/EDIutils", ref = "deprecated")

EDIutils functions used in your code and suggested replacements

Replace api_get_provenance_metadata() with get_provenance_metadata().

Please let me know if you have any questions,
@clnsmth

handle empty query results

EML R package will not create elements if it is supplied NAs or NULL elements, but will fail silently with an entirely empty data frame.
possible solution: insert fake rows with user-specified datasetid and NAs in all other columns if query to metabase returns empty.
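The possible solution could be sketched like this. pad_empty_result and the id_col argument are hypothetical, and the sketch assumes the empty result still carries its column names and types.

```r
# Hypothetical helper: if a query result has zero rows, return one row with
# the requested dataset ID and NA in every other column.
pad_empty_result <- function(df, dataset_id, id_col = "datasetid") {
  if (nrow(df) > 0) {
    return(df)
  }
  # Indexing a zero-length column with NA yields one NA of the same type.
  out <- as.data.frame(
    lapply(df, function(col) col[NA_integer_]),
    stringsAsFactors = FALSE
  )
  names(out) <- names(df)
  out[[id_col]] <- dataset_id
  out
}

# Usage with an empty result that has two columns.
empty_result <- data.frame(
  datasetid = integer(0),
  keyword = character(0),
  stringsAsFactors = FALSE
)
padded <- pad_empty_result(empty_result, dataset_id = 99L)
```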

software and instrument

For the data package that has the instrument or software, the current code was not able to generate the corresponding xml nodes.

Unexpected results and invalid EML with "expand_taxa=FALSE" in create_EML

I recently updated MetaEgress and updated my workflow to use the "expand_taxa" and "skip_taxa" arguments to the create_EML function. When I set "expand_taxa=TRUE" my taxa are expanded into a nice tree in the resulting EML. When I set "expand_taxa=FALSE" an invalid EML document is produced. No taxa expansion happens (as expected), but there are some elements in the resulting <taxonomicClassification> element that won't validate (I think <commonname> is the problem but not sure). One good thing about "expand_taxa=FALSE" is that there is a <taxonId> element with the provider="https://itis.gov" attribute. This element does not appear with "expand_taxa=TRUE" as I was originally expecting.

It seems that "expand_taxa=FALSE" should still give valid EML with a taxonomicCoverage element, but I'm not sure where things are going wrong. Let me know if anyone has thoughts on how to correct this. 2 EML documents are attached (=FALSE and =TRUE cases)

knb-lter-jrn.210121001.62_expandfalse.xml.txt

knb-lter-jrn.210121001.62_expandtrue.xml.txt

check for edge NA cases thoroughly

what this looks like is a bunch of if/else statements to check for NAs in query results and assign NULLs to EML list items.

consider making an util function to call
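Such a util function could be as small as the sketch below. The name na_to_null is hypothetical.

```r
# Hypothetical util: turn NULL, empty, or all-NA values into NULL,
# and pass everything else through unchanged.
na_to_null <- function(x) {
  if (is.null(x) || length(x) == 0 || all(is.na(x))) NULL else x
}

# Usage when filling list items from a query result row.
row <- list(title = "Example title", shortName = NA_character_)
item <- list(
  title = na_to_null(row$title),
  shortName = na_to_null(row$shortName)
)
```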

handle different text type in abstract and method descriptions: markdown/docbook/plaintext/read from file

handle check_attributes_congruence length mismatch error

In the block below, we should see a message if the metadata doesn't match the data. However, if the number of unique values in data does not match the number of enumeration codes, we get an unhandled error instead.

      if (!all(unique(entity_df[[i]]) %in% c(cats, codes) | c(cats, codes) %in% unique(entity_df[[i]]))) {
        msg <- paste(
          "Enumeration in attribute",
          i,
          "in metadata not matching that in data for entity",
          entity_name
        )
        output_msgs <- c(output_msgs, msg)
      }

The error reads:

  longer object length is not a multiple of shorter object length

To handle the error, we should first check the lengths. Whether there's a length mismatch or some other mismatch, we should also at least report which attribute and table we were checking.
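A sketch of a check that avoids the length problem. In the original condition, | combines two logical vectors of different lengths; testing each direction with its own all() sidesteps that. check_enumeration is a hypothetical name, and its codes argument stands for c(cats, codes) in the block above.

```r
# Hypothetical replacement for the condition in the block above.
check_enumeration <- function(values, codes, attribute, entity) {
  data_values <- unique(values)
  # Each direction is reduced to a single TRUE/FALSE before combining,
  # so the two vectors never need to have matching lengths.
  matches <- all(data_values %in% codes) && all(codes %in% data_values)
  if (matches) {
    return(NULL)
  }
  paste(
    "Enumeration in attribute", attribute,
    "in metadata not matching that in data for entity", entity
  )
}

# Three unique values in data against two codes in metadata.
msg <- check_enumeration(
  values = c("a", "b", "c", "a"),
  codes = c("a", "b"),
  attribute = "type",
  entity = "example_table"
)
```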

refactor create_entity() to smartly handle entity types

create_entity now handles only two entity types: dataTable and otherEntity. Users also have to jot down which entity numbers correspond to which type and pass them in as arguments accordingly.

A better way to handle this would be to make create_entity smarter:

remove entity numbers as arguments
detect entityType from within the function
return a list of lists, grouped by entityType

create_EML would also need to be smarter in conjunction:

only take one entities argument instead of users having to specify separate objects for each type
smartly detect which entityTypes are present in single list output from create_entity and insert into appropriate EML slots

optional features:

include exclude_entity argument in create_entity to allow exclusion of entities if desired

make sure package works in base R console

not RStudio-dependent.

Invalid unmatched <additionalMetadata> produced at end of some EML documents

I find this: <additionalMetadata/>

At the end of some of the EML documents I create using MetaEgress, but not with all. This tag is not matched by any other additionalMetadata elements in the document and is invalid. The pattern I see is that it only happens when I make EML that has all otherEntities, and no dataTables. Can't explain why yet but will investigate.

congruence check throws error if >1 missing codes per attribute

this is due to an if statement that throws an error when its condition has more than one element
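A sketch of the usual fix: reduce the condition to a single TRUE/FALSE with any() or all() before it reaches if. The variable names and values are made up. Since R 4.2.0 a condition of length greater than one is an error rather than a warning.

```r
# Two missing value codes defined for one attribute (placeholder values).
missing_codes <- c("NA", "-9999")
values <- c("1.2", "-9999", "3.4")

# if (missing_codes %in% values) { ... }  # condition has length 2, so it fails

# Collapse to a single logical first.
has_missing_code <- any(missing_codes %in% values)   # at least one code is used
all_codes_present <- all(missing_codes %in% values)  # every code is used

if (has_missing_code) {
  status <- "at least one missing code appears in data"
} else {
  status <- "no missing codes appear in data"
}
```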
