BLE-LTER / MetaEgress
R package to create Ecological Metadata Language documents from an instance of LTER-core-metabase database schema
Home Page: https://BLE-LTER.github.io/MetaEgress/
The taxonomic capabilities of metabase and MetaEgress are untested so far, AFAIK. They can be reviewed and revised for robustness in anticipation of BLE LTER needing them. In addition, EML moved away from using taxize for its set_taxonomicCoverage function, instead using taxadb, which supports fewer taxonomic providers and notably doesn't support WoRMS.
MetaEgress needs to decide:
Ideas:
taxonomyCleanr
to be even with core-mb":
According to the subset() documentation and general R programming resources, subset() is better used interactively than in programs. This is due to its non-standard evaluation (I need to understand better what that means).
Replace use of subset() in the package with [ and [[. $ is also not ideal, as it allows partial matching.
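As a rough illustration of the difference, here is a minimal base R sketch. The data frame and its column names are made up for the example and are not taken from the package.

```r
# Toy data frame; the column names are made up for this illustration.
entities <- data.frame(
  datasetid = c(1, 1, 2),
  entity_position = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Interactive style: subset() evaluates `datasetid` inside the data frame
# (non-standard evaluation), which is convenient but fragile inside functions.
by_subset <- subset(entities, datasetid == 1)

# Programmatic style: explicit logical indexing with [ and [[.
by_bracket <- entities[entities[["datasetid"]] == 1, , drop = FALSE]

# $ partially matches column names; [[ requires an exact match by default.
partial_dollar <- entities$dataset        # returns the datasetid column
partial_bracket <- entities[["dataset"]]  # NULL, because there is no exact match
```

With [[, a misspelled or abbreviated column name shows up as NULL right away instead of silently picking up a similarly named column.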
part of #21
create_entity() takes a bit of time to process larger files. This is due to several tasks that parse the files themselves:
We need an option to bypass these if the user wants to spit out EML quickly.
...please?
Users might have a different set of views than get_meta expects. We need the function to fail more gracefully than it does now.
See lter/LTER-core-metabase#46.
Should be an easy enough modification to get_meta.
Right now flat files need to be placed in the current R working directory.
Add an argument to let the user specify a directory in which to find files (abstract and method documents, data files). Default to the current working directory.
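One way this could look, as a minimal sketch. The helper name resolve_file and the argument name file_dir are assumptions for illustration, not existing MetaEgress API.

```r
# Hypothetical helper: resolve a file name against a user-supplied directory.
# `file_dir` is an assumed argument name; it defaults to the working directory.
resolve_file <- function(file_name, file_dir = getwd()) {
  path <- file.path(file_dir, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path, call. = FALSE)
  }
  path
}

# Usage: write a small file to a temporary directory and find it there.
demo_dir <- tempdir()
writeLines("Example abstract.", file.path(demo_dir, "abstract.1.txt"))
abstract_path <- resolve_file("abstract.1.txt", file_dir = demo_dir)
```

Defaulting to getwd() would keep existing workflows working unchanged.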
The one function to do them all. About a dozen arguments.
The component functions should still be exposed.
Example: add extra codes in metadata that aren't in data, such as having type.NA in both missing values and enumerations tables. The message is currently "mismatched codes," but instead we should list the codes that are in data but not in metadata, and the codes that are in metadata but not in data.
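A minimal sketch of such a message, using setdiff() in both directions. The function name and the wording of the message are hypothetical.

```r
# Hypothetical helper: describe how codes in metadata and data differ.
describe_code_mismatch <- function(data_values, metadata_codes) {
  data_values <- unique(data_values)
  not_in_metadata <- setdiff(data_values, metadata_codes)
  not_in_data <- setdiff(metadata_codes, data_values)
  if (length(not_in_metadata) == 0 && length(not_in_data) == 0) {
    return(NULL)  # nothing to report
  }
  paste0(
    "Codes in data but not in metadata: ",
    if (length(not_in_metadata) > 0) paste(not_in_metadata, collapse = ", ") else "(none)",
    "; codes in metadata but not in data: ",
    if (length(not_in_data) > 0) paste(not_in_data, collapse = ", ") else "(none)"
  )
}

# Toy values, made up for this illustration.
msg <- describe_code_mismatch(
  data_values = c("A", "B", "B", "X"),
  metadata_codes = c("A", "B", "NA")
)
```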
the ones not yet in ECC
On Fri, Apr 2, 2021, 10:58 Gregory E. Maurer wrote:
Hi An,
I'm not sure if this is a metabase or a MetaEgress issue - perhaps it's a limitation of MetaEgress. The issue I was having occurred when I used MetaEgress to make EML for a data package that includes an otherEntity that is a text file. When I did this with MetaEgress, the EML wouldn't validate, giving this error:
[1] FALSE
attr(,"errors")
[1] "Element 'dataFormat': Missing child element(s). Expected is one of ( textFormat, externallyDefinedFormat, binaryRasterFormat )."
I was able to work around this by putting 'textFormat', or any other string, into the output of create_entity_all (tables_pkg), like this:
tables_pkg$other_entities[[1]]$physical$dataFormat$externallyDefinedFormat$formatName <- 'textFormat'
Or by adding 'textFormat' to JRN Metabase in EMLFileTypes.externallyDefinedFormat_formatName.
I guess the root of the problem was that when you have an otherEntity that has a dataFormat of "externallyDefinedFormat", EML validation expects a value for "formatName". However, the otherEntity I am adding is just a free text file, and I have described this type of entity in the EMLFileType table without any externallyDefinedFormat_formatName because I didn't think it really was an externallyDefinedFormat. MetaEgress seems to classify all otherEntities as externallyDefinedFormat instead of the other options of textFormat or binaryRasterFormat (if I'm reading your code right, and reading the EML schema right). So I needed to manually assign EMLFileTypes.externallyDefinedFormat_formatName to my custom textFormat EML FileType in my metabase, or insert that value into the list manually before making EML.
Not sure if this will make sense without looking over things - happy to meet with you, and if you think this means there is an enhancement needed in MetaEgress I could help with that. I could also be misreading code and the EML schema a bit - let me know if you think that is the case. It's kind of tricky to sort out the mapping between the EML Schema, a metabase, and the EML datatypes in MetaEgress.
Greg
via testthat package
If the Word document (e.g. abstract or method) has a hyperlink, the EML will be invalid. The code could either remove the hyperlink from the Word document or convert it into another format that EML accepts.
condense into one message with comma-separated names of missing/unexpected views
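A minimal sketch of a condensed message. The function name check_views and the view names other than vw_eml_creator are hypothetical.

```r
# Hypothetical check: compare the views found in the database with the
# views the function expects, and build a single condensed message.
check_views <- function(views_found, views_expected) {
  missing_views <- setdiff(views_expected, views_found)
  unexpected_views <- setdiff(views_found, views_expected)
  parts <- character(0)
  if (length(missing_views) > 0) {
    parts <- c(parts, paste("missing views:", paste(missing_views, collapse = ", ")))
  }
  if (length(unexpected_views) > 0) {
    parts <- c(parts, paste("unexpected views:", paste(unexpected_views, collapse = ", ")))
  }
  if (length(parts) == 0) {
    return(NULL)  # views match, nothing to report
  }
  paste(parts, collapse = "; ")
}

# View names other than vw_eml_creator are illustrative only.
msg <- check_views(
  views_found = c("vw_eml_creator", "vw_custom"),
  views_expected = c("vw_eml_creator", "vw_eml_keyword")
)
```

The resulting string could be passed to a single warning() or stop() call rather than emitting one message per view.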
Current vw_eml_creator will generate multiple rows per creator if more than one userId is present. Current MetaEgress handling will generate more than one creator tag per person, differing only in userIds, if this happens.
Proposed handling: add nameID to the view, loop through each unique instance of nameID, and assign multiple userIds.
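A rough sketch of the proposed grouping in base R. The toy data frame stands in for the view output; all column names and values are assumptions made for the example.

```r
# Toy stand-in for rows returned by vw_eml_creator. All column names,
# including nameID, and all values are made up for this illustration.
creators <- data.frame(
  nameID = c("jdoe", "jdoe", "asmith"),
  surname = c("Doe", "Doe", "Smith"),
  userId = c("id-one", "id-two", "id-three"),
  stringsAsFactors = FALSE
)

# One list element per unique person, each carrying all of that person's userIds.
creator_list <- lapply(unique(creators[["nameID"]]), function(id) {
  rows <- creators[creators[["nameID"]] == id, , drop = FALSE]
  list(
    individualName = list(surName = rows[["surname"]][1]),
    userId = as.list(rows[["userId"]])
  )
})
```

Here the first person ends up with a single entry holding two userIds instead of two near-duplicate entries.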
Now that core-mb has SA tables:
steps to generate <annotations> elements
EML::set_attributes
additional argument in entity creation to not require the flat files
instead of reading in from file
this by a roundabout way is related to #14
some items we might need from old EML docs:
An - this is great! and so easy to read! I have a suggestion:
some of the components in the main function seem like they could be set outside of the function and then passed in. In particular:
con <-dbConnect( ...
views_expected <- c(...
names_short <- c( ...
I think they would be easier to keep up to date that way.
An apostrophe in a CSV cell value in a data table, e.g., "I've had it up to here with bugs", messes up the row count when computing number of records in create_entity. I found that it added an extra record in my case. Perhaps rewrite the function so it has a more robust row counting scheme. RStudio shows the correct number of rows in the data frame, so I know it's possible.
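One possible approach, sketched under the assumption that the row count comes from parsing the file: read it with only the double quote treated as a quoting character, so an apostrophe is ordinary text. This does not reflect how create_entity currently counts records.

```r
# Write a small CSV whose second record contains an apostrophe.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,comment",
    "1,no issues",
    "2,I've had it up to here with bugs",
    "3,all good"
  ),
  csv_path
)

# Treat only the double quote as a quoting character, so the apostrophe
# is read as ordinary text.
records <- utils::read.csv(csv_path, quote = "\"", stringsAsFactors = FALSE)
n_records <- nrow(records)
```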
The current code doesn't work for custom units. For example, if you put in a unit such as "milligramPerGramPerHourPerPhotonFlux", the create_entity function fails.
I ran the code on a Mac.
current VIEWs read in altitudes but do not insert them into EML
Might be built into create_EML or might be its own separate function. Need to see how boilerplate items are handled in metabase.
dplyr can connect to PostgreSQL directly and query from a database. In other words, there's no need for the mb2eml_r schema because we can construct data frames equivalent to mb2eml_r VIEWs directly from within R. This is different from our current approach.
Potential gains:
Might be some upfront work, but there is no need to revamp anything once metadata has been queried in. We can reuse the structure we have now.
@lkuiucsb what do you think?
If a data package has only "otherEntity", there won't be "additional metadata". The current R code would generate an EML node with "NULL" in the additional metadata. The code should be able to remove the node when there is no additional metadata.
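A minimal sketch of the idea, treating the EML document as a nested R list. The helper name and the stub values are hypothetical.

```r
# Hypothetical helper: attach additionalMetadata only when there is
# something to attach, and drop the slot otherwise.
add_additional_metadata <- function(eml, additional) {
  if (is.null(additional) || length(additional) == 0) {
    eml$additionalMetadata <- NULL  # removes the slot if it exists
  } else {
    eml$additionalMetadata <- additional
  }
  eml
}

# Stub documents with made-up values.
eml_stub <- list(packageId = "example.1.1", dataset = list(title = "Example"))
without_units <- add_additional_metadata(eml_stub, NULL)
with_units <- add_additional_metadata(
  eml_stub,
  list(metadata = list(unitList = list(unit = list(id = "exampleUnit"))))
)
```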
Hi @atn38!
EDIutils has undergone a major refactor for submission to rOpenSci and CRAN. This new and improved version covers the full data repository REST API, handles authentication more securely, better matches API call and result syntax, improves documentation, and opens the door for development of wrapper functions to support common data management tasks. In the process of this refactor the function names and call patterns have changed and several functions supporting other EDI R packages have been removed, thereby creating backward-compatibility-breaking changes with the previous major release (version 1.6.1). The previous version will be available until 2022-06-01 on the deprecated branch. Install the previous version with:
remotes::install_github("EDIorg/EDIutils", ref = "deprecated")
EDIutils functions used in your code and suggested replacements:
api_get_provenance_metadata() with get_provenance_metadata().
Please let me know if you have any questions,
@clnsmth
The EML R package will not create elements if it is supplied NAs or NULL elements, but will fail silently with an entirely empty data frame.
Possible solution: insert fake rows with the user-specified datasetid and NAs in all other columns if the query to metabase returns empty.
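A minimal sketch of the placeholder-row idea. The function name, the id_col argument, and the toy columns are assumptions for illustration.

```r
# Hypothetical helper: if a query result has no rows, return one placeholder
# row with the dataset id filled in and NA everywhere else.
pad_empty_result <- function(df, dataset_id, id_col = "datasetid") {
  if (nrow(df) > 0) {
    return(df)
  }
  # Build one NA per column, keeping each column's type.
  placeholder <- as.data.frame(
    lapply(df, function(col) col[NA_integer_]),
    stringsAsFactors = FALSE
  )
  placeholder[[id_col]] <- dataset_id
  placeholder
}

# Toy empty query result; column names are made up.
empty_query <- data.frame(
  datasetid = integer(0),
  keyword = character(0),
  stringsAsFactors = FALSE
)
padded <- pad_empty_result(empty_query, dataset_id = 99L)
```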
For a data package that has instrument or software information, the current code is not able to generate the corresponding XML nodes.
I recently updated MetaEgress and updated my workflow to use the "expand_taxa" and "skip_taxa" arguments to the create_EML function. When I set "expand_taxa=TRUE" my taxa are expanded into a nice tree in the resulting EML. When I set "expand_taxa=FALSE" an invalid EML document is produced. No taxa expansion happens (as expected), but there are some elements in the resulting <taxonomicClassification> element that won't validate (I think <commonname> is the problem but not sure). One good thing about "expand_taxa=FALSE" is that there is a <taxonId> element with the provider="https://itis.gov" attribute. This element does not appear with "expand_taxa=TRUE" as I was originally expecting.
It seems that "expand_taxa=FALSE" should still give valid EML with a taxonomicCoverage element, but I'm not sure where things are going wrong. Let me know if anyone has thoughts on how to correct this. 2 EML documents are attached (=FALSE and =TRUE cases)
What this looks like is a bunch of if/else statements to check for NAs in query results and assign NULLs to EML list items.
Consider making a util function to call.
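Such a utility might look like the sketch below. The name na_to_null and the toy query row are hypothetical.

```r
# Hypothetical utility: turn missing or empty query values into NULL.
na_to_null <- function(x) {
  if (is.null(x) || length(x) == 0 || all(is.na(x))) NULL else x
}

# Toy query result row; the column names are illustrative only.
query_row <- list(title = "Example dataset", pubdate = NA, shortname = NA_character_)

dataset <- list(
  title = na_to_null(query_row[["title"]]),
  pubDate = na_to_null(query_row[["pubdate"]]),
  shortName = na_to_null(query_row[["shortname"]])
)
```

Each repeated if/else then collapses into a single call at the point where the list item is assigned.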
In the block below, we should see a message if the metadata doesn't match the data. However, if the number of unique values in data does not match the number of enumeration codes, we get an unhandled error instead.
if (!all(unique(entity_df[[i]]) %in% c(cats, codes) | c(cats, codes) %in% unique(entity_df[[i]]))) {
  msg <- paste(
    "Enumeration in attribute",
    i,
    "in metadata not matching that in data for entity",
    entity_name
  )
  output_msgs <- c(output_msgs, msg)
}
The error reads:
longer object length is not a multiple of shorter object length
To handle the error, we should first check the lengths. Whether there's a length mismatch or some other mismatch, we should also at least report which attribute and table we were checking.
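One way to sidestep the length problem is to test each direction separately, so two vectors of different lengths are never combined element-wise. The sketch below is one interpretation of the intended check (it requires a match in both directions), and the function and argument names are hypothetical.

```r
# Hypothetical rewrite of the check: test each direction separately so that
# vectors of different lengths are never combined element-wise.
check_enumeration <- function(values, allowed, attribute_name, entity_name) {
  values <- unique(values)
  data_ok <- all(values %in% allowed)      # every data value is described
  metadata_ok <- all(allowed %in% values)  # every code appears in the data
  if (data_ok && metadata_ok) {
    return(NULL)
  }
  paste(
    "Enumeration in attribute", attribute_name,
    "in metadata not matching that in data for entity", entity_name
  )
}

# Three unique data values against two codes: the lengths differ.
msg <- check_enumeration(
  values = c("a", "b", "c", "a"),
  allowed = c("a", "b"),
  attribute_name = "site",
  entity_name = "example_table"
)
```

Because all() reduces each comparison to a single TRUE or FALSE before they are combined, the message always names the attribute and entity being checked.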
create_entity now handles only two entity types: dataTable and otherEntity. Users also have to jot down which entity numbers correspond to which type and pass them in as arguments accordingly.
A better way to handle this would be to make create_entity smarter:
create_EML would also need to be smarter in conjunction:
entities argument instead of users having to specify separate objects for each type
create_entity and insert into appropriate EML slots
optional features:
exclude_entity argument in create_entity to allow exclusion of entities if desired
not RStudio-dependent.
I find this: <additionalMetadata/> at the end of some of the EML documents I create using MetaEgress, but not all. This tag is not matched by any other additionalMetadata elements in the document and is invalid. The pattern I see is that it only happens when I make EML that has all otherEntities and no dataTables. Can't explain why yet but will investigate.
This is due to an if statement throwing an error if there is more than one condition.
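For context, R 4.2.0 and later raise an error when the condition passed to if has length greater than one, while earlier versions only warned. A minimal sketch of collapsing such a condition explicitly, with made-up values:

```r
# A condition vector of length two, as might come from comparing
# several values at once.
conditions <- c(TRUE, FALSE)

# if (conditions) { ... } is an error in R 4.2.0 and later, because the
# condition must have length one. Collapse it explicitly instead.
if (any(conditions)) {
  result_any <- "at least one condition holds"
} else {
  result_any <- "no condition holds"
}

if (all(conditions)) {
  result_all <- "all conditions hold"
} else {
  result_all <- "not all conditions hold"
}
```

Whether any() or all() is the right choice depends on what the original check was meant to express.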