ropensci / dataspice Goto Github PK

View Code? Open in Web Editor NEW

157.0 15.0 26.0 3.17 MB

:hot_pepper: Create lightweight schema.org descriptions of your datasets

Home Page: https://docs.ropensci.org/dataspice

License: Other

R 100.00%

unconf18 schema-org metadata data dataset r r-package rstats unconf

dataspice's Introduction

dataspice

The goal of dataspice is to make it easier for researchers to create basic, lightweight, and concise metadata files for their datasets by editing the kind of files they’re probably most familiar with: CSVs. To spice up their data with a dash of metadata. These metadata files can then be used to:

Make useful information available during analysis.
Create a helpful dataset README webpage for your data similar to how pkgdown creates websites for R packages.
Produce more complex metadata formats for richer description of your datasets and to aid dataset discovery.

Metadata fields are based on Schema.org/Dataset and other metadata standards and represent a lowest common denominator which means converting between formats should be relatively straightforward.

Example

An basic example repository for demonstrating what using dataspice might look like can be found at https://github.com/amoeba/dataspice-example. From there, you can also check out a preview of the HTML dataspice generates at https://amoeba.github.io/dataspice-example and how Google sees it at https://search.google.com/test/rich-results?url=https%3A%2F%2Famoeba.github.io%2Fdataspice-example%2F.

A much more detailed example has been created by Anna Krystalli at https://annakrystalli.me/dataspice-tutorial/ (GitHub repo).

Installation

You can install the latest version from CRAN:

install.packages("dataspice")

Workflow

create_spice()
# Then fill in template CSV files, more on this below
write_spice()
build_site() # Optional

Create spice

create_spice() creates template metadata spreadsheets in a folder (by default created in the data folder in the current working directory).

The template files are:

biblio.csv - for title, abstract, spatial and temporal coverage, etc.
creators.csv - for data authors
attributes.csv - explains each of the variables in the dataset
access.csv - for files, file types, and download URLs (if appropriate)

Fill in templates

The user needs to fill in the details of the four template files. These csv files can be directly modified, or they can be edited using either the associated helper function and/or Shiny app.

Helper functions

prep_attributes() populates the fileName and variableName columns of the attributes.csv file using the header row of the data files.
prep_access() populates the fileName, name and encodingFormat columns of the access.csv file from the files in the folder containing the data.

To see an example of how prep_attributes() works, load the data files that ship with the package:

data_files <- list.files(system.file("example-dataset/", package = "dataspice"),
  pattern = ".csv",
  full.names = TRUE
)

This function assumes that the metadata templates are in a folder called metadata within a data folder.

attributes_path <- file.path("data", "metadata", "attributes.csv")

Using purrr::map(), this function can be applied over multiple files to populate the header names

data_files %>%
  purrr::map(~ prep_attributes(.x, attributes_path),
    attributes_path = attributes_path
  )

The output of prep_attributes() has the first two columns filled out:

fileName	variableName	description	unitText
BroodTables.csv	Stock.ID	NA	NA
BroodTables.csv	Species	NA	NA
BroodTables.csv	Stock	NA	NA
BroodTables.csv	Ocean.Region	NA	NA
BroodTables.csv	Region	NA	NA
BroodTables.csv	Sub.Region	NA	NA

Shiny helper apps

Each of the metadata templates can be edited interactively using a Shiny app:

edit_attributes() opens a Shiny app that can be used to edit attributes.csv. The Shiny app displays the current attributes table and lets the user fill in an informative description and units (e.g. meters, hectares, etc.) for each variable.
edit_access() opens an editable version of access.csv
edit_creators() opens an editable version of creators.csv
edit_biblio() opens an editable version of biblio.csv

Remember to click on Save when finished editing.

Completed metadata files

The first few rows of the completed metadata tables in this example will look like this:

access.csv has one row for each file

fileName	name	contentUrl	encodingFormat
StockInfo.csv	StockInfo.csv	NA	CSV
BroodTables.csv	BroodTables.csv	NA	CSV
SourceInfo.csv	SourceInfo.csv	NA	CSV

attributes.csv has one row for each variable in each file

fileName	variableName	description	unitText
BroodTables.csv	Stock.ID	Unique stock identifier	NA
BroodTables.csv	Species	species of stock	NA
BroodTables.csv	Stock	Stock name, generally river where stock is found	NA
BroodTables.csv	Ocean.Region	Ocean region	NA
BroodTables.csv	Region	Region of stock	NA
BroodTables.csv	Sub.Region	Sub.Region of stock	NA

biblio.csv is one row containing descriptors including spatial and temporal coverage

title	description	datePublished	citation	keywords	license	funder	geographicDescription	northBoundCoord	eastBoundCoord	southBoundCoord	westBoundCoord	wktString	startDate	endDate
Compiled annual statewide Alaskan salmon escapement counts, 1921-2017	The number of mature salmon migrating from the marine environment to freshwater streams is defined as escapement. Escapement data are the enumeration of these migrating fish as they pass upstream, …	2018-02-12 08:00:00	NA	salmon, alaska, escapement	NA	NA	NA	78	-131	47	-171	NA	1921-01-01 08:00:00	2017-01-01 08:00:00

creators.csv has one row for each of the dataset authors

id	name	affiliation	email
NA	Jeanette Clark	National Center for Ecological Analysis and Synthesis	[email protected]
NA	Rich,Brenner	Alaska Department of Fish and Game	richard.brenner.alaska.gov

Save JSON-LD file

write_spice() generates a json-ld file (“linked data”) to aid in dataset discovery, creation of more extensive metadata (e.g. EML), and creating a website.

Here’s a view of the dataspice.json file of the example data:

Build website

build_site() creates a bare-bones index.html file in the repository docs folder with a simple view of the dataset with the metadata and an interactive map. For example, this repository results in this website

Convert to EML

The metadata fields dataspice uses are based largely on their compatibility with terms from Schema.org. However, dataspice metadata can be converted to Ecological Metadata Language (EML), a much richer schema. The conversion isn’t perfect but dataspice will do its best to convert your dataspice metadata to EML:

library(dataspice)

# Load an example dataspice JSON that comes installed with the package
spice <- system.file(
  "examples", "annual-escapement.json",
  package = "dataspice"
)

# Convert it to EML
eml_doc <- spice_to_eml(spice)
#> Warning: variableMeasured not crosswalked to EML because we don't have enough
#> information. Use `crosswalk_variables` to create the start of an EML attributes
#> table. See ?crosswalk_variables for help.
#> You might want to run EML::eml_validate on the result at this point and fix what validations errors are produced. You will commonly need to set `packageId`, `system`, and provide `attributeList` elements for each `dataTable`.

You may receive warnings depending on which dataspice fields you filled in and this process will very likely produce an invalid EML record which is totally fine:

library(EML)
#> 
#> Attaching package: 'EML'
#> The following object is masked from 'package:magrittr':
#> 
#>     set_attributes

eml_validate(eml_doc)
#> [1] FALSE
#> attr(,"errors")
#> [1] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'packageId' is required but missing."                                  
#> [2] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'system' is required but missing."                                     
#> [3] "Element 'dataTable': Missing child element(s). Expected is one of ( physical, coverage, methods, additionalInfo, annotation, attributeList )."
#> [4] "Element 'dataTable': Missing child element(s). Expected is one of ( physical, coverage, methods, additionalInfo, annotation, attributeList )."
#> [5] "Element 'dataTable': Missing child element(s). Expected is one of ( physical, coverage, methods, additionalInfo, annotation, attributeList )."

This is because some fields in dataspice store information in different structures and because EML requires many fields that dataspice doesn’t have fields for. At this point, you should look over the validation errors produced by EML::eml_validate and fix those. Note that this will likely require familiarity with the EML Schema and the EML package.

Once you’re done, you can write out an EML XML file:

out_path <- tempfile()
write_eml(eml_doc, out_path)
#> NULL

Convert from EML

Like converting dataspice to EML, we can convert an existing EML record to a set of dataspice metadata tables which we can then work from within dataspice:

library(EML)

eml_path <- system.file("example-dataset/broodTable_metadata.xml", package = "dataspice")
eml <- read_eml(eml_path)

# Creates four CSVs files in the `data/metadata` directory
my_spice <- eml_to_spice(eml, "data/metadata")

Resources

A few existing tools & data standards to help users in specific domains:

Darwin Core
Ecological Metadata Language (EML) (& EML)
ISO 19115 - Geographic Information Metadata
ISO 19139 - Geographic Info Metadata XML schema
Minimum Information for Biological and Biomedical Investigations (MIBBI)

…And others indexed in Fairsharing.org & the RDA metadata directory.

Code of Conduct

Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Contributors

This package was developed at rOpenSci’s 2018 unconf by (in alphabetical order):

dataspice's People

Contributors

Stargazers

Watchers

dataspice's Issues

Vectorise prep_attributes over all files in the data folder?

Could potentially add an argument to override by extracting column names from only specified paths or excluding specified paths?

Create ropensci blog post

Create blog post (on fork of ropensci/roweb2)
Announce on Slack
Fill in said blog post

shiny app for populating metadata files

I'm starting to work on a shiny app to take the generated metadata templates and make them such that a user who does not know R, can populate them with the needed details.

Find a way to make `contentURL` in `access.csv` be automatic

I'm not entirely sure this is trivial but I hope it is:

When access.csv gets filled in automatically, the contentURL field is blank. The actual URL to the file would follow the GitHub convention for serving raw files over HTTP:

https://raw.githubusercontent.com/{user|or}/{repo}/{branch}/path/to/file.ext

I think we know this or can find this out before we create the HTML. Take a shot at it and report back! Perhaps the git2r does this in a nice way.

Template-prompt/UI functions

Need some functions (shiny app) for prompting users of this package to describe their datasets in the appropriate metadata structure.

Drop `citation` field from biblio.csv

Carl and I realized the citation field from Schema.org is for an external citation, not "how to cite" the Dataset. Drop the file from the templates.

taxonomy notes

i added some things to taxizedb: install like remotes::install_github("ropensci/taxizedb@new-methods")

lowest_common - equiv of same fxn name in taxize, only works with ncbi only for now
taxid2vernacular - get vernacular names from taxonomic ids, works for ncbi only for now

build "time coverage" helper function

prompt user for their time column
should this allow for multiple time columns? (e.g., if their data is structured with before/after times measuring 'durations')

Decide on licensing

We need a license! What did we decide at the unconf?

allow updating, overwriting or trimming of prep_attributes

currently the prep_attributes() function just checks whether variableNames for a file have already been extracted and stops with an error if so. Would be good to offer options to:

overwrite any entries associated with a file.
append new rows for additional variableNames detected in a file.
trim any deprecated rows for variableNames not detected in a file anymore.

EML to dataspice formats

Hey dataspicers,

I wrote up a set of 4 functions (+ helpers) to convert EML to dataspice formats: https://github.com/isteves/emlspice. I'd be happy to merge it into dataspice if you all think it's appropriate. In any case, it would be great to get some feedback! @amoeba @cboettig

-Irene

make a workflow diagram for documentation

workflow:

create_spice(dir = "data")
human fills in templates via R code, edit csvs, or shiny app
write JSON file
build site

Do a metadata standard/methods review

It'd be great to have a survey of metadata standards for the user that is using our package and is new to metadata.

hex sticker needed

Also interested in talking to you! (Psych-DS)

Hi folks - I wanted to introduce a project I'm a part of, including sharing some work a few members have been doing with dataspice! We'd love to think about coordinating efforts with you.

Psych-DS is an in-progress technical specification for datasets in psychology, which uses Schema.org compliant metadata. We began this project as a conference hackathon this summer, and since then have been hashing out the specification and simultaneously coming to learn who all else is working on related projects (including Frictionless Data, and now thanks to your issue #71, DataCrate!). Eventually we hope Psych-DS will support specific subfields to converge on more standard ways of representing particular kinds of data, but have realized that getting social scientists on a shared technical footing & ensuring discoverability is a big deal! The project is heavily inspired by BIDS for neuroimaging data.

We are currently in the stage of wrapping up the draft of the written technical specification and beginning to 'road test' it on some real datasets. Beyond creating dataset-level metadata we are attempting to enforce some basic folder organization, file formatting/naming, and well-structured documentation of variables. (Here's a direct link to the long-form specification doc.)

At the same time, some intrepid coders started working on some standalone R-Shiny apps designed to produce this kind of output. In particular, Erin Buchanan (@doomlab) forked dataspice a while ago to try and tinker with it, and has been working with undergrads to write a tutorial that's appropriate for people who haven't used R before to approach a tool like dataspice.

Here is Erin's update: "We took the structure of dataspice – a grouping of shiny apps that one interacted with using RStudio – and converted it into one Shiny app that is published to the web for anyone to use. The app auto builds the CSV structured files from an uploaded dataset and then allows the user to step through entering the information for the access, attributes, bibliography, and creators files. Here, we fixed a couple typos and other issues that was prohibiting dataspice from being fully functioning and added more detailed instructions. The app still allows users to “write spice” and create a schema.org compliant JSON file and HTML report with some CSS tweaks."

Erin will have more details on the structure of the version she's been working on, and we'd love to talk about ways to join forces or coordinate. In particular, would it make sense to open some issues to merge back Erin's work, or is there a better way to start out than that? If it makes sense, we can put out a call to the Psych-DS mailing list to see if we have some extra hands to help.

Also, if anyone has time to take a look, we'd love to know if you have any feedback on how Psych-DS could work more smoothly with apps like dataspice, or if you see anything we haven't considered that might cause conflicts.

Thanks all!

Melissa & Erin

We should talk...

Hi,

I just became aware of this project, which looks very promising.

We have been working on a similar effort to package research data with schema.org json, also with an HTML file, the DataCrate spec. Looks like the JSON-LD we're producing with various tools (all of which are still in alpha) is quite similar to that here.

Anyway, I think we should look at aligning our efforts. Will anyone from this project be at the Research Object meeting on October 29 in Amsterdam? - I will.

Maybe write functions to set access|biblio|creators|attributes

We proposed three ways to author metadata:

Edit the metadata templates by hand (aka in Excel)
Edit with @aurielfournier 's rad Shiny apps
Use some R functions like set_access

We still need to write (3) I think. Ideally, the roxygen docs for each argument should give the user enough information to successfully fill the info in.

Write a Hugo thing to create a data website from JSON-LD

Thanks to #6

Odd attributes table behaviour in index.html when description and unitText empty.

When attributes.csv description and unitText empty, the index.html attributes table fields name and description are populated with the biblio.csv title and description fields. 🤔

Not had a chance to trace it. Will try to but just wanted to flag it.

Convert Schema JSON-LD into tabular formats?

Currently we have some nice functions (write_spice()) to go from tables -> json-ld. Seems like it would be handy to be able to easily reverse this process as well (e.g. someone gives you a bunch schema.org JSON-LD and you want it as tables where you can filter by geographic coverage and do automated unit conversions based on the measuredVariable metadata etc).

What would such a function be called? read_spice()? What would the return object be? (i.e. a list of data.frames? .csv files on disk? Something else?

Add support for geo coordinates and geopoints to website

blank cells filled with check boxes when editing .csv files

When a .csv file is partially filled via prep_* or edit_* functions, any remaining blank cells are check boxes when the file is opened again. New information cannot be added to cells.
This might be a bug in the shiny apps, but might be a problem with saving the .csv files. I haven't been able to pinpoint it.

Example workflow that should illustrate the problem:

create_spice(dir = "my_project")
prep_access(data_path = here("my_project"),
            access_path = here("my_project", "metadata", "access.csv"))
edit_access(metadata_dir = here("my_project", "metadata"))

Screenshot of problem:

Discussion: Can/should we be able to import a dataspiced dataset from the web/elsewhere?

This comes from a good question in my dataspice demo today: If user X authors a dataspice page for their dataset, and another scientist, Y, wants to use it, it'd be cool if they just ran:

import_spice("https://amoeba.github.io/some-dataset")

And their computer downloaded something like some-dataset.zip which had the dataspice.json and the files described in access.csv attached to it somehow.

edit_biblio adds empty row on save

Been playing around with the edit_biblio() app and noticed that when editing and saving a test biblio.csv, a new empty row is added to the file every time the table is saved in the app.

Not tested in other apps yet. Just want to log it.

Write a crosswalk table

https://docs.google.com/spreadsheets/d/1B5tPQ3HLxefKUJoP7-umd90-yOrRJhD1SriNMsDr_No/edit?usp=sharing

error: premature EOF

All works smoothly until build_site() throwing this error:

build_site(path = "hour_data_all/metadata")
Error in parse_con(txt, bigint_as_char) : parse error: premature EOF
                                       
                     (right here) ------^

Any idea how to fix?

empty first column in `inst/example-dataset/StockInfo.csv`

Should probably correct it in the data file but it also might provide opportunity to think about some validation messages to prompt users to correct common errors in their datasets

unitText - What resource do we point users to?

When users are entering character strings about the units of their measured variables, what resource should we point them to, so that we ensure they are using accepted standards for how to describe those units (ha vs hectares).

do headers need to be in first row?

I think we are assuming that the data files have headers in the first row - do we want to support cases where that is not the case?

Multiple webpages for multiple datasets in the same repo?

Exciting new package 🎉

If I have multiple datasets in the same repository is it possible to build a webpage for each dataset? Or is this package only set up for the 1:1 dataset-to-repo use case?

drop attributes template down to 3 values

column name/value, unit text, description

Identify an example dataset to use to design against

use crayon to add colors to the prep_attributes message

There is a message at the end of prep_attributes to say which variables got added to the attributes table. Make it a nice message so it doesn't seem like an error

simple function to let user "opt in" to linking id's

if they have funders, "want to link to FundRef id's?"
if taxa are included in coverage, "want to auto-build higher taxa?"

("opting in" might be a funny way to put it if it's more of a "point users to a place where they can find their own id's for their own funders/taxa/etc if those are present in their dataset.")

Functions to produce data Roxygen documentation from dataspice metadata

It would be really great if we could produce Roxygen documentation for data from dataspice metadata

Discussion: Non-HTML output formats (like Rmd/md)

Had a great question in my dataspice demo at NCEAS today that dovetailed with something we had talked about at the unconf. What if we output to Rmd or md and the scientist could make use of that as they want. The Rmd/md could be converted to HTML still but the intermediate format would be of greater utility to the user.

Also had a suggestion for converting to Word or Google Docs in some form. The comment also pointed out that a lot of scientists (ours include) use Drive as a storage location over GitHub so being able to work with dataspice in a Google Drive workflow would be useful (making HTML output less useful).

data access not displayed in website

Finally testing out dataspice and it is awesome! I noticed that the distribution section isn't displayed in the built site, however. It would be nice if it did! I know this package isn't under super active development but I thought I'd just add the issue in case anyone has time/interest to have a look

Provide a function to export as EML XML?

A colleague wanted to convert their dataspice to an EML 2.1.1 XML doc today but we couldn't figure it out without a few lines of code:

json <- jsonlite::read_json("mydataspice.json")
eml <- emld::as_emld(json)
emld::as_xml(eml)

Do we have this already in dataspice and I just missed it? If not, would it be reasonable and useful to have a function to do this? We could include emld as a soft dependency and wrap the above code (or similar) in a single function.

Add argument to specify output path in build_site

Not being able to change where index.html is written out to (or indeed the name of the file) makes it problematic if users are using dataspice in a project where docs/ is being used for another purpose (eg hosting a site) where it would overwrite the site index.html.

Because of this, I suggest:

That we actually set the default path to data/metadata/index.html which feels much more tidy and would be accessible at a sensible URL if the repo were being served from the master branch on GitHub.
And an argument to be able to override the default.

Improve look and feel for the HTML

The HTML template I made during the unconf is real basic and not all that attractive. I consider making it look better a must-do.

@maelle had a great idea of using Blogdown or Hugo to generate a more complex site around the JSON-LD

I think it was @cboettig that thought we could automatically put some exploratory data viz type plots or [skimr](https://github.com/ropenscilabs/skimr) output on the page. This would be awesome! See also the cool work in codebook which does a really nice job of this already. Here's what skimr can do:

## Variable type: numeric 
##      variable missing complete   n mean   sd  p0 p25  p50 p75 p100     hist
##  Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9 ▇▁▁▂▅▅▃▁
##   Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5 ▇▁▁▅▃▃▂▂

At the very least, make the pages look as good as pkgdown's output. @cboettig mentioned showing a two-column display so the geo coverage and data access are at the top and less of the content is below the fold. I ❤️ ily agree.

editBiblio needs the data columns better defined

Right now I've defined them all as characters, that may not be the best option. Need to look into this more

Write a crosswalk transformation function

Based on #5

Define metadata template table structures

Define separate CSV/table structures for the internal metadata structures:

(These are just me typing, we haven't decided on these yet)

Bibliographic/citation (title (name), abstract (description)
Coverage (temporal, spatial, taxonomic)
Files (what do we describe here? Download URL, checksum/size?)
Attributes (name, unitText, unitCode, value, description)
Methods

Other things we might do:

Licensing (goes that go in Biblio? probably)
other ideas?

editAttritubes()

In readme.md editAttritubes() returns
Error in editAttritubes() : could not find function "editAttritubes"
I used edit_attributes() and shiny app is launched. Not sure if its a typo?

spec_tbl_df error

When I try to run the example in your README I get the following error after trying to run prep_attributes():

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
library(here)
#> here() starts at /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpmLF1Hr/reprex31138f9597b
library(dataspice)

create_spice()

data_files <- list.files(system.file("example-dataset/", 
                                     package = "dataspice"), 
                         pattern = ".csv",
                        full.names = TRUE)

attributes_path <- here::here("data", "metadata",
 "attributes.csv")

data_files %>% purrr::map(~prep_attributes(.x, attributes_path),
                         attributes_path = attributes_path)
#> The following variableNames have been added to the attributes file for fileName: BroodTables.csv
#> Stock.ID, Species, Stock, Ocean.Region, Region, Sub.Region, Jurisdiction, Lat, Lon, UseFlag, BroodYear, TotalEscapement, R0.1, R0.2, R0.3, R0.4, R0.5, R1.1, R1.2, R1.3, R1.4, R1.5, R2.1, R2.2, R2.3, R2.4, R3.1, R3.2, R3.3, R3.4, R4.1, R4.2, R4.3, TotalRecruits
#> 
#> Error: Can't combine `..1` <spec_tbl_df<
#>   fileName    : character
#>   variableName: character
#>   description : character
#>   unitText    : character
#> >> and `..2` <tbl_df<
#>   fileName    : character
#>   variableName: character
#>   description : character
#>   unitText    : character
#> >>.

^{Created on 2020-05-04 by the reprex package (v0.3.0)}

Also, it doesn't seem to have actually written anything to attributes.csv like it claims.

Idea: Integrate an easy HTML form <-> JSON lib into Shiny aps

The Shiny apps in #22 are super slick. I think they could be improved with a bit of validation. A common way I'd do this in a web app today would be to use something like react-jsonschema-form (HT @cboettig) to display a form to the user that dynamically updates an underlying JSON model and provides validation.

Definitely a backlog type Issue here but I think it'd be nifty.

Write create_spice()

create_spice(dir)

create metadata directory within dir
put templates there

Later, make it smarter and start filling in what we can

ropensci / dataspice Goto Github PK

dataspice's Introduction

dataspice

Example

Installation

Workflow

Create spice

Fill in templates

Helper functions

Shiny helper apps

Completed metadata files

Save JSON-LD file

Build website

Convert to EML

Convert from EML

Resources

Code of Conduct

Contributors

dataspice's People

Contributors

Stargazers

Watchers

Forkers

dataspice's Issues

Recommend Projects

Recommend Topics

Recommend Org