ropensci / dataspice

:hot_pepper: Create lightweight schema.org descriptions of your datasets

Home Page: https://docs.ropensci.org/dataspice

License: Other

Topics: unconf18, schema-org, metadata, data, dataset, r, r-package, rstats, unconf

dataspice's Introduction

dataspice


The goal of dataspice is to make it easier for researchers to create basic, lightweight, and concise metadata files for their datasets by editing the kind of files they’re probably most familiar with: CSVs. In other words, it helps them spice up their data with a dash of metadata. These metadata files can then be used to:

  • Make useful information available during analysis.
  • Create a helpful dataset README webpage for your data similar to how pkgdown creates websites for R packages.
  • Produce more complex metadata formats for richer description of your datasets and to aid dataset discovery.

Metadata fields are based on Schema.org/Dataset and other metadata standards and represent a lowest common denominator, which means converting between formats should be relatively straightforward.

Example

A basic example repository demonstrating what using dataspice might look like can be found at https://github.com/amoeba/dataspice-example. From there, you can also check out a preview of the HTML dataspice generates at https://amoeba.github.io/dataspice-example and how Google sees it at https://search.google.com/test/rich-results?url=https%3A%2F%2Famoeba.github.io%2Fdataspice-example%2F.

A much more detailed example has been created by Anna Krystalli at https://annakrystalli.me/dataspice-tutorial/ (GitHub repo).

Installation

You can install the latest version from CRAN:

install.packages("dataspice")
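
The development version can also be installed from GitHub. This step isn't part of the original README; the remotes call below is the standard pattern for GitHub-hosted R packages and is offered as an assumption:

# install.packages("remotes")  # if not already installed
remotes::install_github("ropensci/dataspice")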

Workflow

create_spice()
# Then fill in template CSV files, more on this below
write_spice()
build_site() # Optional

[Diagram: workflow for using dataspice]

Create spice

create_spice() creates template metadata spreadsheets in a metadata folder (by default inside the data folder of the current working directory).

The template files are:

  • biblio.csv - for title, abstract, spatial and temporal coverage, etc.
  • creators.csv - for data authors
  • attributes.csv - explains each of the variables in the dataset
  • access.csv - for files, file types, and download URLs (if appropriate)
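
A minimal sketch of this step, assuming the default directory layout (create_spice() also accepts a dir argument, as seen in the issues further down this page):

library(dataspice)

# Create the four template CSVs under data/metadata/ (the default)
create_spice()

# Inspect what was created; expected, per the template list above:
# access.csv, attributes.csv, biblio.csv, creators.csv
list.files(file.path("data", "metadata"))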

Fill in templates

The user needs to fill in the details of the four template files. These CSV files can be modified directly, or edited using the associated helper functions and/or Shiny apps.

Helper functions

  • prep_attributes() populates the fileName and variableName columns of the attributes.csv file using the header row of the data files.

  • prep_access() populates the fileName, name, and encodingFormat columns of the access.csv file from the files in the folder containing the data (a sketch follows this list).
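
A sketch of prep_access(), assuming the default data/metadata layout; the argument names follow the usage shown in the issue reports at the end of this page:

# Populate fileName, name, and encodingFormat from the files in data/
prep_access(
  data_path = "data",
  access_path = file.path("data", "metadata", "access.csv")
)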

To see an example of how prep_attributes() works, load the data files that ship with the package:

data_files <- list.files(system.file("example-dataset/", package = "dataspice"),
  pattern = ".csv",
  full.names = TRUE
)

This function assumes that the metadata templates are in a folder called metadata within a data folder.

attributes_path <- file.path("data", "metadata", "attributes.csv")

Using purrr::map(), this function can be applied over multiple files to populate the header names:

# A pipe is needed here, e.g. library(magrittr) or library(dplyr)
data_files %>%
  purrr::map(~ prep_attributes(.x, attributes_path))

The output of prep_attributes() has the first two columns filled out:

fileName         variableName  description  unitText
BroodTables.csv  Stock.ID      NA           NA
BroodTables.csv  Species       NA           NA
BroodTables.csv  Stock         NA           NA
BroodTables.csv  Ocean.Region  NA           NA
BroodTables.csv  Region        NA           NA
BroodTables.csv  Sub.Region    NA           NA

Shiny helper apps

Each of the metadata templates can be edited interactively using a Shiny app:

  • edit_attributes() opens a Shiny app that can be used to edit attributes.csv. The Shiny app displays the current attributes table and lets the user fill in an informative description and units (e.g. meters, hectares, etc.) for each variable.
  • edit_access() opens an editable version of access.csv.
  • edit_creators() opens an editable version of creators.csv.
  • edit_biblio() opens an editable version of biblio.csv.

[Screenshot: the edit_attributes Shiny app]

Remember to click on Save when finished editing.
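
For example (a sketch assuming the default metadata location; the metadata_dir argument follows the usage shown in the issue reports below):

# Open the attributes editor for templates stored in data/metadata/
edit_attributes(metadata_dir = file.path("data", "metadata"))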

Completed metadata files

The first few rows of the completed metadata tables in this example will look like this:

access.csv has one row for each file

fileName         name             contentUrl  encodingFormat
StockInfo.csv    StockInfo.csv    NA          CSV
BroodTables.csv  BroodTables.csv  NA          CSV
SourceInfo.csv   SourceInfo.csv   NA          CSV

attributes.csv has one row for each variable in each file

fileName         variableName  description                                       unitText
BroodTables.csv  Stock.ID      Unique stock identifier                           NA
BroodTables.csv  Species       species of stock                                  NA
BroodTables.csv  Stock         Stock name, generally river where stock is found  NA
BroodTables.csv  Ocean.Region  Ocean region                                      NA
BroodTables.csv  Region        Region of stock                                   NA
BroodTables.csv  Sub.Region    Sub.Region of stock                               NA

biblio.csv is a single row containing dataset-level descriptors, including spatial and temporal coverage (shown transposed here for readability):

title:                  Compiled annual statewide Alaskan salmon escapement counts, 1921-2017
description:            The number of mature salmon migrating from the marine environment to freshwater streams is defined as escapement. Escapement data are the enumeration of these migrating fish as they pass upstream, …
datePublished:          2018-02-12 08:00:00
citation:               NA
keywords:               salmon, alaska, escapement
license:                NA
funder:                 NA
geographicDescription:  NA
northBoundCoord:        78
eastBoundCoord:         -131
southBoundCoord:        47
westBoundCoord:         -171
wktString:              NA
startDate:              1921-01-01 08:00:00
endDate:                2017-01-01 08:00:00

creators.csv has one row for each of the dataset authors

id  name            affiliation                                             email
NA  Jeanette Clark  National Center for Ecological Analysis and Synthesis   [email protected]
NA  Rich Brenner    Alaska Department of Fish and Game                      richard.brenner.alaska.gov

Save JSON-LD file

write_spice() generates a JSON-LD file (“linked data”) to aid in dataset discovery, creation of more extensive metadata (e.g. EML), and creating a website.
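
A sketch of this step, assuming the completed templates live in the default data/metadata folder (check ?write_spice for the exact signature):

# Read the four completed CSVs and write data/metadata/dataspice.json
write_spice("data/metadata")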

Here’s a view of the dataspice.json file of the example data:

[Screenshot: listviewer output showing an example dataspice JSON file]

Build website

  • build_site() creates a bare-bones index.html file in the repository’s docs folder, with a simple view of the dataset metadata and an interactive map (a sketch follows). For example, the dataspice-example repository linked above produces the preview site at https://amoeba.github.io/dataspice-example.
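
A sketch of this step; the path argument follows the usage in the “premature EOF” issue below, so treat the exact signature as an assumption and check ?build_site:

# Build docs/index.html from the generated metadata
build_site(path = "data/metadata")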

[Screenshot: the generated dataspice website]

Convert to EML

The metadata fields dataspice uses are based largely on their compatibility with terms from Schema.org. However, dataspice metadata can be converted to Ecological Metadata Language (EML), a much richer schema. The conversion isn’t perfect but dataspice will do its best to convert your dataspice metadata to EML:

library(dataspice)

# Load an example dataspice JSON that comes installed with the package
spice <- system.file(
  "examples", "annual-escapement.json",
  package = "dataspice"
)

# Convert it to EML
eml_doc <- spice_to_eml(spice)
#> Warning: variableMeasured not crosswalked to EML because we don't have enough
#> information. Use `crosswalk_variables` to create the start of an EML attributes
#> table. See ?crosswalk_variables for help.
#> You might want to run EML::eml_validate on the result at this point and fix what validations errors are produced. You will commonly need to set `packageId`, `system`, and provide `attributeList` elements for each `dataTable`.

You may receive warnings depending on which dataspice fields you filled in, and this process will very likely produce an invalid EML record, which is totally fine:

library(EML)
#> 
#> Attaching package: 'EML'
#> The following object is masked from 'package:magrittr':
#> 
#>     set_attributes

eml_validate(eml_doc)
#> [1] FALSE
#> attr(,"errors")
#> [1] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'packageId' is required but missing."                                  
#> [2] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'system' is required but missing."                                     
#> [3] "Element 'dataTable': Missing child element(s). Expected is one of ( physical, coverage, methods, additionalInfo, annotation, attributeList )."
#> [4] "Element 'dataTable': Missing child element(s). Expected is one of ( physical, coverage, methods, additionalInfo, annotation, attributeList )."
#> [5] "Element 'dataTable': Missing child element(s). Expected is one of ( physical, coverage, methods, additionalInfo, annotation, attributeList )."

This is because some fields in dataspice store information in different structures and because EML requires many fields that dataspice has no equivalent for. At this point, you should look over the validation errors produced by EML::eml_validate and fix them (a sketch follows). Note that this will likely require familiarity with the EML schema and the EML package.
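
For instance, a hedged sketch of addressing the first two errors above: emld documents behave like plain R lists, so the missing attributes can be set directly. The values here are placeholders, and the dataTable errors would still need real physical/attributeList elements:

# Placeholder identifiers -- replace with real values for your repository
eml_doc$packageId <- "my-dataset-id"
eml_doc$system <- "local"

# Re-validate; only the dataTable errors should remain
eml_validate(eml_doc)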

Once you’re done, you can write out an EML XML file:

out_path <- tempfile()
write_eml(eml_doc, out_path)
#> NULL

Convert from EML

Just as dataspice metadata can be converted to EML, an existing EML record can be converted to a set of dataspice metadata tables, which we can then work with in dataspice:

library(EML)

eml_path <- system.file("example-dataset/broodTable_metadata.xml", package = "dataspice")
eml <- read_eml(eml_path)
# Creates four CSV files in the `data/metadata` directory
my_spice <- eml_to_spice(eml, "data/metadata")

Resources

A few existing tools & data standards can help users in specific domains; see also the resources indexed in FAIRsharing.org and the RDA metadata standards directory.

Code of Conduct

Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Contributors

This package was developed at rOpenSci’s 2018 unconf by the contributors listed below (in alphabetical order).

dataspice's People

Contributors

amoeba, annakrystalli, aurielfournier, cboettig, ccamara, isteves, karawoo, kylehamilton, maelle, magpiedin, mattforshaw, njtierney, pakillo, robitalec, tdjames1


dataspice's Issues

shiny app for populating metadata files

I'm starting to work on a Shiny app that takes the generated metadata templates and makes it possible for a user who does not know R to populate them with the needed details.

Find a way to make `contentURL` in `access.csv` be automatic

I'm not entirely sure this is trivial but I hope it is:

When access.csv gets filled in automatically, the contentUrl field is left blank. The actual URL to the file would follow the GitHub convention for serving raw files over HTTP:

https://raw.githubusercontent.com/{user|org}/{repo}/{branch}/path/to/file.ext

I think we know this, or can find it out, before we create the HTML. Take a shot at it and report back! Perhaps git2r does this in a nice way.
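
A hypothetical helper illustrating the idea (the name and approach are illustrative only; auto-detecting the owner/repo/branch, e.g. via git2r, is the part still to be worked out):

# Build a raw.githubusercontent.com URL from known repo details
raw_github_url <- function(owner, repo, branch, path) {
  sprintf(
    "https://raw.githubusercontent.com/%s/%s/%s/%s",
    owner, repo, branch, path
  )
}

raw_github_url("amoeba", "dataspice-example", "main", "data/BroodTables.csv")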

Template-prompt/UI functions

Need some functions (a Shiny app) for prompting users of this package to describe their datasets in the appropriate metadata structure.

Drop `citation` field from biblio.csv

Carl and I realized the citation field from Schema.org is for an external citation, not "how to cite" the Dataset. Drop the field from the templates.

taxonomy notes

I added some things to taxizedb; install like remotes::install_github("ropensci/taxizedb@new-methods")

  • lowest_common - equivalent of the function of the same name in taxize; works with NCBI only for now
  • taxid2vernacular - get vernacular names from taxonomic IDs; works with NCBI only for now

build "time coverage" helper function

  • prompt user for their time column
  • should this allow for multiple time columns? (e.g., if their data is structured with before/after times measuring 'durations')
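
A rough sketch of the single-column case (entirely hypothetical; the function name and behavior are illustrative):

# Derive startDate/endDate for biblio.csv from one time column
prep_time_coverage <- function(df, time_col) {
  times <- as.Date(df[[time_col]])
  list(
    startDate = min(times, na.rm = TRUE),
    endDate   = max(times, na.rm = TRUE)
  )
}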

allow updating, overwriting or trimming of prep_attributes

Currently the prep_attributes() function just checks whether variableNames for a file have already been extracted and stops with an error if so. It would be good to offer options to:

  • overwrite any entries associated with a file.
  • append new rows for additional variableNames detected in a file.
  • trim any deprecated rows for variableNames not detected in a file anymore.
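
One way the options above could be exposed (a hypothetical signature; the current function and its argument names may differ):

# mode = "overwrite", "append", or "trim" (hypothetical argument)
prep_attributes(
  "data/BroodTables.csv",
  attributes_path = "data/metadata/attributes.csv",
  mode = "append"
)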

Also interested in talking to you! (Psych-DS)

Hi folks - I wanted to introduce a project I'm a part of, including sharing some work a few members have been doing with dataspice! We'd love to think about coordinating efforts with you.

Psych-DS is an in-progress technical specification for datasets in psychology, which uses Schema.org-compliant metadata. We began this project as a conference hackathon this summer, and since then have been hashing out the specification and simultaneously coming to learn who else is working on related projects (including Frictionless Data, and now, thanks to your issue #71, DataCrate!). Eventually we hope Psych-DS will support specific subfields in converging on more standard ways of representing particular kinds of data, but we have realized that getting social scientists on a shared technical footing & ensuring discoverability is a big deal! The project is heavily inspired by BIDS for neuroimaging data.

We are currently in the stage of wrapping up the draft of the written technical specification and beginning to 'road test' it on some real datasets. Beyond creating dataset-level metadata we are attempting to enforce some basic folder organization, file formatting/naming, and well-structured documentation of variables. (Here's a direct link to the long-form specification doc.)

At the same time, some intrepid coders started working on some standalone R-Shiny apps designed to produce this kind of output. In particular, Erin Buchanan (@doomlab) forked dataspice a while ago to try and tinker with it, and has been working with undergrads to write a tutorial that's appropriate for people who haven't used R before to approach a tool like dataspice.

Here is Erin's update: "We took the structure of dataspice – a grouping of shiny apps that one interacted with using RStudio – and converted it into one Shiny app that is published to the web for anyone to use. The app auto-builds the CSV-structured files from an uploaded dataset and then allows the user to step through entering the information for the access, attributes, bibliography, and creators files. Here, we fixed a couple of typos and other issues that were preventing dataspice from being fully functional, and added more detailed instructions. The app still allows users to “write spice” and create a schema.org compliant JSON file and HTML report with some CSS tweaks."

Erin will have more details on the structure of the version she's been working on, and we'd love to talk about ways to join forces or coordinate. In particular, would it make sense to open some issues to merge back Erin's work, or is there a better way to start out than that? If it makes sense, we can put out a call to the Psych-DS mailing list to see if we have some extra hands to help.

Also, if anyone has time to take a look, we'd love to know if you have any feedback on how Psych-DS could work more smoothly with apps like dataspice, or if you see anything we haven't considered that might cause conflicts.

Thanks all!

  • Melissa & Erin

We should talk...

Hi,

I just became aware of this project, which looks very promising.

We have been working on a similar effort to package research data with schema.org JSON, also with an HTML file: the DataCrate spec. It looks like the JSON-LD we're producing with our various tools (all of which are still in alpha) is quite similar to what you have here.

Anyway, I think we should look at aligning our efforts. Will anyone from this project be at the Research Object meeting on October 29 in Amsterdam? - I will.

Maybe write functions to set access|biblio|creators|attributes

We proposed three ways to author metadata:

  1. Edit the metadata templates by hand (aka in Excel)
  2. Edit with @aurielfournier 's rad Shiny apps
  3. Use some R functions like set_access

We still need to write (3), I think. Ideally, the roxygen docs for each argument should give the user enough information to fill in the info successfully.

Convert Schema JSON-LD into tabular formats?

Currently we have some nice functions (write_spice()) to go from tables -> JSON-LD. It seems like it would be handy to be able to easily reverse this process as well (e.g. someone gives you a bunch of schema.org JSON-LD and you want it as tables where you can filter by geographic coverage and do automated unit conversions based on the variableMeasured metadata, etc.).

What would such a function be called? read_spice()? And what would the return object be (a list of data.frames? .csv files on disk? Something else)?
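
A rough sketch of what such a function might return (entirely hypothetical; read_spice() does not exist in the package, and the field names assume the JSON-LD layout write_spice() produces):

# Read a dataspice JSON-LD file back into tables
read_spice <- function(path) {
  spice <- jsonlite::read_json(path, simplifyVector = TRUE)
  list(
    # Consistent records simplify to data frames with simplifyVector = TRUE
    attributes = spice$variableMeasured,
    access     = spice$distribution
  )
}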

blank cells filled with check boxes when editing .csv files

When a .csv file is partially filled in via the prep_* or edit_* functions, any remaining blank cells appear as check boxes when the file is opened again, and new information cannot be added to those cells.
This might be a bug in the Shiny apps, but it might also be a problem with how the .csv files are saved. I haven't been able to pinpoint it.

Example workflow that should illustrate the problem:

library(here)
library(dataspice)

create_spice(dir = "my_project")
prep_access(data_path = here("my_project"),
            access_path = here("my_project", "metadata", "access.csv"))
edit_access(metadata_dir = here("my_project", "metadata"))

[Screenshot of the problem]

Discussion: Can/should we be able to import a dataspiced dataset from the web/elsewhere?

This comes from a good question in my dataspice demo today: if user X authors a dataspice page for their dataset, and another scientist, Y, wants to use it, it'd be cool if they could just run:

import_spice("https://amoeba.github.io/some-dataset")

And their computer would download something like some-dataset.zip, containing the dataspice.json and the files described in access.csv.
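
A very rough sketch of how import_spice() might work (hypothetical; it assumes the JSON-LD lists downloadable files under distribution, mirroring access.csv):

# Hypothetical: fetch dataspice.json plus the files it describes
import_spice <- function(url, dest_dir = ".") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  spice <- jsonlite::fromJSON(paste0(url, "/dataspice.json"),
                              simplifyVector = FALSE)
  for (dist in spice$distribution) {
    if (!is.null(dist$contentUrl)) {
      utils::download.file(dist$contentUrl,
                           destfile = file.path(dest_dir, dist$name))
    }
  }
  invisible(spice)
}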

edit_biblio adds empty row on save

Been playing around with the edit_biblio() app and noticed that when editing and saving a test biblio.csv, a new empty row is added to the file every time the table is saved in the app.

Not tested in other apps yet. Just want to log it.

error: premature EOF

All works smoothly until build_site(), which throws this error:

build_site(path = "hour_data_all/metadata")
Error in parse_con(txt, bigint_as_char) : parse error: premature EOF
                                       
                     (right here) ------^

Any idea how to fix?

unitText - What resource do we point users to?

When users are entering character strings describing the units of their measured variables, what resource should we point them to so that we can ensure they are using accepted standards for describing those units ("ha" vs "hectares")?

simple function to let user "opt in" to linking IDs

  • if they have funders: "want to link to FundRef IDs?"
  • if taxa are included in coverage: "want to auto-build higher taxa?"

("Opting in" might be a funny way to put it if it's more of a "point users to a place where they can find their own IDs for their own funders/taxa/etc. if those are present in their dataset".)

Discussion: Non-HTML output formats (like Rmd/md)

Had a great question in my dataspice demo at NCEAS today that dovetailed with something we talked about at the unconf: what if we output Rmd or md that the scientist could make use of however they want? The Rmd/md could still be converted to HTML, but the intermediate format would be of greater utility to the user.

Also had a suggestion for converting to Word or Google Docs in some form. The commenter also pointed out that a lot of scientists (ours included) use Drive as a storage location rather than GitHub, so being able to work with dataspice in a Google Drive workflow would be useful (and HTML output less so).

data access not displayed in website

Finally testing out dataspice and it is awesome! I noticed that the distribution section isn't displayed in the built site, however. It would be nice if it were! I know this package isn't under super-active development, but I thought I'd add the issue in case anyone has the time/interest to take a look.

Provide a function to export as EML XML?

A colleague wanted to convert their dataspice to an EML 2.1.1 XML doc today, but we couldn't figure it out without a few lines of code:

json <- jsonlite::read_json("mydataspice.json")
eml <- emld::as_emld(json)
emld::as_xml(eml)

Do we already have this in dataspice and I just missed it? If not, would it be reasonable and useful to have a function that does this? We could include emld as a soft dependency and wrap the above code (or similar) in a single function.
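
A sketch of such a wrapper (the function name is hypothetical, and emld::as_xml() accepting an output file argument is an assumption worth checking):

# Hypothetical wrapper: dataspice JSON -> EML XML, emld as a soft dependency
spice_to_eml_xml <- function(spice_path, out_path) {
  if (!requireNamespace("emld", quietly = TRUE)) {
    stop("Package 'emld' is needed to export EML XML.")
  }
  json <- jsonlite::read_json(spice_path)
  eml <- emld::as_emld(json)
  emld::as_xml(eml, out_path)  # assumption: as_xml() writes to a file path
}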

Add argument to specify output path in build_site

Not being able to change where index.html is written (or indeed the name of the file) makes it problematic for users using dataspice in a project where docs/ serves another purpose (e.g. hosting a site), since build_site() would overwrite that site's index.html.

Because of this, I suggest:

  1. That we set the default path to data/metadata/index.html, which feels much tidier and would be accessible at a sensible URL if the repo were served from the master branch on GitHub.

  2. That we add an argument to override the default.
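
The proposal in code form (the out_path argument is hypothetical; it does not exist in the current build_site()):

# New default output location, plus a hypothetical override argument
build_site(
  path = "data/metadata",
  out_path = "data/metadata/index.html"
)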

Improve look and feel for the HTML

The HTML template I made during the unconf is really basic and not all that attractive. I consider making it look better a must-do.

  • @maelle had a great idea of using Blogdown or Hugo to generate a more complex site around the JSON-LD

  • I think it was @cboettig who thought we could automatically put some exploratory data-viz-type plots or skimr (https://github.com/ropenscilabs/skimr) output on the page. This would be awesome! See also the cool work in codebook, which does a really nice job of this already. Here's what skimr can do:

    ## Variable type: numeric 
    ##      variable missing complete   n mean   sd  p0 p25  p50 p75 p100     hist
    ##  Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9 ▇▁▁▂▅▅▃▁
    ##   Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5 ▇▁▁▅▃▃▂▂
    
  • At the very least, make the pages look as good as pkgdown's output. @cboettig mentioned showing a two-column display so the geo coverage and data access are at the top and less of the content is below the fold. I ❤️ily agree.

Define metadata template table structures

Define separate CSV/table structures for the internal metadata structures:

(These are just me typing, we haven't decided on these yet)

  • Bibliographic/citation (title (name), abstract (description))
  • Coverage (temporal, spatial, taxonomic)
  • Files (what do we describe here? Download URL, checksum/size?)
  • Attributes (name, unitText, unitCode, value, description)
  • Methods

Other things we might do:

  • Licensing (does that go in Biblio? probably)
  • other ideas?

editAttritubes()

In README.md, editAttritubes() returns:
Error in editAttritubes() : could not find function "editAttritubes"
I used edit_attributes() instead and the Shiny app launched. Not sure if it's a typo?

spec_tbl_df error

When I try to run the example in your README, I get the following error after running prep_attributes():

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
library(here)
#> here() starts at /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpmLF1Hr/reprex31138f9597b
library(dataspice)

create_spice()

data_files <- list.files(system.file("example-dataset/", 
                                     package = "dataspice"), 
                         pattern = ".csv",
                        full.names = TRUE)

attributes_path <- here::here("data", "metadata",
 "attributes.csv")

data_files %>% purrr::map(~prep_attributes(.x, attributes_path),
                         attributes_path = attributes_path)
#> The following variableNames have been added to the attributes file for fileName: BroodTables.csv
#> Stock.ID, Species, Stock, Ocean.Region, Region, Sub.Region, Jurisdiction, Lat, Lon, UseFlag, BroodYear, TotalEscapement, R0.1, R0.2, R0.3, R0.4, R0.5, R1.1, R1.2, R1.3, R1.4, R1.5, R2.1, R2.2, R2.3, R2.4, R3.1, R3.2, R3.3, R3.4, R4.1, R4.2, R4.3, TotalRecruits
#> 
#> Error: Can't combine `..1` <spec_tbl_df<
#>   fileName    : character
#>   variableName: character
#>   description : character
#>   unitText    : character
#> >> and `..2` <tbl_df<
#>   fileName    : character
#>   variableName: character
#>   description : character
#>   unitText    : character
#> >>.

Created on 2020-05-04 by the reprex package (v0.3.0)

Also, it doesn't seem to have actually written anything to attributes.csv like it claims.

Write create_spice()

create_spice(dir)

  • create metadata directory within dir
  • put templates there

Later, make it smarter and start filling in what we can

Data-to-metadata conversion functions

Need some functions for auto-extracting metadata from the dataset itself.

(Not to be confused with functions for converting other metadata [e.g., stuff written up by a researcher / not derived from the dataset itself] into standardly-structured metadata.)
