ropensci / dataspice
:hot_pepper: Create lightweight schema.org descriptions of your datasets
Home Page: https://docs.ropensci.org/dataspice
License: Other
Finally testing out dataspice and it is awesome! I noticed that the distribution section isn't displayed in the built site, however. It would be nice if it were! I know this package isn't under super active development, but I thought I'd add the issue in case anyone has time/interest to have a look.
Right now I've defined them all as characters; that may not be the best option. Need to look into this more.
Had a great question in my dataspice demo at NCEAS today that dovetailed with something we had talked about at the unconf. What if we output to Rmd or md and the scientist could make use of that as they want. The Rmd/md could be converted to HTML still but the intermediate format would be of greater utility to the user.
Also had a suggestion for converting to Word or Google Docs in some form. The comment also pointed out that a lot of scientists (ours included) use Drive as a storage location over GitHub, so being able to work with dataspice in a Google Drive workflow would be useful (making HTML output less useful).
workflow:
create_spice(dir = "data")
There is a message at the end of prep_attributes to say which variables got added to the attributes table. Make it a nice message so it doesn't seem like an error.
In README.md, editAttritubes() returns:
Error in editAttritubes() : could not find function "editAttritubes"
I used edit_attributes() and the Shiny app launched. Not sure if it's a typo?
Need some functions (shiny app) for prompting users of this package to describe their datasets in the appropriate metadata structure.
Need some functions for auto-extracting data from dataset itself
(Not to be confused with functions for converting other metadata [e.g., stuff written up by a researcher/not derived from the dataset itself] to standardly-structured-metadata)
We need a license! What did we decide at the unconf?
When the attributes.csv description and unitText fields are empty, the index.html attributes table fields name and description are populated with the biblio.csv title and description fields. 🤔
Not had a chance to trace it. Will try to but just wanted to flag it.
I added some things to taxizedb: install like remotes::install_github("ropensci/taxizedb@new-methods")
lowest_common - equivalent of the same function name in taxize, works with NCBI only for now
taxid2vernacular - get vernacular names from taxonomic IDs, works with NCBI only for now

Carl and I realized the citation field from Schema.org is for an external citation, not "how to cite" the Dataset. Drop the file from the templates.
Not being able to change where index.html is written out to (or indeed the name of the file) is problematic if users are using dataspice in a project where docs/ is being used for another purpose (e.g. hosting a site), where it would overwrite the site's index.html.
Because of this, I suggest:
That we set the default path to data/metadata/index.html, which feels much tidier and would be accessible at a sensible URL if the repo were being served from the master branch on GitHub.
And an argument to be able to override the default.
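A minimal sketch of what that override could look like. Both the function and the argument name (write_index, out_path) are hypothetical, not the current dataspice API:

```r
# Hypothetical helper: write rendered HTML to a configurable location,
# defaulting to data/metadata/index.html as suggested above.
write_index <- function(html,
                        out_path = file.path("data", "metadata", "index.html")) {
  dir.create(dirname(out_path), recursive = TRUE, showWarnings = FALSE)
  writeLines(html, out_path)
  invisible(out_path)
}
```

build_site() could then route its rendered page through such an argument instead of hard-coding docs/index.html.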
Hey dataspicers,
I wrote up a set of 4 functions (+ helpers) to convert EML to dataspice formats: https://github.com/isteves/emlspice. I'd be happy to merge it into dataspice if you all think it's appropriate. In any case, it would be great to get some feedback! @amoeba @cboettig
-Irene
We proposed three ways to author metadata:
set_access
We still need to write (3) I think. Ideally, the roxygen docs for each argument should give the user enough information to successfully fill the info in.
All works smoothly until build_site(), which throws this error:
build_site(path = "hour_data_all/metadata")
Error in parse_con(txt, bigint_as_char) : parse error: premature EOF
(right here) ------^
Any idea how to fix?
When users are entering character strings about the units of their measured variables, what resource should we point them to so that we ensure they are using accepted standards for how to describe those units (ha vs. hectares)?
Currently we have some nice functions (write_spice()) to go from tables -> JSON-LD. It seems like it would be handy to be able to easily reverse this process as well (e.g. someone gives you a bunch of schema.org JSON-LD and you want it as tables where you can filter by geographic coverage and do automated unit conversions based on the measuredVariable metadata, etc.).
What would such a function be called? read_spice()? What would the return object be (i.e. a list of data.frames? .csv files on disk? Something else)?
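A hypothetical sketch of that inverse, splitting the JSON-LD-to-tables mapping out so it can be reasoned about separately from file I/O. The function names, return shape, and field mapping here are all assumptions, not the package API; jsonlite is only needed for the parsing step:

```r
# Fallback operator: use b when a is NULL (missing JSON-LD fields -> NA).
`%||%` <- function(a, b) if (is.null(a)) b else a

# Map an already-parsed schema.org Dataset list to dataspice-style tables.
spice_tables <- function(json) {
  # biblio.csv-like table from top-level Dataset fields
  biblio <- data.frame(
    title       = json$name %||% NA_character_,
    description = json$description %||% NA_character_,
    stringsAsFactors = FALSE
  )
  # attributes.csv-like table from variableMeasured entries
  attributes <- do.call(rbind, lapply(json$variableMeasured, function(v) {
    data.frame(
      variableName = v$name %||% NA_character_,
      description  = v$description %||% NA_character_,
      unitText     = v$unitText %||% NA_character_,
      stringsAsFactors = FALSE
    )
  }))
  list(biblio = biblio, attributes = attributes)
}

# Hypothetical read_spice(): parse a JSON-LD file and return the tables.
read_spice <- function(path) {
  spice_tables(jsonlite::read_json(path))
}
```

Returning a named list of data.frames would sidestep the on-disk question: callers could filter in memory or write the tables out wherever they like.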
create_spice(dir)
Later, make it smarter and start filling in what we can.
The HTML template I made during the unconf is real basic and not all that attractive. I consider making it look better a must-do.
@maelle had a great idea of using Blogdown or Hugo to generate a more complex site around the JSON-LD
I think it was @cboettig that thought we could automatically put some exploratory data viz type plots or [skimr](https://github.com/ropenscilabs/skimr) output on the page. This would be awesome! See also the cool work in codebook, which does a really nice job of this already. Here's what skimr can do:
## Variable type: numeric
##     variable missing complete   n mean   sd  p0 p25  p50 p75 p100 hist
## Petal.Length       0      150 150 3.76 1.77   1 1.6 4.35 5.1  6.9 (sparkline)
##  Petal.Width       0      150 150  1.2 0.76 0.1 0.3  1.3 1.8  2.5 (sparkline)
At the very least, make the pages look as good as pkgdown's output. @cboettig mentioned showing a two-column display so the geo coverage and data access are at the top and less of the content is below the fold. I heartily agree.
I'm not entirely sure this is trivial but I hope it is:
When access.csv gets filled in automatically, the contentURL field is blank. The actual URL to the file would follow the GitHub convention for serving raw files over HTTP:
https://raw.githubusercontent.com/{user|org}/{repo}/{branch}/path/to/file.ext
I think we know this or can find this out before we create the HTML. Take a shot at it and report back! Perhaps git2r does this in a nice way.
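As a base-R sketch of that convention (raw_github_url is a hypothetical helper; discovering the user/repo/branch from the local clone, e.g. with git2r, is left out here):

```r
# Build a raw.githubusercontent.com URL for a file tracked in a GitHub repo,
# following the {user|org}/{repo}/{branch}/path convention described above.
raw_github_url <- function(user, repo, branch, file_path) {
  sprintf("https://raw.githubusercontent.com/%s/%s/%s/%s",
          user, repo, branch, file_path)
}

raw_github_url("ropensci", "dataspice", "master", "data/mydata.csv")
#> [1] "https://raw.githubusercontent.com/ropensci/dataspice/master/data/mydata.csv"
```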
When a .csv file is partially filled via prep_* or edit_* functions, any remaining blank cells are check boxes when the file is opened again. New information cannot be added to cells.
This might be a bug in the shiny apps, but might be a problem with saving the .csv files. I haven't been able to pinpoint it.
Example workflow that should illustrate the problem:
create_spice(dir = "my_project")
prep_access(data_path = here("my_project"),
access_path = here("my_project", "metadata", "access.csv"))
edit_access(metadata_dir = here("my_project", "metadata"))
Been playing around with the edit_biblio() app and noticed that when editing and saving a test biblio.csv, a new empty row is added to the file every time the table is saved in the app.
Not tested in other apps yet. Just want to log it.
Hi,
I just became aware of this project, which looks very promising.
We have been working on a similar effort to package research data with schema.org json, also with an HTML file, the DataCrate spec. Looks like the JSON-LD we're producing with various tools (all of which are still in alpha) is quite similar to that here.
Anyway, I think we should look at aligning our efforts. Will anyone from this project be at the Research Object meeting on October 29 in Amsterdam? - I will.
I think we are assuming that the data files have headers in the first row - do we want to support cases where that is not the case?
Define separate CSV/table structures for the internal metadata structures:
(These are just me typing, we haven't decided on these yet)
Other things we might do:
Hi folks - I wanted to introduce a project I'm a part of, including sharing some work a few members have been doing with dataspice! We'd love to think about coordinating efforts with you.
Psych-DS is an in-progress technical specification for datasets in psychology, which uses Schema.org compliant metadata. We began this project as a conference hackathon this summer, and since then have been hashing out the specification and simultaneously coming to learn who all else is working on related projects (including Frictionless Data, and now thanks to your issue #71, DataCrate!). Eventually we hope Psych-DS will support specific subfields to converge on more standard ways of representing particular kinds of data, but have realized that getting social scientists on a shared technical footing & ensuring discoverability is a big deal! The project is heavily inspired by BIDS for neuroimaging data.
We are currently in the stage of wrapping up the draft of the written technical specification and beginning to 'road test' it on some real datasets. Beyond creating dataset-level metadata we are attempting to enforce some basic folder organization, file formatting/naming, and well-structured documentation of variables. (Here's a direct link to the long-form specification doc.)
At the same time, some intrepid coders started working on some standalone R-Shiny apps designed to produce this kind of output. In particular, Erin Buchanan (@doomlab) forked dataspice a while ago to try and tinker with it, and has been working with undergrads to write a tutorial that's appropriate for people who haven't used R before to approach a tool like dataspice.
Here is Erin's update: "We took the structure of dataspice (a grouping of Shiny apps that one interacted with using RStudio) and converted it into one Shiny app that is published to the web for anyone to use. The app auto-builds the CSV-structured files from an uploaded dataset and then allows the user to step through entering the information for the access, attributes, bibliography, and creators files. Here, we fixed a couple of typos and other issues that were preventing dataspice from being fully functional and added more detailed instructions. The app still allows users to 'write spice' and create a schema.org-compliant JSON file and HTML report with some CSS tweaks."
Erin will have more details on the structure of the version she's been working on, and we'd love to talk about ways to join forces or coordinate. In particular, would it make sense to open some issues to merge back Erin's work, or is there a better way to start out than that? If it makes sense, we can put out a call to the Psych-DS mailing list to see if we have some extra hands to help.
Also, if anyone has time to take a look, we'd love to know if you have any feedback on how Psych-DS could work more smoothly with apps like dataspice, or if you see anything we haven't considered that might cause conflicts.
Thanks all!
Currently the prep_attributes() function just checks whether variableNames for a file have already been extracted and stops with an error if so. It would be good to offer options to:
When I try to run the example in your README I get the following error after trying to run prep_attributes():
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
library(here)
#> here() starts at /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpmLF1Hr/reprex31138f9597b
library(dataspice)
create_spice()
data_files <- list.files(system.file("example-dataset/",
package = "dataspice"),
pattern = ".csv",
full.names = TRUE)
attributes_path <- here::here("data", "metadata",
"attributes.csv")
data_files %>% purrr::map(~prep_attributes(.x, attributes_path),
attributes_path = attributes_path)
#> The following variableNames have been added to the attributes file for fileName: BroodTables.csv
#> Stock.ID, Species, Stock, Ocean.Region, Region, Sub.Region, Jurisdiction, Lat, Lon, UseFlag, BroodYear, TotalEscapement, R0.1, R0.2, R0.3, R0.4, R0.5, R1.1, R1.2, R1.3, R1.4, R1.5, R2.1, R2.2, R2.3, R2.4, R3.1, R3.2, R3.3, R3.4, R4.1, R4.2, R4.3, TotalRecruits
#>
#> Error: Can't combine `..1` <spec_tbl_df<
#> fileName : character
#> variableName: character
#> description : character
#> unitText : character
#> >> and `..2` <tbl_df<
#> fileName : character
#> variableName: character
#> description : character
#> unitText : character
#> >>.
Created on 2020-05-04 by the reprex package (v0.3.0)
Also, it doesn't seem to have actually written anything to attributes.csv like it claims.
Thanks to #6
("opting in" might be a funny way to put it if it's more of a "point users to a place where they can find their own id's for their own funders/taxa/etc if those are present in their dataset.")
Based on #5
This comes from a good question in my dataspice demo today: if user X authors a dataspice page for their dataset, and another scientist, Y, wants to use it, it'd be cool if they could just run:
import_spice("https://amoeba.github.io/some-dataset")
And their computer downloaded something like some-dataset.zip, which had the dataspice.json and the files described in access.csv attached to it somehow.
I'm starting to work on a Shiny app to take the generated metadata templates and make them such that a user who does not know R can populate them with the needed details.
A colleague wanted to convert their dataspice to an EML 2.1.1 XML doc today but we couldn't figure it out without a few lines of code:
json <- jsonlite::read_json("mydataspice.json")
eml <- emld::as_emld(json)
emld::as_xml(eml)
Do we have this already in dataspice and I just missed it? If not, would it be reasonable and useful to have a function to do this? We could include emld as a soft dependency and wrap the above code (or similar) in a single function.
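If it doesn't exist yet, a thin wrapper over the snippet above might look something like this. spice_to_eml is a hypothetical name, and emld is treated as a Suggests-level dependency checked at run time, per the soft-dependency idea:

```r
# Hypothetical wrapper: convert a dataspice JSON-LD file to an EML XML doc.
# Not a real dataspice function; emld is a soft dependency here.
spice_to_eml <- function(json_path,
                         xml_path = sub("\\.json$", ".xml", json_path)) {
  if (!requireNamespace("emld", quietly = TRUE)) {
    stop("Package 'emld' is needed for this function. Please install it.",
         call. = FALSE)
  }
  json <- jsonlite::read_json(json_path)
  eml  <- emld::as_emld(json)
  emld::as_xml(eml, xml_path)
  invisible(xml_path)
}
```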
It'd be great to have a survey of metadata standards for users of our package who are new to metadata.
Exciting new package!
If I have multiple datasets in the same repository is it possible to build a webpage for each dataset? Or is this package only set up for the 1:1 dataset-to-repo use case?
Should probably correct it in the data file, but it also might provide an opportunity to think about some validation messages to prompt users to correct common errors in their datasets.
column name/value, unit text, description
It would be really great if we could produce Roxygen documentation for data from dataspice metadata.
The Shiny apps in #22 are super slick. I think they could be improved with a bit of validation. A common way I'd do this in a web app today would be to use something like react-jsonschema-form (HT @cboettig) to display a form to the user that dynamically updates an underlying JSON model and provides validation.
Definitely a backlog type Issue here but I think it'd be nifty.
Could potentially add an argument to override by extracting column names from only specified paths or excluding specified paths?