ropensci / dataspice
:hot_pepper: Create lightweight schema.org descriptions of your datasets
Home Page: https://docs.ropensci.org/dataspice
License: Other
Finally testing out dataspice and it is awesome! I noticed that the distribution section isn't displayed in the built site, however. It would be nice if it were! I know this package isn't under super active development, but I thought I'd add the issue in case anyone has time/interest to have a look.
Right now I've defined them all as characters; that may not be the best option. Need to look into this more.
Had a great question in my dataspice demo at NCEAS today that dovetailed with something we had talked about at the unconf. What if we output to Rmd or md and the scientist could make use of that as they want. The Rmd/md could be converted to HTML still but the intermediate format would be of greater utility to the user.
Also had a suggestion for converting to Word or Google Docs in some form. The comment also pointed out that a lot of scientists (ours included) use Drive as a storage location over GitHub, so being able to work with dataspice in a Google Drive workflow would be useful (making HTML output less useful).
workflow:
create_spice(dir = "data")
There is a message at the end of prep_attributes to say which variables got added to the attributes table. Make it a nice message so it doesn't seem like an error.
In README.md, editAttritubes() returns:
Error in editAttritubes() : could not find function "editAttritubes"
I used edit_attributes() and the Shiny app launched. Not sure if it's a typo?
Need some functions (shiny app) for prompting users of this package to describe their datasets in the appropriate metadata structure.
Need some functions for auto-extracting data from dataset itself
(Not to be confused with functions for converting other metadata [e.g., stuff written up by a researcher/not derived from the dataset itself] to standardly-structured-metadata)
We need a license! What did we decide at the unconf?
When the attributes.csv description and unitText fields are empty, the index.html attributes table fields name and description are populated with the biblio.csv title and description fields. 🤔
Not had a chance to trace it. Will try to but just wanted to flag it.
I added some things to taxizedb: install like remotes::install_github("ropensci/taxizedb@new-methods")
lowest_common - equivalent of the same function name in taxize, works with NCBI only for now
taxid2vernacular - get vernacular names from taxonomic IDs, works with NCBI only for now

Carl and I realized the citation field from Schema.org is for an external citation, not "how to cite" the Dataset. Drop the file from the templates.
Not being able to change where index.html is written out to (or indeed the name of the file) is problematic if users are using dataspice in a project where docs/ is being used for another purpose (e.g. hosting a site), where it would overwrite the site's index.html.
Because of this, I suggest:
That we set the default path to data/metadata/index.html, which feels much tidier and would be accessible at a sensible URL if the repo were being served from the master branch on GitHub.
And an argument to be able to override the default.
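A minimal sketch of what that override could look like. Both the function and the argument name (write_index, out_path) are hypothetical, not the current dataspice API:

```r
# Hypothetical helper: write rendered HTML to a configurable location,
# defaulting to data/metadata/index.html as suggested above.
write_index <- function(html,
                        out_path = file.path("data", "metadata", "index.html")) {
  dir.create(dirname(out_path), recursive = TRUE, showWarnings = FALSE)
  writeLines(html, out_path)
  invisible(out_path)
}
```

build_site() could then route its rendered page through such an argument instead of hard-coding docs/index.html.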
Hey dataspicers,
I wrote up a set of 4 functions (+ helpers) to convert EML to dataspice formats: https://github.com/isteves/emlspice. I'd be happy to merge it into dataspice if you all think it's appropriate. In any case, it would be great to get some feedback! @amoeba @cboettig
-Irene
We proposed three ways to author metadata:
set_access
We still need to write (3) I think. Ideally, the roxygen docs for each argument should give the user enough information to successfully fill the info in.
All works smoothly until build_site(), which throws this error:
build_site(path = "hour_data_all/metadata")
Error in parse_con(txt, bigint_as_char) : parse error: premature EOF
(right here) ------^
Any idea how to fix?
When users are entering character strings about the units of their measured variables, what resource should we point them to so that we ensure they are using accepted standards for how to describe those units (ha vs. hectares)?
Currently we have some nice functions (write_spice()) to go from tables -> JSON-LD. It seems like it would be handy to be able to easily reverse this process as well (e.g. someone gives you a bunch of schema.org JSON-LD and you want it as tables where you can filter by geographic coverage and do automated unit conversions based on the measuredVariable metadata, etc.).
What would such a function be called? read_spice()? What would the return object be (i.e. a list of data.frames? .csv files on disk? Something else)?
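A hypothetical sketch of that inverse, splitting the JSON-LD-to-tables mapping out so it can be reasoned about separately from file I/O. The function names, return shape, and field mapping here are all assumptions, not the package API; jsonlite is only needed for the parsing step:

```r
# Fallback operator: use b when a is NULL (missing JSON-LD fields -> NA).
`%||%` <- function(a, b) if (is.null(a)) b else a

# Map an already-parsed schema.org Dataset list to dataspice-style tables.
spice_tables <- function(json) {
  # biblio.csv-like table from top-level Dataset fields
  biblio <- data.frame(
    title       = json$name %||% NA_character_,
    description = json$description %||% NA_character_,
    stringsAsFactors = FALSE
  )
  # attributes.csv-like table from variableMeasured entries
  attributes <- do.call(rbind, lapply(json$variableMeasured, function(v) {
    data.frame(
      variableName = v$name %||% NA_character_,
      description  = v$description %||% NA_character_,
      unitText     = v$unitText %||% NA_character_,
      stringsAsFactors = FALSE
    )
  }))
  list(biblio = biblio, attributes = attributes)
}

# Hypothetical read_spice(): parse a JSON-LD file and return the tables.
read_spice <- function(path) {
  spice_tables(jsonlite::read_json(path))
}
```

Returning a named list of data.frames would sidestep the on-disk question: callers could filter in memory or write the tables out wherever they like.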
create_spice(dir)
Later, make it smarter and start filling in what we can.
The HTML template I made during the unconf is real basic and not all that attractive. I consider making it look better a must-do.
@maelle had a great idea of using Blogdown or Hugo to generate a more complex site around the JSON-LD
I think it was @cboettig that thought we could automatically put some exploratory data viz type plots or [skimr](https://github.com/ropenscilabs/skimr) output on the page. This would be awesome! See also the cool work in codebook, which does a really nice job of this already. Here's what skimr can do:
## Variable type: numeric
##     variable missing complete   n mean   sd  p0 p25  p50 p75 p100 hist
## Petal.Length       0      150 150 3.76 1.77   1 1.6 4.35 5.1  6.9 (sparkline)
##  Petal.Width       0      150 150  1.2 0.76 0.1 0.3  1.3 1.8  2.5 (sparkline)
At the very least, make the pages look as good as pkgdown's output. @cboettig mentioned showing a two-column display so the geo coverage and data access are at the top and less of the content is below the fold. I heartily agree.
I'm not entirely sure this is trivial but I hope it is:
When access.csv gets filled in automatically, the contentURL field is blank. The actual URL to the file would follow the GitHub convention for serving raw files over HTTP:
https://raw.githubusercontent.com/{user|org}/{repo}/{branch}/path/to/file.ext
I think we know this or can find this out before we create the HTML. Take a shot at it and report back! Perhaps git2r does this in a nice way.
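As a base-R sketch of that convention (raw_github_url is a hypothetical helper; discovering the user/repo/branch from the local clone, e.g. with git2r, is left out here):

```r
# Build a raw.githubusercontent.com URL for a file tracked in a GitHub repo,
# following the {user|org}/{repo}/{branch}/path convention described above.
raw_github_url <- function(user, repo, branch, file_path) {
  sprintf("https://raw.githubusercontent.com/%s/%s/%s/%s",
          user, repo, branch, file_path)
}

raw_github_url("ropensci", "dataspice", "master", "data/mydata.csv")
#> [1] "https://raw.githubusercontent.com/ropensci/dataspice/master/data/mydata.csv"
```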
When a .csv file is partially filled via prep_* or edit_* functions, any remaining blank cells are check boxes when the file is opened again. New information cannot be added to cells.
This might be a bug in the shiny apps, but might be a problem with saving the .csv files. I haven't been able to pinpoint it.
Example workflow that should illustrate the problem:
create_spice(dir = "my_project")
prep_access(data_path = here("my_project"),
access_path = here("my_project", "metadata", "access.csv"))
edit_access(metadata_dir = here("my_project", "metadata"))
Been playing around with the edit_biblio() app and noticed that when editing and saving a test biblio.csv, a new empty row is added to the file every time the table is saved in the app.
Not tested in other apps yet. Just want to log it.
Hi,
I just became aware of this project, which looks very promising.
We have been working on a similar effort to package research data with schema.org json, also with an HTML file, the DataCrate spec. Looks like the JSON-LD we're producing with various tools (all of which are still in alpha) is quite similar to that here.
Anyway, I think we should look at aligning our efforts. Will anyone from this project be at the Research Object meeting on October 29 in Amsterdam? - I will.
I think we are assuming that the data files have headers in the first row - do we want to support cases where that is not the case?
Define separate CSV/table structures for the internal metadata structures:
(These are just me typing, we haven't decided on these yet)
Other things we might do:
Hi folks - I wanted to introduce a project I'm a part of, including sharing some work a few members have been doing with dataspice! We'd love to think about coordinating efforts with you.
Psych-DS is an in-progress technical specification for datasets in psychology, which uses Schema.org compliant metadata. We began this project as a conference hackathon this summer, and since then have been hashing out the specification and simultaneously coming to learn who all else is working on related projects (including Frictionless Data, and now thanks to your issue #71, DataCrate!). Eventually we hope Psych-DS will support specific subfields to converge on more standard ways of representing particular kinds of data, but have realized that getting social scientists on a shared technical footing & ensuring discoverability is a big deal! The project is heavily inspired by BIDS for neuroimaging data.
We are currently in the stage of wrapping up the draft of the written technical specification and beginning to 'road test' it on some real datasets. Beyond creating dataset-level metadata we are attempting to enforce some basic folder organization, file formatting/naming, and well-structured documentation of variables. (Here's a direct link to the long-form specification doc.)
At the same time, some intrepid coders started working on some standalone R-Shiny apps designed to produce this kind of output. In particular, Erin Buchanan (@doomlab) forked dataspice a while ago to try and tinker with it, and has been working with undergrads to write a tutorial that's appropriate for people who haven't used R before to approach a tool like dataspice.
Here is Erin's update: "We took the structure of dataspice (a grouping of Shiny apps that one interacted with using RStudio) and converted it into one Shiny app that is published to the web for anyone to use. The app auto-builds the CSV-structured files from an uploaded dataset and then allows the user to step through entering the information for the access, attributes, bibliography, and creators files. Here, we fixed a couple of typos and other issues that were preventing dataspice from being fully functional and added more detailed instructions. The app still allows users to 'write spice' and create a schema.org-compliant JSON file and HTML report with some CSS tweaks."
Erin will have more details on the structure of the version she's been working on, and we'd love to talk about ways to join forces or coordinate. In particular, would it make sense to open some issues to merge back Erin's work, or is there a better way to start out than that? If it makes sense, we can put out a call to the Psych-DS mailing list to see if we have some extra hands to help.
Also, if anyone has time to take a look, we'd love to know if you have any feedback on how Psych-DS could work more smoothly with apps like dataspice, or if you see anything we haven't considered that might cause conflicts.
Thanks all!
Currently the prep_attributes() function just checks whether variableNames for a file have already been extracted and stops with an error if so. It would be good to offer options to:
When I try to run the example in your README I get the following error after trying to run prep_attributes():
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
library(here)
#> here() starts at /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpmLF1Hr/reprex31138f9597b
library(dataspice)
create_spice()
data_files <- list.files(system.file("example-dataset/",
package = "dataspice"),
pattern = ".csv",
full.names = TRUE)
attributes_path <- here::here("data", "metadata",
"attributes.csv")
data_files %>% purrr::map(~prep_attributes(.x, attributes_path),
attributes_path = attributes_path)
#> The following variableNames have been added to the attributes file for fileName: BroodTables.csv
#> Stock.ID, Species, Stock, Ocean.Region, Region, Sub.Region, Jurisdiction, Lat, Lon, UseFlag, BroodYear, TotalEscapement, R0.1, R0.2, R0.3, R0.4, R0.5, R1.1, R1.2, R1.3, R1.4, R1.5, R2.1, R2.2, R2.3, R2.4, R3.1, R3.2, R3.3, R3.4, R4.1, R4.2, R4.3, TotalRecruits
#>
#> Error: Can't combine `..1` <spec_tbl_df<
#> fileName : character
#> variableName: character
#> description : character
#> unitText : character
#> >> and `..2` <tbl_df<
#> fileName : character
#> variableName: character
#> description : character
#> unitText : character
#> >>.
Created on 2020-05-04 by the reprex package (v0.3.0)
Also, it doesn't seem to have actually written anything to attributes.csv like it claims.
Thanks to #6
("opting in" might be a funny way to put it if it's more of a "point users to a place where they can find their own id's for their own funders/taxa/etc if those are present in their dataset.")
Based on #5
This comes from a good question in my dataspice demo today: if user X authors a dataspice page for their dataset, and another scientist, Y, wants to use it, it'd be cool if they could just run:
import_spice("https://amoeba.github.io/some-dataset")
And their computer downloaded something like some-dataset.zip, which had the dataspice.json and the files described in access.csv attached to it somehow.
I'm starting to work on a Shiny app to take the generated metadata templates and make them such that a user who does not know R can populate them with the needed details.
A colleague wanted to convert their dataspice to an EML 2.1.1 XML doc today but we couldn't figure it out without a few lines of code:
json <- jsonlite::read_json("mydataspice.json")
eml <- emld::as_emld(json)
emld::as_xml(eml)
Do we have this already in dataspice and I just missed it? If not, would it be reasonable and useful to have a function to do this? We could include emld as a soft dependency and wrap the above code (or similar) in a single function.
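If it doesn't exist yet, a thin wrapper over the snippet above might look something like this. spice_to_eml is a hypothetical name, and emld is treated as a Suggests-level dependency checked at run time, per the soft-dependency idea:

```r
# Hypothetical wrapper: convert a dataspice JSON-LD file to an EML XML doc.
# Not a real dataspice function; emld is a soft dependency here.
spice_to_eml <- function(json_path,
                         xml_path = sub("\\.json$", ".xml", json_path)) {
  if (!requireNamespace("emld", quietly = TRUE)) {
    stop("Package 'emld' is needed for this function. Please install it.",
         call. = FALSE)
  }
  json <- jsonlite::read_json(json_path)
  eml  <- emld::as_emld(json)
  emld::as_xml(eml, xml_path)
  invisible(xml_path)
}
```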
It'd be great to have a survey of metadata standards for users of our package who are new to metadata.
Exciting new package!
If I have multiple datasets in the same repository is it possible to build a webpage for each dataset? Or is this package only set up for the 1:1 dataset-to-repo use case?
Should probably correct it in the data file, but it also might provide an opportunity to think about some validation messages to prompt users to correct common errors in their datasets.
column name/value, unit text, description
It would be really great if we could produce Roxygen documentation for data from dataspice metadata.
The Shiny apps in #22 are super slick. I think they could be improved with a bit of validation. A common way I'd do this in a web app today would be to use something like react-jsonschema-form (HT @cboettig) to display a form to the user that dynamically updates an underlying JSON model and provides validation.
Definitely a backlog type Issue here but I think it'd be nifty.
Could potentially add an argument to override by extracting column names from only specified paths or excluding specified paths?