
ropensci / dataspice


🌶️ Create lightweight schema.org descriptions of your datasets

Home Page: https://docs.ropensci.org/dataspice

License: Other

R 100.00%
unconf18 schema-org metadata data dataset r r-package rstats unconf

dataspice's Issues

data access not displayed in website

Finally testing out dataspice and it is awesome! I noticed that the distribution section isn't displayed in the built site, however. It would be nice if it were! I know this package isn't under super active development, but I thought I'd add the issue in case anyone has time/interest to take a look.

build "time coverage" helper function

  • prompt user for their time column (see the sketch below)
  • should this allow for multiple time columns? (e.g., if their data is structured with before/after times measuring 'durations')
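
A rough sketch of what such a helper might look like; the name get_time_coverage() and its interface are assumptions for illustration, not an existing dataspice function:

get_time_coverage <- function(df, time_cols) {
  # Combine one or more time columns into a single Date vector
  times <- do.call(c, lapply(df[time_cols], as.Date))
  # Return the temporalCoverage start and end
  range(times, na.rm = TRUE)
}

# e.g., a dataset with before/after columns measuring durations:
# get_time_coverage(my_data, c("start_date", "end_date"))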

Discussion: Non-HTML output formats (like Rmd/md)

Had a great question in my dataspice demo at NCEAS today that dovetailed with something we had talked about at the unconf: what if we output to Rmd or md, and the scientist could make use of that as they want? The Rmd/md could still be converted to HTML, but the intermediate format would be of greater utility to the user.

Also had a suggestion for converting to Word or Google Docs in some form. The commenter also pointed out that a lot of scientists (ours included) use Drive as a storage location over GitHub, so being able to work with dataspice in a Google Drive workflow would be useful (making HTML output less useful).

editAttritubes()

In README.md, editAttritubes() returns:
Error in editAttritubes() : could not find function "editAttritubes"
I used edit_attributes() and the Shiny app launched. Not sure if it's a typo?

Template-prompt/UI functions

Need some functions (shiny app) for prompting users of this package to describe their datasets in the appropriate metadata structure.

Data-to-metadata conversion functions

Need some functions for auto-extracting metadata from the dataset itself.

(Not to be confused with functions for converting other metadata [e.g., material written up by a researcher, not derived from the dataset itself] into standardly-structured metadata.)
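
A minimal illustration of what auto-extraction could mean here (the file path is hypothetical; the columns mirror the attributes template):

# Pull variable names straight from a data file; descriptions and units
# are left blank for the researcher to fill in
df <- read.csv("data/mydata.csv")  # hypothetical example file
data.frame(
  fileName     = "mydata.csv",
  variableName = names(df),
  description  = NA_character_,
  unitText     = NA_character_
)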

taxonomy notes

I added some things to taxizedb; install with remotes::install_github("ropensci/taxizedb@new-methods"). A usage sketch follows the list below.

  • lowest_common - equivalent of the function of the same name in taxize; works with NCBI only for now
  • taxid2vernacular - get vernacular names from taxonomic IDs; works with NCBI only for now
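
A usage sketch based on the descriptions above; the exact signatures are assumptions, so check the taxizedb docs:

remotes::install_github("ropensci/taxizedb@new-methods")
library(taxizedb)

# Lowest common taxon for a set of NCBI taxonomic IDs
# (9606 = Homo sapiens, 9598 = Pan troglodytes)
lowest_common(c(9606, 9598), db = "ncbi")

# Vernacular (common) names from the same IDs
taxid2vernacular(c(9606, 9598), db = "ncbi")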

Drop `citation` field from biblio.csv

Carl and I realized the citation field from Schema.org is for an external citation, not "how to cite" the Dataset. Drop the field from the templates.

Add argument to specify output path in build_site

Not being able to change where index.html is written out to (or indeed the name of the file) is problematic if users are using dataspice in a project where docs/ is used for another purpose (e.g., hosting a site), since build_site would overwrite the site's index.html.

Because of this, I suggest:

  1. That we actually set the default path to data/metadata/index.html which feels much more tidy and would be accessible at a sensible URL if the repo were being served from the master branch on GitHub.

  2. And an argument to override the default (a possible signature is sketched below).
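
Purely as a sketch of the proposal; the out_path argument name is an assumption, not the current API:

# Default keeps the page alongside the metadata; out_path overrides both
# the directory and the file name
build_site <- function(path = "data/metadata",
                       out_path = "data/metadata/index.html") {
  # ... render the JSON-LD found at `path` into HTML at `out_path` ...
}

# e.g., avoid clobbering a docs/-hosted site:
# build_site(out_path = "data/metadata/index.html")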

Maybe write functions to set access|biblio|creators|attributes

We proposed three ways to author metadata:

  1. Edit the metadata templates by hand (aka in Excel)
  2. Edit with @aurielfournier 's rad Shiny apps
  3. Use some R functions like set_access

We still need to write (3), I think. Ideally, the roxygen docs for each argument should give the user enough information to successfully fill the info in.
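
A sketch of what one of these setters could look like; set_access() exists only as the proposal above, and the column names are assumed to follow the access.csv template:

# Fill in (or replace) a single row of access.csv from R
set_access <- function(fileName, name, contentUrl, encodingFormat,
                       access_path = "data/metadata/access.csv") {
  access <- read.csv(access_path, stringsAsFactors = FALSE)
  row <- data.frame(fileName = fileName, name = name,
                    contentUrl = contentUrl,
                    encodingFormat = encodingFormat)
  # Drop any existing entry for this file, then append the new row
  access <- rbind(access[access$fileName != fileName, ], row)
  write.csv(access, access_path, row.names = FALSE)
  invisible(access)
}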

error: premature EOF

All works smoothly until build_site() throwing this error:

build_site(path = "hour_data_all/metadata")
Error in parse_con(txt, bigint_as_char) : parse error: premature EOF
                     (right here) ------^

Any idea how to fix?
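
Not a confirmed fix, but one way to narrow it down: a premature EOF usually means the JSON-LD file that build_site() reads is empty or malformed, which can be checked directly (assuming it lives at hour_data_all/metadata/dataspice.json):

# Check whether the generated JSON-LD parses at all
json_txt <- paste(readLines("hour_data_all/metadata/dataspice.json"),
                  collapse = "\n")
jsonlite::validate(json_txt)
# FALSE, or an empty file, would suggest write_spice() failed or was
# never run, which would explain the premature EOF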

unitText - What resource do we point users to?

When users are entering character strings describing the units of their measured variables, what resource should we point them to so that we ensure they are using accepted standards for how to describe those units ("ha" vs. "hectares")?

Convert Schema JSON-LD into tabular formats?

Currently we have some nice functions (write_spice()) to go from tables -> JSON-LD. It seems like it would be handy to easily reverse this process as well (e.g., someone gives you a bunch of schema.org JSON-LD and you want it as tables where you can filter by geographic coverage and do automated unit conversions based on the measuredVariable metadata, etc.).

What would such a function be called? read_spice()? What would the return object be? (i.e., a list of data.frames? .csv files on disk? Something else?)
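
A minimal sketch of one possible answer; read_spice() does not exist, and returning a named list of data frames is just one of the options raised above:

# Flatten a dataspice JSON-LD document back into a named list of tables
read_spice <- function(json_path) {
  doc <- jsonlite::fromJSON(json_path)
  list(
    biblio   = data.frame(name = doc$name, description = doc$description),
    creators = as.data.frame(doc$creator),
    access   = as.data.frame(doc$distribution),
    attrs    = as.data.frame(doc$variableMeasured)
  )
}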

Write create_spice()

create_spice(dir)

  • create metadata directory within dir
  • put templates there

Later, make it smarter and start filling in what we can
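
A minimal sketch of the basic version, assuming the templates ship with the package (the inst/templates location is an assumption):

create_spice <- function(dir = ".") {
  meta_dir <- file.path(dir, "metadata")
  dir.create(meta_dir, recursive = TRUE, showWarnings = FALSE)
  # Copy the CSV templates shipped with the package into the new
  # metadata directory
  templates <- list.files(system.file("templates", package = "dataspice"),
                          full.names = TRUE)
  file.copy(templates, meta_dir)
}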

Improve look and feel for the HTML

The HTML template I made during the unconf is really basic and not all that attractive. I consider making it look better a must-do.

  • @maelle had a great idea of using Blogdown or Hugo to generate a more complex site around the JSON-LD

  • I think it was @cboettig who thought we could automatically put some exploratory data viz plots or skimr (https://github.com/ropenscilabs/skimr) output on the page. This would be awesome! See also the cool work in codebook, which does a really nice job of this already. Here's what skimr can do:

    ## Variable type: numeric 
    ##      variable missing complete   n mean   sd  p0 p25  p50 p75 p100     hist
    ##  Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9 ▇▁▁▂▅▅▃▁
    ##   Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5 ▇▁▁▅▃▃▂▂
    
  • At the very least, make the pages look as good as pkgdown's output. @cboettig mentioned showing a two-column display so the geo coverage and data access are at the top and less of the content is below the fold. I ❤️ily agree.

Find a way to make `contentURL` in `access.csv` be automatic

I'm not entirely sure this is trivial, but I hope it is:

When access.csv gets filled in automatically, the contentURL field is blank. The actual URL to the file would follow the GitHub convention for serving raw files over HTTP:

https://raw.githubusercontent.com/{user|org}/{repo}/{branch}/path/to/file.ext

I think we know this, or can find it out, before we create the HTML. Take a shot at it and report back! Perhaps git2r does this in a nice way.
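
A rough sketch of how this could work with git2r; build_raw_url() is hypothetical, the git2r calls are real, and error handling plus non-GitHub remotes are ignored:

library(git2r)

build_raw_url <- function(file, repo_path = ".") {
  repo <- repository(repo_path)
  origin <- remote_url(repo, remote = "origin")
  # Normalize HTTPS or SSH remote URLs down to "user/repo"
  slug <- sub("\\.git$", "",
              sub("^(https://github\\.com/|git@github\\.com:)", "", origin))
  branch <- repository_head(repo)$name
  sprintf("https://raw.githubusercontent.com/%s/%s/%s", slug, branch, file)
}

# build_raw_url("data/mydata.csv")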

blank cells filled with check boxes when editing .csv files

When a .csv file is partially filled in via the prep_* or edit_* functions, any remaining blank cells are displayed as check boxes when the file is opened again, and new information cannot be added to those cells.
This might be a bug in the Shiny apps, but it might also be a problem with how the .csv files are saved; I haven't been able to pinpoint it. (One guess at the cause is sketched below the screenshot.)

Example workflow that should illustrate the problem:

library(dataspice)
library(here)

create_spice(dir = "my_project")
prep_access(data_path = here("my_project"),
            access_path = here("my_project", "metadata", "access.csv"))
edit_access(metadata_dir = here("my_project", "metadata"))

Screenshot of the problem: (image not included)
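
One unconfirmed guess at the cause: columns that are entirely blank get read back in as logical NA, which table-editing widgets typically render as checkboxes. If so, forcing character columns on re-read would be one sketch of a fix:

# Read every column of the template as character so blank cells stay
# editable text fields rather than logical NA (rendered as checkboxes)
readr::read_csv(
  here("my_project", "metadata", "access.csv"),
  col_types = readr::cols(.default = readr::col_character())
)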

edit_biblio adds empty row on save

Been playing around with the edit_biblio() app and noticed that when editing and saving a test biblio.csv, a new empty row is added to the file every time the table is saved in the app.

Not tested in other apps yet. Just want to log it.

We should talk...

Hi,

I just became aware of this project, which looks very promising.

We have been working on a similar effort to package research data with schema.org JSON, also with an HTML file: the DataCrate spec. It looks like the JSON-LD we're producing with our various tools (all of which are still in alpha) is quite similar to what you have here.

Anyway, I think we should look at aligning our efforts. Will anyone from this project be at the Research Object meeting on October 29 in Amsterdam? I will be there.

Define metadata template table structures

Define separate CSV/table structures for the internal metadata structures (an illustrative sketch follows the lists below):

(These are just me typing, we haven't decided on these yet)

  • Bibliographic/citation (title (name), abstract (description))
  • Coverage (temporal, spatial, taxonomic)
  • Files (what do we describe here? Download URL, checksum/size?)
  • Attributes (name, unitText, unitCode, value, description)
  • Methods

Other things we might do:

  • Licensing (does that go in Biblio? probably)
  • other ideas?
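
Purely as illustration of the candidate structures above (column names are guesses from the lists; nothing is decided):

# Empty template tables with the candidate columns
biblio <- data.frame(title = character(), abstract = character(),
                     license = character())  # if licensing lands in biblio
coverage <- data.frame(startDate = character(), endDate = character(),
                       wkt = character(), taxa = character())
files <- data.frame(fileName = character(), downloadUrl = character(),
                    checksum = character(), fileSize = character())
attributes <- data.frame(variableName = character(), unitText = character(),
                         unitCode = character(), description = character())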

Also interested in talking to you! (Psych-DS)

Hi folks - I wanted to introduce a project I'm a part of, including sharing some work a few members have been doing with dataspice! We'd love to think about coordinating efforts with you.

Psych-DS is an in-progress technical specification for datasets in psychology, which uses Schema.org-compliant metadata. We began this project as a conference hackathon this summer, and since then have been hashing out the specification while coming to learn who else is working on related projects (including Frictionless Data and now, thanks to your issue #71, DataCrate!). Eventually we hope Psych-DS will support specific subfields in converging on more standard ways of representing particular kinds of data, but we have realized that getting social scientists on a shared technical footing and ensuring discoverability is a big deal! The project is heavily inspired by BIDS for neuroimaging data.

We are currently in the stage of wrapping up the draft of the written technical specification and beginning to 'road test' it on some real datasets. Beyond creating dataset-level metadata, we are attempting to enforce some basic folder organization, file formatting/naming, and well-structured documentation of variables. (Here's a direct link to the long-form specification doc.)

At the same time, some intrepid coders started working on some standalone R-Shiny apps designed to produce this kind of output. In particular, Erin Buchanan (@doomlab) forked dataspice a while ago to try and tinker with it, and has been working with undergrads to write a tutorial that's appropriate for people who haven't used R before to approach a tool like dataspice.

Here is Erin's update: "We took the structure of dataspice (a grouping of Shiny apps that one interacted with using RStudio) and converted it into one Shiny app that is published to the web for anyone to use. The app auto-builds the CSV-structured files from an uploaded dataset and then lets the user step through entering the information for the access, attributes, bibliography, and creators files. Here, we fixed a couple of typos and other issues that were preventing dataspice from being fully functional, and added more detailed instructions. The app still allows users to "write spice" and create a schema.org-compliant JSON file and HTML report with some CSS tweaks."

Erin will have more details on the structure of the version she's been working on, and we'd love to talk about ways to join forces or coordinate. In particular, would it make sense to open some issues to merge back Erin's work, or is there a better way to start out than that? If it makes sense, we can put out a call to the Psych-DS mailing list to see if we have some extra hands to help.

Also, if anyone has time to take a look, we'd love to know if you have any feedback on how Psych-DS could work more smoothly with apps like dataspice, or if you see anything we haven't considered that might cause conflicts.

Thanks all!

  • Melissa & Erin

allow updating, overwriting or trimming of prep_attributes

Currently the prep_attributes() function just checks whether variableNames for a file have already been extracted and stops with an error if so. It would be good to offer options (interface sketch below) to:

  • overwrite any entries associated with a file.
  • append new rows for additional variableNames detected in a file.
  • trim any deprecated rows for variableNames not detected in a file anymore.
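
A hypothetical interface for this; the mode argument is a proposal sketch, not the current API:

# mode = "append":    add rows only for newly detected variableNames
# mode = "overwrite": replace all rows associated with the file
# mode = "trim":      additionally drop rows for variableNames no longer
#                     present in the file
prep_attributes(data_path, attributes_path, mode = "append")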

spec_tbl_df error

When I try to run the example in your README I get the following error after trying to run prep_attributes():

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
library(here)
#> here() starts at /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpmLF1Hr/reprex31138f9597b
library(dataspice)

create_spice()

data_files <- list.files(system.file("example-dataset/",
                                     package = "dataspice"),
                         pattern = ".csv",
                         full.names = TRUE)

attributes_path <- here::here("data", "metadata", "attributes.csv")

data_files %>% purrr::map(~prep_attributes(.x, attributes_path),
                          attributes_path = attributes_path)
#> The following variableNames have been added to the attributes file for fileName: BroodTables.csv
#> Stock.ID, Species, Stock, Ocean.Region, Region, Sub.Region, Jurisdiction, Lat, Lon, UseFlag, BroodYear, TotalEscapement, R0.1, R0.2, R0.3, R0.4, R0.5, R1.1, R1.2, R1.3, R1.4, R1.5, R2.1, R2.2, R2.3, R2.4, R3.1, R3.2, R3.3, R3.4, R4.1, R4.2, R4.3, TotalRecruits
#> 
#> Error: Can't combine `..1` <spec_tbl_df<
#>   fileName    : character
#>   variableName: character
#>   description : character
#>   unitText    : character
#> >> and `..2` <tbl_df<
#>   fileName    : character
#>   variableName: character
#>   description : character
#>   unitText    : character
#> >>.

Created on 2020-05-04 by the reprex package (v0.3.0)

Also, it doesn't seem to have actually written anything to attributes.csv like it claims.
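
A possible workaround sketch, not an official fix: the error comes from vctrs refusing to combine a readr spec_tbl_df with a plain tibble, so the fix likely belongs inside prep_attributes(); whether this matches the package internals is an assumption:

# Dropping the readr spec class right after reading should let vctrs
# combine the existing and new rows
attrs <- readr::read_csv(attributes_path)
attrs <- tibble::as_tibble(attrs)  # spec_tbl_df -> plain tbl_df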

simple function to let user "opt in" to linking id's

  • if they have funders: "want to link to FundRef IDs?"
  • if taxa are included in coverage: "want to auto-build higher taxa?"

("opting in" might be a funny way to put it if it's more of a "point users to a place where they can find their own id's for their own funders/taxa/etc if those are present in their dataset.")

Discussion: Can/should we be able to import a dataspiced dataset from the web/elsewhere?

This comes from a good question in my dataspice demo today: if user X authors a dataspice page for their dataset, and another scientist, Y, wants to use it, it'd be cool if Y could just run:

import_spice("https://amoeba.github.io/some-dataset")

and their computer downloaded something like some-dataset.zip, which had the dataspice.json and the files described in access.csv attached to it somehow.
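
Purely as a sketch of the idea; import_spice() is hypothetical, and this assumes the remote site serves a dataspice.json whose distribution entries carry contentUrl fields:

import_spice <- function(url, dest = basename(url)) {
  dir.create(dest, showWarnings = FALSE)
  meta <- jsonlite::fromJSON(paste0(url, "/dataspice.json"))
  # Download every file listed in the distribution's contentUrl column
  for (u in meta$distribution$contentUrl) {
    utils::download.file(u, file.path(dest, basename(u)), mode = "wb")
  }
  invisible(meta)
}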

shiny app for populating metadata files

I'm starting to work on a Shiny app that takes the generated metadata templates and makes it so a user who does not know R can populate them with the needed details.

Provide a function to export as EML XML?

A colleague wanted to convert their dataspice metadata to an EML 2.1.1 XML doc today, and we couldn't figure it out without a few lines of code:

json <- jsonlite::read_json("mydataspice.json")
eml <- emld::as_emld(json)
emld::as_xml(eml)

Do we have this already in dataspice and I just missed it? If not, would it be reasonable and useful to have a function to do this? We could include emld as a soft dependency and wrap the above code (or similar) in a single function.
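
A minimal wrapper sketch along those lines; the function name spice_to_eml() is an assumption, not an existing export, and emld is treated as a soft dependency:

spice_to_eml <- function(json_path, eml_path = "eml.xml") {
  if (!requireNamespace("emld", quietly = TRUE)) {
    stop("Install the 'emld' package to export EML.")
  }
  json <- jsonlite::read_json(json_path)
  eml <- emld::as_emld(json)
  emld::as_xml(eml, eml_path)
}

# spice_to_eml("mydataspice.json", "mydataspice.xml")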
