Comments (14)
Hey @ptsefton! I will be there to give a talk on packaging formats. I'd love to chat with you while there!
from dataspice.
I've had a go at transforming Dataspice metadata into DataCrate metadata to see what would happen. I wrote a quick script to transform the metadata to the DataCrate standard (flattened JSON-LD, with an inline @context).
The result is here:
https://data.research.uts.edu.au/examples/v1.0/dataspice-crate/CATALOG.html
I have included the files, but I didn't give them a user-friendly name and description.
In a DataCrate we'd usually try to get a DOI for the dataset, e.g. by pre-issuing one in Zenodo, and add URIs for the people.
TODO: Add a map display to Calcyte-generated HTML.
from dataspice.
@ptsefton yes, looks to me like your format is effectively just applying the flatten() algorithm to the Dataspice example? (see playground example)
Like @amoeba says, I think going back to the original format can be done by just applying a frame.
Technically I think most tools should be agnostic to the format, i.e. a tool designed to consume dataspice would probably always apply its desired frame or compaction to get the data into a canonical format anyway, which means it should be compatible with DataCrate data (or anyone else using the schema.org/Dataset description). (Likewise Google's dataset search engine should be equally happy parsing both formats.)
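To make the flatten/frame round trip above concrete, here is a minimal sketch in plain Python. It is not the JSON-LD flattening or framing algorithm (no @context processing, no handling of @value objects), just the same idea: pull every nested node out into an @graph keyed by @id, then re-embed from a chosen root. The function names and the example.org identifier are made up for illustration.

```python
import itertools


def naive_flatten(doc):
    """Move every nested node into a flat @graph, leaving {"@id": ...} references."""
    graph = {}
    blank_ids = itertools.count()

    def visit(value):
        if isinstance(value, list):
            return [visit(v) for v in value]
        if isinstance(value, dict):
            # Nodes without an @id get a generated blank-node id.
            node_id = value.get("@id") or f"_:b{next(blank_ids)}"
            node = graph.setdefault(node_id, {"@id": node_id})
            for key, child in value.items():
                if key != "@id":
                    node[key] = visit(child)
            return {"@id": node_id}
        return value

    visit({k: v for k, v in doc.items() if k != "@context"})
    flat = {"@graph": list(graph.values())}
    if "@context" in doc:
        flat["@context"] = doc["@context"]
    return flat


def naive_embed(flat, root_id):
    """Re-nest a flat graph starting from root_id (a very rough stand-in for framing)."""
    index = {node["@id"]: node for node in flat["@graph"]}

    def expand(value, seen):
        if isinstance(value, list):
            return [expand(v, seen) for v in value]
        if isinstance(value, dict) and set(value) == {"@id"} and value["@id"] in index:
            if value["@id"] in seen:
                return value  # stop at cycles and keep the bare reference
            node = index[value["@id"]]
            return {k: expand(v, seen | {node["@id"]}) for k, v in node.items()}
        return value

    return expand({"@id": root_id}, frozenset())


# Hypothetical nested description, loosely in the shape of a schema.org Dataset.
nested = {
    "@id": "./",
    "@type": "Dataset",
    "name": "Example dataset",
    "creator": {
        "@id": "http://example.org/people/alice",
        "@type": "Person",
        "name": "Alice Example",
    },
}

flat = naive_flatten(nested)
round_trip = naive_embed(flat, "./")
```

With real data you would want a proper JSON-LD processor for this; the sketch only shows why the two shapes carry the same information.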
Also, it looks like you use https:// URLs; note that the canonical URLs for Schema.org are specified in http://. (You may need to use the canonical form for things like Google's Dataset Search to be able to recognize and parse this.)
from dataspice.
@cboettig Thanks Carl - I'll fix the URLs to be canonical in the spec.
@amoeba We specified the flattened format because it is MUCH easier for programmers to work with than a nested structure - they can build a simple index of the graph by @id and path, then traverse it without having to mess with JSON-LD processing as such. Your example is very simple, but when you have hundreds of files, with multiple creators, locations etc., framing gets very hard (I started out designing DataCrate with framed, nested data but it didn't scale and it was too hard to process).
Re the blank nodes, the goal is to get rid of a lot of them, though the ones for properties are unavoidable. For example, I think it's best if people have @ids, preferably ORCIDs (one of the people in your sample seems to have one, and the other doesn't but I didn't mess with your data and add them). Likewise for equipment, software etc; it's best if these have their own URLs. This is all covered in the spec. (I also need to make the HTML not display blank-node IDs as they're not helpful).
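As a rough illustration of the "index by @id and path, then traverse" approach described above, here is a minimal sketch. The sample graph, the example.org identifier, and the helper names are invented for the example, and it assumes each file node carries a single string "path" property.

```python
# Hypothetical flattened catalog; real DataCrate files will have more properties.
catalog = {
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "path": "./",
            "name": "Sample crate",
            "hasPart": [{"@id": "pics/dog.jpg"}],
        },
        {
            "@id": "pics/dog.jpg",
            "@type": "ImageObject",
            "path": "pics/dog.jpg",
            "creator": {"@id": "http://example.org/people/1"},
        },
        {"@id": "http://example.org/people/1", "@type": "Person", "name": "Example Person"},
    ]
}


def build_indexes(doc):
    """Index every node by @id, and nodes that have a path by that path."""
    by_id, by_path = {}, {}
    for item in doc["@graph"]:
        by_id[item["@id"]] = item
        if "path" in item:
            by_path[item["path"]] = item
    return by_id, by_path


def as_list(value):
    """JSON-LD properties may hold one value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def resolve(by_id, item, prop):
    """Follow the {"@id": ...} references in item[prop] to the nodes they point at."""
    return [
        by_id[ref["@id"]]
        for ref in as_list(item.get(prop))
        if isinstance(ref, dict) and ref.get("@id") in by_id
    ]


by_id, by_path = build_indexes(catalog)
creator_names = [p["name"] for p in resolve(by_id, by_path["pics/dog.jpg"], "creator")]
```

Every lookup is a plain dictionary access, which is the appeal of the flat layout when there are hundreds of files.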
from dataspice.
@ptsefton that's a great point about scaling with nested vs flat format, and a compelling reason for the flat format.
I'd love to learn a bit more about how you're processing / traversing the data, e.g. do you query the JSON text directly (good ol' cat/sed/awk pipes)? Use jq or another JSON query language? Import into Virtuoso or another RDF database and do SPARQL? Treat the flat graph as a table object and query with standard SQL? Or query JSON files directly with SQL queries via Apache Drill or similar? (I've played around with a variety of these but haven't really settled on a strategy myself.)
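To make one of those options concrete, here is a minimal sketch of the "flat graph as a table" idea using Python's built-in sqlite3. The subject/predicate/object table layout and the sample nodes are assumptions for illustration, not anything either project prescribes.

```python
import sqlite3

# Hypothetical flattened graph.
graph = [
    {"@id": "pics/dog.jpg", "@type": "ImageObject",
     "creator": {"@id": "http://example.org/people/1"}},
    {"@id": "http://example.org/people/1", "@type": "Person", "name": "Example Person"},
]


def load_triples(nodes):
    """Load each (node, property, value) into an in-memory triples table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    for node in nodes:
        for key, value in node.items():
            if key == "@id":
                continue
            for v in value if isinstance(value, list) else [value]:
                # References are stored by their @id, literals as text.
                obj = v["@id"] if isinstance(v, dict) and "@id" in v else str(v)
                conn.execute(
                    "INSERT INTO triples VALUES (?, ?, ?)", (node["@id"], key, obj)
                )
    return conn


conn = load_triples(graph)

# Who created each file? Join the creator reference to that node's name.
rows = conn.execute(
    """
    SELECT f.subject, n.object
    FROM triples AS f
    JOIN triples AS n ON n.subject = f.object AND n.predicate = 'name'
    WHERE f.predicate = 'creator'
    ORDER BY f.subject
    """
).fetchall()
```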
from dataspice.
@cboettig Processing / traversing has so far mostly been limited to generating the HTML pages for DataCrate and parsing DataCrates for upload. Here's a little hack I did for a conference - a quick and dirty Python script to visualise provenance relationships in our sample file using PlantUML, which I happen to know. You can see in that linked script how simple it is to build an index so things can be looked up by @id. AFAIK there are no Python or JavaScript JSON-LD libraries that make this sort of processing easy.
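# dc is the parsed DataCrate JSON-LD document (a dict with an "@graph" list)
id_lookup = {}
for item in dc["@graph"]:
    # Index every node in the flattened graph by its @id
    id_lookup[item["@id"]] = item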
The result:
@startuml
"Peter Sefton" as Peter_Sefton
[CreateAction:\n Took dog picture] -up-> Peter_Sefton : agent
[CreateAction:\n Took dog picture] -down-> [ImageObject:\npics/2017-06-11 12.56.14.jpg] : result
[CreateAction:\n Took dog picture] -> [Place:\nCatalina Park] : object
[CreateAction:\n Took dog picture] --down--> [IndividualProduct:\nEPL1 Camera] : instrument
[CreateAction:\n Took dog picture] --down--> [IndividualProduct:\nLumix G 20/F1.7 lens] : instrument
[CreateAction:\n Converted dog picture to sepia] -up-> Peter_Sefton : agent
[CreateAction:\n Converted dog picture to sepia] -down-> [ImageObject:\npics/sepia_fence.jpg] : result
[CreateAction:\n Converted dog picture to sepia] -> [ImageObject:\npics/2017-06-11 12.56.14.jpg] : object
[CreateAction:\n Converted dog picture to sepia] --down--> [SoftwareApplication:\nImageMagick] : instrument
@enduml
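For readers who don't follow the link, output along these lines can be produced from the @id index with very little code. The sketch below is not the linked script: the sample nodes and their "#..." ids are made up, every arrow is a plain `-->`, and each node is assumed to have a single string @type.

```python
PROVENANCE_PROPS = ("agent", "result", "object", "instrument")


def label(node):
    """Render a node as a PlantUML component: [Type:\\nName]."""
    name = node.get("name", node["@id"])
    return f"[{node['@type']}:\\n{name}]"


def to_plantuml(doc):
    """Emit one arrow per provenance property of each CreateAction."""
    id_lookup = {item["@id"]: item for item in doc["@graph"]}
    lines = ["@startuml"]
    for item in doc["@graph"]:
        if item.get("@type") != "CreateAction":
            continue
        for prop in PROVENANCE_PROPS:
            refs = item.get(prop, [])
            if not isinstance(refs, list):
                refs = [refs]
            for ref in refs:
                target = id_lookup.get(ref.get("@id")) if isinstance(ref, dict) else None
                if target is not None:
                    lines.append(f"{label(item)} --> {label(target)} : {prop}")
    lines.append("@enduml")
    return "\n".join(lines)


# Hypothetical sample modelled on the diagram above.
sample = {
    "@graph": [
        {
            "@id": "#photo1",
            "@type": "CreateAction",
            "name": "Took dog picture",
            "agent": {"@id": "#peter"},
            "object": {"@id": "#park"},
        },
        {"@id": "#peter", "@type": "Person", "name": "Peter Sefton"},
        {"@id": "#park", "@type": "Place", "name": "Catalina Park"},
    ]
}

diagram = to_plantuml(sample)
print(diagram)
```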
from dataspice.
Hi Peter, thanks for getting in touch! It would be great to align efforts. Not sure that any of us will be in Amsterdam for the Research Object meeting, but we would be happy to follow up online at least.
from dataspice.
Wow, Calcyte is super similar to what we thought up. I'm taking a deep dive into this now!
from dataspice.
@amoeba Calcyte documentation is really lacking; I'll try to get on to that ASAP, but meanwhile if you want some help, let me know. I worked with @cameronneylon on his dataset via Google Drive - helping with the spreadsheet files - that's an option if you'd like to explore.
from dataspice.
Thanks, I'm keen to see how you structured your spreadsheets. With dataspice, we decided to start with three primary input methods to get the spreadsheets filled in:
- "Manually" (either edit in text editor or CSV/Spreadsheet tool of choice)
- In R, e.g. set_title("My title")
- Using a Shiny app (if you aren't familiar with Shiny, it lets us make a dynamic web page that can talk to an R session)
I think option 3 was the most promising because we still want to provide ample guidance to the user on how best to fill in the templates. At first glance, it looks like you might be solving some of this with how you designed your spreadsheets. I'll have to take a look!
from dataspice.
@amoeba Just pushed an update to Calcyte to fix tests and a bug creating new spreadsheets.
The idea for the spreadsheets came from Mike Lake (@speleolinux) on a previous project; they're generated by the calcyte script and list all the files it finds. Spreadsheets do work but they're not a great UI, and don't really help much with all the relations we want to encourage users to put in so we can link people to the files they created, etc (also tried YAML and that got old very quickly). We have been looking at the best way to provide an interactive app for this and were thinking along the same lines; a web app installed locally, as well as a central one that can talk to data-repositories via their APIs. We want to make it so the user can look up contextual metadata, such as people and places and equipment, using auto-complete and drop downs. We have a project at UTS to build a service catalog/provisioning system for research apps and data storage - part of this will be an index of contextual entities, people, machines etc.
We will definitely take a look at your Shiny thing ASAP.
from dataspice.
Re the structure of Calcyte spreadsheets - we used xlsx because it supports multiple worksheets in one workbook, meaning you can have tables of people, places, equipment etc.
(But Calcyte is not the only tool; we have people working on scripts to export from a range of data-management systems such as OMERO.)
from dataspice.
Thanks for the info. I think metadata authoring is both something a lot of groups have worked on and also something that's not quite "there yet". One of NCEAS' oldest and now nearly-deprecated tools, Morpho (PDF), does (I think) a really nice job of helping author metadata about datasets, export locally, upload to a repository, and can even produce a PDF and HTML representation of the metadata similar to dataspice and DataCrate. NCEAS is now working on a web-based package editor that speaks EML and OAI-ORE Resource Maps as a modern replacement for Morpho. These types of tools target users whose datasets are not necessarily part of a data management system.
from dataspice.
Very nice @ptsefton! Since dataspice metadata is just Schema.org-only JSON-LD, I imagine the translation was simple? I notice your JSON-LD structure is quite different from ours. I am new to JSON-LD but it looks like you're using a named graph and lots of blank nodes in your serialization. IIRC you can convert (frame?) that format into a more compact one. Is there a reason the DataCrate CATALOG.json is formatted that way?
from dataspice.