
We should talk... about dataspice (open issue, 14 comments)

ropensci commented on August 23, 2024
We should talk...

from dataspice.

Comments (14)

amoeba commented on August 23, 2024

Hey @ptsefton! I will be there to give a talk on packaging formats. I'd love to chat with you while there!


ptsefton commented on August 23, 2024

I've had a go at transforming dataspice metadata into DataCrate metadata to see what would happen. I wrote a quick script that converts the metadata to the DataCrate standard (flattened JSON-LD, with an inline @context).

The result is here:
https://data.research.uts.edu.au/examples/v1.0/dataspice-crate/CATALOG.html

I have included the files, but I didn't give them a user-friendly name and description.

In a DataCrate we'd usually try to get a DOI for the dataset, e.g. by pre-issuing one in Zenodo, and add URIs for the people.

Other examples.

TODO: Add a map display to Calcyte-generated HTML.


cboettig commented on August 23, 2024

@ptsefton yes, it looks to me like your format is effectively just the result of applying the JSON-LD flatten() algorithm to the dataspice example? (see playground example)

Like @amoeba says, I think going back to the original format can be done by just applying a frame.

Technically, I think most tools should be agnostic to the format; i.e., a tool designed to consume dataspice would probably always apply its desired frame or compaction to get the data into a canonical format anyway, which means it should be compatible with DataCrate data (or with anyone else using the schema.org/Dataset description). Likewise, Google's dataset search engine should be equally happy parsing both formats.

Also, it looks like you use https:// URLs; note that the canonical URLs for Schema.org are specified with http://. (You may need to use the canonical form for things like Google's Dataset Search to be able to recognize and parse this.)
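
To illustrate the URL point, here is a minimal Python sketch of rewriting https://schema.org prefixes in an inline @context to the http:// form mentioned above. The function name `canonicalise_schema_org` and the sample document are hypothetical, made up for illustration; this is not code from either project.

```python
import json

HTTPS_PREFIX = "https://schema.org"
HTTP_PREFIX = "http://schema.org"


def canonicalise_schema_org(value):
    """Recursively rewrite https://schema.org IRIs to the http:// form."""
    if isinstance(value, str):
        # Match the bare prefix or the prefix followed by "/", so that
        # unrelated hosts such as https://schema.organic.example are untouched.
        if value == HTTPS_PREFIX or value.startswith(HTTPS_PREFIX + "/"):
            return HTTP_PREFIX + value[len(HTTPS_PREFIX):]
        return value
    if isinstance(value, list):
        return [canonicalise_schema_org(v) for v in value]
    if isinstance(value, dict):
        return {k: canonicalise_schema_org(v) for k, v in value.items()}
    return value


# Hypothetical document with an inline @context (not real DataCrate output).
doc = {
    "@context": {
        "name": "https://schema.org/name",
        "Dataset": "https://schema.org/Dataset",
    },
    "@graph": [{"@id": "./", "@type": "Dataset", "name": "Example"}],
}

# Only the @context is rewritten; values in the data itself are left alone.
doc["@context"] = canonicalise_schema_org(doc["@context"])
print(json.dumps(doc["@context"], indent=2))
```

Restricting the rewrite to the @context avoids touching URLs that merely appear as data values.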


ptsefton commented on August 23, 2024

@cboettig Thanks Carl - I'll fix the URLs to be canonical in the spec.

@amoeba We specified the flattened format because it is MUCH easier for programmers to work with than a nested structure: they can build a simple index of the graph by @id and path, then traverse it without having to mess with JSON-LD processing as such. Your example is very simple, but when you have hundreds of files, with multiple creators, locations, etc., framing gets very hard (I started out designing DataCrate with framed, nested data, but it didn't scale and it was too hard to process).
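
As a rough sketch of the kind of index-and-traverse processing described above, assuming a flattened document with a top-level "@graph" list: the sample graph, the `resolve` helper, and the `#person1` identifier are all hypothetical, and only the @id index is shown (not the path index).

```python
# Hypothetical flattened graph; not taken from a real DataCrate.
dc = {
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Sample dataset",
            "creator": [{"@id": "#person1"}],
            "hasPart": [{"@id": "data/a.csv"}],
        },
        {"@id": "#person1", "@type": "Person", "name": "A. Researcher"},
        {"@id": "data/a.csv", "@type": "MediaObject", "name": "a.csv"},
    ]
}

# Index every node in the flat graph by its @id.
id_lookup = {item["@id"]: item for item in dc["@graph"]}


def resolve(node, prop):
    """Return the nodes referenced by node[prop], following @id links."""
    values = node.get(prop, [])
    if not isinstance(values, list):
        values = [values]
    resolved = []
    for value in values:
        if isinstance(value, dict) and "@id" in value:
            # Fall back to the bare reference if the target is not in the graph.
            resolved.append(id_lookup.get(value["@id"], value))
        else:
            resolved.append(value)
    return resolved


root = id_lookup["./"]
creator_names = [person["name"] for person in resolve(root, "creator")]
file_names = [part["name"] for part in resolve(root, "hasPart")]
print(creator_names, file_names)
```

Because every entity sits at the top level of the graph, a lookup is a single dictionary access however many files or creators the dataset has.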

Re the blank nodes, the goal is to get rid of a lot of them, though the ones for properties are unavoidable. For example, I think it's best if people have @ids, preferably ORCIDs (one of the people in your sample seems to have one, and the other doesn't but I didn't mess with your data and add them). Likewise for equipment, software etc; it's best if these have their own URLs. This is all covered in the spec. (I also need to make the HTML not display blank-node IDs as they're not helpful).


cboettig commented on August 23, 2024

@ptsefton that's a great point about scaling with nested vs flat format, and a compelling reason for the flat format.

I'd love to learn a bit more about how you're processing / traversing the data. E.g., do you query the JSON text directly (good ol' cat/sed/awk pipes)? Use jq or another JSON query language? Import into Virtuoso or another RDF database and do SPARQL? Treat the flat graph as a table object and query with standard SQL? Or query JSON files directly with SQL queries via Apache Drill or similar? (I've played around with a variety of these but haven't really settled on a strategy myself.)
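
For the "flat graph as a table" option, one possible sketch in Python with the standard-library sqlite3 module is below. It assumes the graph is loaded into a three-column table of (subject, predicate, object); the table layout, the sample graph, and the `to_rows` helper are illustrative assumptions, not something either project does.

```python
import sqlite3

# Hypothetical flattened graph.
graph = [
    {"@id": "./", "@type": "Dataset", "name": "Sample dataset",
     "creator": [{"@id": "#person1"}]},
    {"@id": "#person1", "@type": "Person", "name": "A. Researcher"},
]


def to_rows(nodes):
    """Turn each node property into a (subject, predicate, object) row."""
    rows = []
    for node in nodes:
        for prop, value in node.items():
            if prop == "@id":
                continue
            values = value if isinstance(value, list) else [value]
            for v in values:
                # References become the target @id; literals become text.
                obj = v["@id"] if isinstance(v, dict) and "@id" in v else str(v)
                rows.append((node["@id"], prop, obj))
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", to_rows(graph))

# Names of the creators of the root dataset, via a self-join on the table.
query = """
SELECT person.object
FROM triples AS link
JOIN triples AS person
  ON person.subject = link.object AND person.predicate = 'name'
WHERE link.subject = './' AND link.predicate = 'creator'
"""
names = [row[0] for row in conn.execute(query)]
print(names)
```

Each hop along a relationship costs one self-join, which is the usual trade-off of a triple table compared with a graph query language such as SPARQL.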


ptsefton commented on August 23, 2024

@cboettig Processing / traversing has so far mostly been limited to generating the HTML pages for DataCrate and parsing DataCrates for upload. Here's a little hack I did for a conference: a quick and dirty Python script to visualise provenance relationships in our sample file using PlantUML, which I happen to know. You can see in that linked script how simple it is to build an index so things can be looked up by @id. AFAIK there are no Python or JavaScript JSON-LD libraries that make this sort of processing easy.

id_lookup = {}
for item in dc["@graph"]:
    id_lookup[item["@id"]] = item

The result:

@startuml
"Peter Sefton" as Peter_Sefton
[CreateAction:\n Took dog picture] -up-> Peter_Sefton : agent
[CreateAction:\n Took dog picture] -down-> [ImageObject:\npics/2017-06-11 12.56.14.jpg]  : result
[CreateAction:\n Took dog picture] -> [Place:\nCatalina Park] : object
[CreateAction:\n Took dog picture] --down--> [IndividualProduct:\nEPL1 Camera] : instrument
[CreateAction:\n Took dog picture] --down--> [IndividualProduct:\nLumix G 20/F1.7 lens] : instrument
[CreateAction:\n Converted dog picture to sepia] -up-> Peter_Sefton : agent
[CreateAction:\n Converted dog picture to sepia] -down-> [ImageObject:\npics/sepia_fence.jpg]  : result
[CreateAction:\n Converted dog picture to sepia] -> [ImageObject:\npics/2017-06-11 12.56.14.jpg] : object
[CreateAction:\n Converted dog picture to sepia] --down--> [SoftwareApplication:\nImageMagick] : instrument
@enduml

When visualised in PlantUML:

image
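
The linked script itself is not reproduced in this thread, so the following is only a guess at the general approach: a minimal Python sketch that walks CreateAction nodes in a flat graph and emits PlantUML component lines in the same shape as the output above. The sample graph, the `label` and `plantuml_lines` helpers, and the fixed `-->` arrow style are assumptions for illustration.

```python
# Hypothetical flat graph, loosely modelled on the output shown above.
graph = [
    {"@id": "#photo1", "@type": "CreateAction", "name": "Took dog picture",
     "result": {"@id": "pics/dog.jpg"},
     "instrument": [{"@id": "#camera"}]},
    {"@id": "pics/dog.jpg", "@type": "ImageObject", "name": "pics/dog.jpg"},
    {"@id": "#camera", "@type": "IndividualProduct", "name": "EPL1 Camera"},
]


def label(node):
    """Build a PlantUML component label such as [Type:\\nName]."""
    # "\\n" emits a literal backslash-n, which PlantUML renders as a line break.
    return "[{}:\\n{}]".format(node["@type"], node["name"])


def plantuml_lines(nodes, props=("agent", "result", "object", "instrument")):
    """Emit one PlantUML arrow per provenance link on each CreateAction."""
    id_lookup = {item["@id"]: item for item in nodes}
    lines = ["@startuml"]
    for node in nodes:
        if node.get("@type") != "CreateAction":
            continue
        for prop in props:
            targets = node.get(prop, [])
            if not isinstance(targets, list):
                targets = [targets]
            for ref in targets:
                if not isinstance(ref, dict):
                    continue
                target = id_lookup.get(ref.get("@id"))
                if target is not None:
                    lines.append(
                        "{} --> {} : {}".format(label(node), label(target), prop)
                    )
    lines.append("@enduml")
    return lines


lines = plantuml_lines(graph)
print("\n".join(lines))
```

The direction hints in the original output (-up->, -down->) are layout choices; this sketch leaves arrow placement to PlantUML.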


cboettig commented on August 23, 2024

Hi Peter, thanks for getting in touch! It would be great to align efforts. I'm not sure that any of us will be in Amsterdam for the Research Object meeting, but we would be happy to follow up online at least.


amoeba commented on August 23, 2024

Wow, Calcyte is super similar to what we thought up. I'm taking a deep dive into this now!


ptsefton commented on August 23, 2024

@amoeba Calcyte documentation is really lacking; I'll try to get on to that ASAP, but meanwhile, if you want some help, let me know. I worked with @cameronneylon on his dataset via Google Drive, helping with the spreadsheet files. That's an option if you'd like to explore.


amoeba commented on August 23, 2024

Thanks, I'm keen to see how you structured your spreadsheets. With dataspice, we decided to start with three primary input methods to get the spreadsheets filled in:

  1. "Manually" (either edit in a text editor or the CSV/spreadsheet tool of your choice)
  2. In R, e.g. set_title("My title")
  3. Using a Shiny app (if you aren't familiar with Shiny, it lets us make a dynamic web page that can talk to an R session)

I think option 3 was the most promising because we still want to provide ample guidance to the user on how best to fill in the templates. At first glance, it looks like you might be solving some of this with how you designed your spreadsheets. I'll have to take a look!


ptsefton commented on August 23, 2024

@amoeba Just pushed an update to Calcyte to fix tests and a bug creating new spreadsheets.

The idea for the spreadsheets came from Mike Lake (@speleolinux) on a previous project; they're generated by the calcyte script and list all the files it finds. Spreadsheets do work, but they're not a great UI and don't really help much with all the relations we want to encourage users to put in so we can link people to the files they created, etc. (we also tried YAML and that got old very quickly). We have been looking at the best way to provide an interactive app for this and were thinking along the same lines: a web app installed locally, as well as a central one that can talk to data repositories via their APIs. We want to make it so the user can look up contextual metadata, such as people, places and equipment, using auto-complete and drop-downs. We have a project at UTS to build a service catalog/provisioning system for research apps and data storage; part of this will be an index of contextual entities, people, machines, etc.

We will definitely take a look at your Shiny thing ASAP.


ptsefton commented on August 23, 2024

Re the structure of Calcyte spreadsheets: we used xlsx because it supports multiple worksheets in one workbook, meaning you can have tables of people, places, equipment, etc.

(But Calcyte is not the only tool; we have people working on scripts to export from a range of data-management systems such as OMERO.)


amoeba commented on August 23, 2024

Thanks for the info. I think metadata authoring is both something a lot of groups have worked on and also something that's not quite "there yet". One of NCEAS' oldest and now nearly deprecated tools, Morpho (PDF), does (I think) a really nice job of helping users author metadata about datasets, export it locally, and upload it to a repository, and it can even produce a PDF and HTML representation of the metadata similar to dataspice and DataCrate. NCEAS is now working on a web-based package editor that speaks EML and OAI-ORE Resource Maps as a modern replacement for Morpho. These types of tools target users whose datasets are not necessarily part of a data management system.


amoeba commented on August 23, 2024

Very nice @ptsefton! Since dataspice metadata is just Schema.org-only JSON-LD, I imagine the translation was simple? I notice your JSON-LD structure is quite different from ours. I am new to JSON-LD, but it looks like you're using a named graph and lots of blank nodes in your serialization. IIRC you can convert (frame?) that format into a more compact one. Is there a reason the DataCrate CATALOG.json is formatted that way?
