ocdx-specification's Introduction

Global Data Exchange

Summary of Project

This project is a centralized system for storing and sharing data and scripts for researchers, such as data scientists and students.

Team Members

  1. Brandon Tomblinson
  2. Benjamin Brown
  3. Kienan DeLaney
  4. Daniel Darnold
  5. Chalermpon Thongmotai

Deployment

Server Setup

  1. Launch an Amazon Web Services EC2 instance with Amazon Linux as the operating system. The assigned security group should allow incoming and outgoing connections on ports 80, 22, and 3306.
  2. Connect to the instance using Amazon's instructions and the key pair that you assigned to the instance.

Installing using the script

  1. Copy the deployment.sh script from the deployment directory, paste it into the command line of your server, and run it. The script installs all components, clones the GitHub repository, and sets up the file system and database for the site.

ocdx-specification's People

Contributors

ahnjune, ajmillion, germonprez, kmschuster01, libbyh, megansquire, sgoggins, yuvipanda

ocdx-specification's Issues

So, Building a Front End for Manifest Submission

Do folks have a recommended technology for:
a) generating the JSON file from a basic input?
b) connecting data files for upload and/or reference?
c) connecting analysis scripts?
d) representing data provenance?
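
As one possible starting point for (a) through (d), here is a minimal Python sketch that assembles a manifest from basic input and serializes it to JSON. The function names and every field name (title, dataFiles, scripts, provenance) are illustrative assumptions, not taken from the OCDX schema.

```python
import json


def build_manifest(title, data_files, scripts, provenance_notes):
    """Assemble a manifest dictionary from basic form-style input.

    All keys below are hypothetical placeholders, not OCDX field names.
    """
    return {
        "title": title,
        # (b) data files referenced by path or URL
        "dataFiles": [{"path": p} for p in data_files],
        # (c) analysis scripts referenced by path or URL
        "scripts": [{"path": p} for p in scripts],
        # (d) provenance kept as a simple ordered list of notes
        "provenance": list(provenance_notes),
    }


def manifest_to_json(manifest):
    """(a) Serialize the manifest to a JSON string."""
    return json.dumps(manifest, indent=2, sort_keys=True)


example = build_manifest(
    "Example data set",
    ["data/example.csv"],
    ["scripts/clean.py"],
    ["collected from a public source", "cleaned with scripts/clean.py"],
)
print(manifest_to_json(example))
```

A web form or notebook widget could collect the same inputs and call build_manifest; the choice of front-end technology is left open here.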

Optional Specification of Database Connection String for data set

There are cases where the data set may be a database. Do we want live databases to be part of what is kept in the manifest? I think we do, even if only as an optional parameter. This will support the full lifecycle of people working with live datasets through the emerging JupyterHub/Quarry/Wikibase architecture.
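
A minimal Python sketch of what an optional parameter could look like, assuming the manifest is handled as a dictionary. The key name databaseConnection and the example connection string are hypothetical, not part of the current specification.

```python
def add_database_source(manifest, connection_string=None):
    """Return a copy of the manifest, adding the connection string only if given.

    "databaseConnection" is a hypothetical key name used for illustration.
    """
    updated = dict(manifest)
    if connection_string is not None:
        updated["databaseConnection"] = connection_string
    return updated


# A file-based data set simply omits the field.
file_based = add_database_source({"title": "Example data set"})

# A live database adds it (placeholder host and database name).
live = add_database_source(
    {"title": "Example live data set"},
    "postgresql://db.example.org/exampledb",
)
```

Because the key is absent rather than empty when no database is involved, existing manifests would stay valid under this approach.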

license

Hello,

If the dataset is to be distributed, it has to be distributed by an authorized organization or person. So the metadata needs a copyright statement, and consequently a license would be needed, I think.

Minimal and Maximal Sets for Privacy and Ethics

Working with @moduloone and @katieshilton on this part of the OCDX. Correspondence and document attached.

Hi all,

I've attached a document that I hope can serve as a starting point for the max-set for the privacyEthics() portion of the manifest.

Essentially, I'd like us to consider including two additional fields (highlighted in yellow in the document):

oversightProvenance(): A field that can contain IRB approval numbers. (Should be used in conjunction with the oversight() field).
tosCompliance(): Includes three sub-fields: complianceAssertion(), tosVersionInformation(), and tosArchive(). The idea is to give a researcher a way to indicate that yes, their collection was compliant at the time of data collection (or, in a rare case, that no, it was not, but we collected it anyway), to provide information on which version of a TOS the data was collected under, and to include a pointer to an archived copy of the TOS if available.

Cardinality info should be in the attachment.
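
To make the proposed shape concrete, here is a minimal Python sketch of the two additional fields inside privacyEthics(). The field names come from the proposal above; the value types, the choice of which fields are optional, and the example values are assumptions, since the cardinality information lives in the attachment.

```python
def make_tos_compliance(compliance_assertion, tos_version_information=None, tos_archive=None):
    """Build the tosCompliance() entry with its three sub-fields."""
    return {
        # True if collection complied with the TOS at collection time (assumed boolean).
        "complianceAssertion": bool(compliance_assertion),
        # Which version of the TOS applied (assumed free text).
        "tosVersionInformation": tos_version_information,
        # Pointer to an archived copy of the TOS, if available (assumed URL string).
        "tosArchive": tos_archive,
    }


def make_privacy_ethics(oversight, oversight_provenance=None, tos_compliance=None):
    """Build a privacyEthics() entry; the two new fields are treated as optional here."""
    entry = {"oversight": oversight}
    if oversight_provenance is not None:
        # Assumed to be a list so that several approval numbers can be recorded.
        entry["oversightProvenance"] = list(oversight_provenance)
    if tos_compliance is not None:
        entry["tosCompliance"] = tos_compliance
    return entry


privacy_ethics = make_privacy_ethics(
    oversight="IRB",
    oversight_provenance=["PLACEHOLDER-APPROVAL-NUMBER"],
    tos_compliance=make_tos_compliance(True, "placeholder version", None),
)
```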

Finally, while I think this is a good starting point for the max set, I'm hoping that we may have the opportunity to actually talk to some of the early OCDX manifest users (or manifest creators) in order to better understand whether or not these privacy and ethics fields are meeting their needs, so we can revise as appropriate.

Thoughts?

Create a manifest key generator and repository

A generator for manifest primary keys (an identifier for each data set) and a repository for manifests should be developed. People should have the option of storing data sets at the same time, and this should be connected to the developed JupyterHub infrastructure.

The API should be built to provide important fields like:
Manifest Version
ID
... etc (See Manifest)
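
A minimal Python sketch of the idea, assuming an in-memory store and random UUIDs as primary keys. The class name, method names, and record keys are hypothetical; a real repository would need persistent storage and an HTTP API in front of it.

```python
import uuid


class ManifestRepository:
    """In-memory stand-in for a manifest repository (illustration only)."""

    def __init__(self):
        self._records = {}

    def register(self, manifest, manifest_version):
        """Generate a new primary key, store the manifest, and return the key."""
        manifest_id = str(uuid.uuid4())
        self._records[manifest_id] = {
            "id": manifest_id,
            "manifestVersion": manifest_version,
            "manifest": manifest,
        }
        return manifest_id

    def get(self, manifest_id):
        """Return the stored record, including the ID and manifest version."""
        return self._records[manifest_id]


repo = ManifestRepository()
key = repo.register({"title": "Example data set"}, "placeholder-version")
record = repo.get(key)
```

Random UUIDs avoid the need for a central counter, at the cost of keys that carry no meaning; a sequential or namespaced scheme would be an equally reasonable choice.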

Several minor comments

A-5 duplicates A-3
Is C the community producing the data or the community the data are about? I guess the latter, but it could be clarified.
It might be useful to separate Ethics from IRB compliance. For the latter, using the definitions from the IRB for things like human subjects would be good. Of course, that would be only a US view; I don't know the other rules.
DS-2: what about datasets that aren't published?
Minor typo: D3-6 should be DS-6
DS-6 suggests the need for a taxonomy of processing levels like http://uregina.ca/piwowarj/Think/ProcessingLevels.html
Optional DS-4 and DS-5 reuse the numbers
Optional DS-5 is the start of a bigger category of Provenance. That could even include pointers to the scripts used, the settings, etc.

Update Use Cases to make the "minimal schema" clear.

The minimal use case should be clearly stated, with regard to what must be captured. Then subsequent use cases should be described as the minimum use case "Plus".

The complete set of minimals needs to be parsed out and matched up to the current minimal schema.

Privacy and ethics, for example, needs to be added to the minimal use case. This includes terms of service.
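
One way to make the minimal schema explicit is a check that reports which required fields a manifest lacks. The Python sketch below assumes a dictionary-based manifest; the set of required field names is a hypothetical placeholder, since the actual minimal set is what this issue asks to define.

```python
# Hypothetical minimal field set; the real list must come from the use cases.
MINIMAL_FIELDS = {"id", "manifestVersion", "privacyEthics"}


def missing_minimal_fields(manifest, required=MINIMAL_FIELDS):
    """Return the required field names that are absent from the manifest, sorted."""
    return sorted(name for name in required if name not in manifest)


def meets_minimal_schema(manifest, required=MINIMAL_FIELDS):
    """True when every required field is present."""
    return not missing_minimal_fields(manifest, required)


incomplete = {"id": "placeholder-id", "manifestVersion": "placeholder-version"}
print(missing_minimal_fields(incomplete))
```

A "Plus" use case could then be expressed as the minimal set extended with its additional fields and passed in through the required parameter.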
