ocdx / ocdx-specification

Specification to describe the minimum information standard for online community data. Guidelines for describing data about online communities.

ocdx-specification's Introduction

Global Data Exchange

Summary of Project

This project is a centralized system for storing and sharing data and scripts among researchers, such as data scientists and students.

Team Members

Brandon Tomblinson
Benjamin Brown
Kienan DeLaney
Daniel Darnold
Chalermpon Thongmotai

Deployment

Server Setup

Launch an Amazon Web Services EC2 instance with Amazon Linux as the operating system. The assigned security group should allow incoming and outgoing connections on ports 80, 22, and 3306.
Connect to the instance using Amazon's instructions and the key pair that you assigned to the instance.

Installing using the script

Copy the deployment.sh script from the deployment directory, paste it into the command line of your server, and run it. The script installs all components, clones the GitHub repository, and sets up the file system and database for the site.

ocdx-specification's People

megansquire brittehcheng halfak sgoggins kmschuster01 sociallycompute anikarenina moduloone germonprez dek8v5

ocdx-specification's Issues

Example manifest for FLOSSmole

So, Building a Front End for Manifest Submission

Do folks have a recommended technology for:
a) Generating the JSON file from a basic input?
b) Connecting data files for upload and/or reference?
c) Connecting analysis scripts?
d) Representing data provenance?

Move input schemas for data sets into Wikibase

As stated.

implement dataset search interface

Possibly using the Wikibase infrastructure.

Optional Specification of Database Connection String for data set

There are cases where the data set may be a database. Do we want live databases to be part of what is kept in the manifest? I think we do, even if only as an optional parameter. This will support the full lifecycle of people working with live datasets through the emerging JupyterHub/Quarry/Wikibase architecture.
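
One way to picture the optional parameter is sketched below. This is only an illustration under stated assumptions: the `Distribution` class, its field names (`title`, `media_type`, `url`, `connection_string`), and the example hosts are hypothetical and are not taken from the OCDX manifest itself.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Distribution:
    """Hypothetical manifest entry describing where a data set lives."""
    title: str
    media_type: str
    url: Optional[str] = None                # used by file-based data sets
    connection_string: Optional[str] = None  # optional: used by live databases

    def to_manifest(self) -> dict:
        # Drop unset optional fields so file-only entries carry no database key.
        return {k: v for k, v in asdict(self).items() if v is not None}


# Placeholder values; neither host exists.
file_dist = Distribution("Edit history dump", "text/csv",
                         url="https://example.org/edits.csv")
db_dist = Distribution("Edit history (live)", "application/sql",
                       connection_string="mysql://replica.example.org:3306/edits")

print(json.dumps([file_dist.to_manifest(), db_dist.to_manifest()], indent=2))
```

Because the field is omitted when unset, manifests for file-based data sets would look the same as they do without the parameter. A connection string stored this way should not embed credentials.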

Propose a GROUP workshop

For Sean.

UI for humans to generate manifests

Can start with Jeremy Dorn's JSON editor

Complete titles and descriptions in the JSON schema

Should probably come off the markdown file that was created with human readable documentation.

incorporate binder into the architecture for data sharing

for using scripts and data together

license

Hello,

If the dataset is to be distributed, it has to be distributed by an authorized organization or person. So we need a copyright in the metadata; consequently, a licence would be needed, I think.

Figure out how to describe/point to DB-based dataset

For data sets accessed via database connections, we need to make sure Distributions covers that scenario. E.g., what's the best way to indicate a query or tables within a larger database?

Example of manifest for Wikipedia data

Minimal and Maximal Sets for Privacy and Ethics

Working with @moduloone and @katieshilton on this part of the OCDX. Correspondence and document attached.

Hi all,

I've attached a document that I hope can serve as a starting point for the max-set for the privacyEthics() portion of the manifest.

Essentially, I'd like us to consider including two additional fields (highlighted in yellow in the document):

oversightProvenance(): A field that can contain IRB approval numbers. (Should be used in conjunction with the oversight() field).
tosCompliance(): Includes three sub-fields: complianceAssertion(), tosVersionInformation(), and tosArchive(). The idea behind this is to create a way for a researcher to indicate that yes, their collection was compliant at the time of data collection (or, in a rare case, that no, it was not, but we collected it anyway), provide info on what version of a TOS the data was collected under, and include a pointer to an archived copy of the TOS if available.

Cardinality info should be in the attachment.

Finally, while I think this is a good starting point for the max set, I'm hoping that we may have the opportunity to actually talk to some of the early OCDX manifest users (or manifest creators) in order to better understand whether or not these privacy and ethics fields are meeting their needs, so we can revise as appropriate.

Thoughts?

Nick
OCDX maxset.docx
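
For discussion, the two proposed fields might be laid out roughly as follows. The field names come from the message above; the value types (a list of strings, a boolean, plain strings), the `build_privacy_ethics` helper, and all example values are assumptions, since the cardinality details live in the attachment and are not reproduced here.

```python
import json


def build_privacy_ethics(oversight, irb_numbers, compliant,
                         tos_version, tos_archive_url=None):
    """Assemble a hypothetical privacyEthics section of a manifest."""
    tos = {
        "complianceAssertion": compliant,      # True if compliant at collection time
        "tosVersionInformation": tos_version,  # which TOS version applied
    }
    if tos_archive_url is not None:
        tos["tosArchive"] = tos_archive_url    # pointer to an archived copy, if any
    return {
        "oversight": oversight,
        "oversightProvenance": list(irb_numbers),  # e.g. IRB approval numbers
        "tosCompliance": tos,
    }


# Placeholder values for illustration only.
section = build_privacy_ethics(
    oversight="IRB",
    irb_numbers=["IRB-0000-000"],
    compliant=True,
    tos_version="version in effect at collection time",
    tos_archive_url="https://example.org/tos-archive",
)
print(json.dumps({"privacyEthics": section}, indent=2))
```

Leaving `tosArchive` out when no archived copy exists matches the "if available" wording of the proposal.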

Create a manifest key generator and repository

A generator for manifest primary keys (an identifier for each data set) and a repository for manifests should be developed. People should have the option of storing data sets at the same time, and this should be connected to the JupyterHub infrastructure that has been developed.

The API should be built to provide important fields like:
Manifest Version
ID
... etc (See Manifest)
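
A minimal sketch of the idea, assuming random UUIDs as primary keys and an in-memory store standing in for the repository. The `ManifestRepository` class, the `manifestVersion` and `id` key names, and the default version string are hypothetical; a real service would persist manifests and expose them through an API.

```python
import uuid


class ManifestRepository:
    """Hypothetical in-memory repository that issues manifest keys."""

    def __init__(self):
        self._manifests = {}

    def register(self, manifest: dict, version: str = "0.1") -> str:
        # Generate a random UUID as the primary key for this data set.
        manifest_id = str(uuid.uuid4())
        # Generated fields are written last so they cannot be overridden.
        self._manifests[manifest_id] = {
            **manifest,
            "manifestVersion": version,
            "id": manifest_id,
        }
        return manifest_id

    def get(self, manifest_id: str) -> dict:
        return self._manifests[manifest_id]


repo = ManifestRepository()
key = repo.register({"title": "Example data set"})
print(repo.get(key))
```

Random UUIDs avoid the need for a central counter, at the cost of keys that carry no meaning on their own.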

Add examples/instructions to manifest_commented.json

The info to do so lives in those spreadsheets from the OCDF Omaha meeting.

Example manifest for Twitter data

convert existing example to new manifest version

Several minor comments

A-5 duplicates A-3
Is C the community producing the data or the community the data are about? I guess the latter, but it could be clarified.
It might be useful to separate Ethics from IRB compliance. For the latter, using the definitions from the IRB for things like human subjects would be good. Of course, that would be only a US view; I don't know the other rules.
DS-2: what about datasets that aren't published?
Minor typo: D3-6 should be DS-6
DS-6 suggests the need for a taxonomy of processing levels like http://uregina.ca/piwowarj/Think/ProcessingLevels.html
Optional DS-4 and DS-5 reuse the numbers
Optional DS-5 is the start of a bigger category of Provenance. That could even include pointers to the scripts used, the settings, etc.

Update Use Cases to make the "minimal schema" clear.

The minimal use case should be clearly stated, with regard to what must be captured. Then subsequent use cases should be described as the minimal use case "Plus".

The complete set of minimals needs to be parsed out and matched up to the current minimal schema.

Privacy and ethics, for example, need to be added to the minimal use case. This includes terms of service.
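
One way to make the minimal set explicit is a check that reports which required fields a manifest lacks. The sketch below assumes a hypothetical field list: only the privacy and ethics requirement comes from this issue, and `title`, `description`, and the `privacyEthics` key name are placeholders for whatever the minimal schema ends up containing.

```python
# Hypothetical minimal set; replace with the fields the minimal schema defines.
MINIMAL_FIELDS = ("title", "description", "privacyEthics")

# Values treated as "not filled in".
EMPTY_VALUES = (None, "", [], {})


def missing_minimal_fields(manifest, required=MINIMAL_FIELDS):
    """Return the required field names that are absent or empty."""
    return [
        name for name in required
        if name not in manifest or manifest[name] in EMPTY_VALUES
    ]


draft = {"title": "Example data set", "description": "Placeholder description"}
print(missing_minimal_fields(draft))
```

A "Plus" use case could then be expressed by passing a longer `required` tuple to the same function.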