
datalad-dataverse's Introduction

 ____            _             _                   _
|  _ \    __ _  | |_    __ _  | |       __ _    __| |
| | | |  / _` | | __|  / _` | | |      / _` |  / _` |
| |_| | | (_| | | |_  | (_| | | |___  | (_| | | (_| |
|____/   \__,_|  \__|  \__,_| |_____|  \__,_|  \__,_|
                                              Read me


Distribution

Packages are available via Anaconda, Arch (AUR), Debian Stable, Debian Unstable, Fedora Rawhide, Gentoo (::science), and PyPI.

10000-ft. overview

DataLad's purpose is to make data management and data distribution more accessible. To do so, it stands on the shoulders of Git and Git-annex to deliver a decentralized system for data exchange. This includes automated ingestion of data from online portals and exposing it in readily usable form as Git(-annex) repositories - or datasets. However, the actual data storage and permission management remains with the original data provider(s).

The full documentation is available at http://docs.datalad.org and http://handbook.datalad.org provides a hands-on crash-course on DataLad.

Extensions

A number of extensions are available that provide additional functionality for DataLad. Extensions are separate packages that are to be installed in addition to DataLad. In order to install DataLad customized for a particular domain, one can simply install an extension directly, and DataLad itself will be automatically installed with it. An annotated list of extensions is available in the DataLad handbook.

Support

The documentation for this project is found here: http://docs.datalad.org

All bugs, concerns, and enhancement requests for this software can be submitted here: https://github.com/datalad/datalad/issues

If you have a problem or would like to ask a question about how to use DataLad, please submit a question to NeuroStars.org with a datalad tag. NeuroStars.org is a platform similar to StackOverflow but dedicated to neuroinformatics.

All previous DataLad questions are available here: http://neurostars.org/tags/datalad/

Installation

Debian-based systems

On Debian-based systems, we recommend enabling NeuroDebian, via which we provide recent releases of DataLad. Once enabled, just do:

apt-get install datalad

Gentoo-based systems

On Gentoo-based systems (i.e. all systems whose package manager can parse ebuilds as per the Package Manager Specification), we recommend enabling the ::science overlay, via which we provide recent releases of DataLad. Once enabled, just run:

emerge datalad

Other Linuxes via conda

conda install -c conda-forge datalad

will install the most recently released version, and release candidates are available via

conda install -c conda-forge/label/rc datalad

Other Linuxes, macOS via pip

Before you install this package, please make sure that you install a recent version of git-annex. Afterwards, install the latest version of datalad from PyPI. It is recommended to use a dedicated virtualenv:

# Create and enter a new virtual environment (optional)
virtualenv --python=python3 ~/env/datalad
. ~/env/datalad/bin/activate

# Install from PyPI
pip install datalad

By default, installation via pip installs the core functionality of DataLad, allowing for managing datasets etc. Additional installation schemes are available, so you can request enhanced installation via pip install datalad[SCHEME], where SCHEME could be:

  • tests to also install dependencies used by DataLad's battery of unit tests
  • full to install all dependencies.

More details on installation and initial configuration can be found in the DataLad Handbook: Installation.

License

MIT/Expat

Contributing

See CONTRIBUTING.md if you are interested in internals or contributing to the project.

Acknowledgements

The DataLad project received support through the following grants:

  • US-German collaboration in computational neuroscience (CRCNS) project "DataGit: converging catalogues, warehouses, and deployment logistics into a federated 'data distribution'" (Halchenko/Hanke), co-funded by the US National Science Foundation (NSF 1429999) and the German Federal Ministry of Education and Research (BMBF 01GQ1411).

  • CRCNS US-German Data Sharing "DataLad - a decentralized system for integrated discovery, management, and publication of digital objects of science" (Halchenko/Pestilli/Hanke), co-funded by the US National Science Foundation (NSF 1912266) and the German Federal Ministry of Education and Research (BMBF 01GQ1905).

  • Helmholtz Research Center Jülich, FDM challenge 2022

  • German federal state of Saxony-Anhalt and the European Regional Development Fund (ERDF), Project: Center for Behavioral Brain Sciences, Imaging Platform

  • ReproNim project (NIH 1P41EB019936-01A1).

  • Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant SFB 1451 (431549029, INF project)

  • European Union’s Horizon 2020 research and innovation programme under grant agreements:

Mac mini instance for development is provided by MacStadium.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

All of the following contributed code (💻): glalteva, adswa, chrhaeusler, soichih, mvdoc, mih, yarikoptic, loj, feilong, jhpoelen, andycon, nicholsn, adelavega, kskyten, TheChymera, effigies, jgors, debanjum, nellh, emdupre, aqw, vsoch, kyleam, driusan, overlake333, akeshavan, jwodder, bpoldrack, yetanothertestuser, Christian Mönch, Matt Cieslak, Mika Pflüger, Robin Schneider, Sin Kim, Michael Burgardt, Remi Gau, Michał Szczepanik, Basile, Taylor Olson, James Kent, xgui3783, tstoeter, Stephan Heunis, Matt McCormick, Vicky C Lau, Chris Lamb, Austin Macdonald, Yann Büchau, Matthias Riße, Aksoo, David Guibert, Alex Shields-Weber


datalad-dataverse's People

Contributors

adswa, allcontributors[bot], aqw, behinger, bpoldrack, christian-monch, effigies, enicolaisen, jadecci, jernsting, jsheunis, jwodder, ksarink, likeajumprope, loj, matrss, mih, mslw, nadinespy, rgbayrak, shammi270787, yarikoptic


datalad-dataverse's Issues

Set up all-contributors bot to credit contributions

We previously used @all-contributors, a nice and almost fully automatic way of acknowledging contributions in open source projects.

I propose to set it up for this project as well. Documentation on how to do it is here, but it may require the right set of permissions, so whoever takes this issue on, please get in touch if you need anything. :)

Interaction with Dataverse guestbook feature via datalad

A dataverse dataset can have a (customizable) guestbook feature which allows dataset owners to add a set of questions that people have to answer before they can download a specific file or set of files.

As an example, see this data from FZJ on dataverse: https://data.fz-juelich.de/dataset.xhtml?persistentId=doi:10.26165/JUELICH-DATA/T1PKNZ

When selecting "Download" on a file, a popup appears that has to be completed first:

[screenshot: the guestbook popup that has to be completed before the download starts]

The content of this guestbook can be customized by the dataset owner. It can for example include a link to a data usage agreement and a checkbox that the user has to tick to say that they agree to the DUA. This could be a very useful automated functionality to have for situations where data can be openly shared with the caveat that a log of who accesses it (and that they agreed to the terms) needs to be kept.

However, there are some inefficiencies w.r.t. web-portal vs API access of dataverse data that I have experienced before, but I will have to dig up the details. The summary of the issues I had:

  • the guestbook has to be filled in for every download, which is inefficient if the user will be downloading many files manually
  • bulk download is possible via the web portal, but there was a cap on the number of files; datasets with many files could therefore not easily be downloaded in bulk via the portal
  • download via the API was possible, but it was also possible to circumvent the guestbook feature this way

This was for a specific major version of the API (I think 4, but don't bet on that). Hopefully this is not the case anymore, but some of these issues might rear their heads again.

Whatever role datalad can play in this domain to make agreeing to the guestbook requirements more seamless would be useful IMO.

Some other links:

Figure out how to provide a "docker export"

The current approach to providing a dataverse-docker deployment relies on docker-compose pulling the respective images over the network.
For the hackathon, it may not be feasible for everybody to pull several GB over the site's WiFi.

There should be a way to have the setup "exported" to a hard drive or USB stick.

We need to figure out how to do that (and how to use such an export as an alternative starting point, obviously).

Implement credential retrieval for create-sibling-dataverse

Right now, this just gets the token from the test environment.
Proper implementation should rely on CredentialManager from datalad-next and consider all the respective configurations.

In create_sibling_dataverse.py there is a to-be-implemented function _get_api_token for this.
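A rough sketch of what such a function might look like is below; the datalad-next CredentialManager import path, method names, and query parameters are assumptions that need to be verified against datalad-next, this is not the actual implementation.

import os

def _get_api_token(api_url):
    # current behaviour: token from the (test) environment
    token = os.environ.get('DATAVERSE_API_TOKEN')
    if token:
        return token
    # assumed datalad-next interface; check the real API before relying on it
    from datalad import cfg
    from datalad_next.credman import CredentialManager
    credman = CredentialManager(cfg)
    cred = credman.get(realm=api_url, _type_hint='token')
    return cred.get('secret') if cred else None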

Robustify DOI specification for `init|enableremote`

% DATAVERSE_API_TOKEN=5367f732-36bd-46ed-975a-9eeaeb98fe74  git annex initremote dv1 encryption=none type=external externaltype=dataverse url=http://localhost:8080 doi=10.5072/FK2/0NLYPQ
initremote dv1 
git-annex: external special remote error: ERROR: GET HTTP 404 - http://localhost:8080/api/v1/datasets/:persistentId/?persistentId=10.5072/FK2/0NLYPQ. MSG: {"status":"ERROR","message":"Dataset with Persistent ID 10.5072/FK2/0NLYPQ not found."}
failed
initremote: 1 failed

% DATAVERSE_API_TOKEN=5367f732-36bd-46ed-975a-9eeaeb98fe74  git annex initremote dv1 encryption=none type=external externaltype=dataverse url=http://localhost:8080 doi=doi:10.5072/FK2/0NLYPQ
initremote dv1 ok
(recording state in git...)

I think it makes sense to detect a full URL or a plain DOI string and format it correctly as doi:<doistring> for all input styles
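This could be handled by a small normalization helper; a hypothetical sketch (the helper name is not taken from the code base):

import re

def format_doi(spec):
    """Normalize a plain DOI, a 'doi:'-prefixed DOI, or a doi.org URL
    to the 'doi:<doistring>' form the dataverse API expects."""
    if not spec:
        raise ValueError('empty DOI specification')
    spec = spec.strip()
    if spec.startswith('doi:'):
        return spec
    match = re.match(r'^https?://(?:dx\.)?doi\.org/(.+)$', spec)
    if match:
        return 'doi:' + match.group(1)
    return 'doi:' + spec

For example, format_doi('10.5072/FK2/0NLYPQ'), format_doi('doi:10.5072/FK2/0NLYPQ') and format_doi('https://doi.org/10.5072/FK2/0NLYPQ') would all yield 'doi:10.5072/FK2/0NLYPQ'.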

Import data to dataverse

I used this script to import data to dataverse:

export API_TOKEN=**********
export SERVER_URL=https://demo.dataverse.org
export DATAVERSE_ID=root
export PERSISTENT_IDENTIFIER=doi.org/******

curl -H "X-Dataverse-key:$API_TOKEN" \
  -X POST "$SERVER_URL/api/dataverses/$DATAVERSE_ID/datasets/:import?pid=$PERSISTENT_IDENTIFIER&release=yes" \
  --upload-file dataset.json

Overview of resources

(This needs to be migrated into the README before the hackathon, but I'll create it as an issue just to collect infos)

There is a demo dataverse instance at https://demo.dataverse.org. You need to sign up, but it's fast and complication-free. Instead of searching for an institution, sign up with your email address and a username of your choice.
The instance allows:

  • Creating new dataverses
  • Creating new datasets

The API guide is at https://guides.dataverse.org/en/5.10.1/api. Of particular interest may be the section https://guides.dataverse.org/en/5.10.1/api/intro.html#developers-of-integrations-external-tools-and-apps, which is about third-party integrations. Among other things, it mentions https://pydataverse.readthedocs.io/en/latest, a Python library for accessing the Dataverse APIs and for manipulating and using Dataverse (meta)data - Dataverses, Datasets, Datafiles. The Open Science Framework also has a dataverse integration: https://github.com/CenterForOpenScience/osf.io/tree/develop/addons/dataverse.
@jsheunis also found: https://guides.dataverse.org/en/latest/admin/metadatacustomization.html

Bootstrap an empty dataverse instance with a dataset

This could become a test helper:

from pyDataverse.api import NativeApi
# dataverse base URL and admin token
api = NativeApi('http://localhost:8080', '5367f732-36bd-46ed-975a-9eeaeb98fe74')

# metadata for a dataverse (collection)
from pyDataverse.models import Dataverse
dvmeta = Dataverse(dict(
    name="myname",
    alias="myalias",
    dataverseContacts=[dict(contactEmail='[email protected]')]
))
# create under the 'root' collection
api.create_dataverse('root', dvmeta.json()).text

# metadata for a dataset in a collection
from pyDataverse.models import Dataset
dsmeta = Dataset(dict(
    title='mytitle',
    author=[dict(authorName='myname')],
    datasetContact=[dict(
        datasetContactEmail='[email protected]',
        datasetContactName='myname')],
    dsDescription=[dict(dsDescriptionValue='mydescription')],
    subject=['Medicine, Health and Life Sciences']
))
# create dataset in the just-created dataverse (collection)
api.create_dataset('myalias', dsmeta.json()).text

Needed: default dataset description

Whenever no custom dataset description is available or provided, we still need to provide one to dataverse. This is what a dataverse dataset that had a datalad dataset's Git repo pushed to it looks like by default:

[screenshot: default dataverse dataset view showing only the pushed Git repo files]

It would make sense to anchor the default description for this mode on explaining what a datalad dataset is, how one would work with it, and what the nature of these two files is (and also of the similar-looking, equally cryptic annex keys) -- i.e. say that this is not meant to be consumed without datalad.

We already have something similar in https://github.com/datalad-datasets/human-connectome-project-openaccess, which could be tailored for the present needs.

Glossary of Dataverse terms and how they map to DataLad concepts

  • Dataverse installation: A running, deployed dataverse instance

  • Dataverse dataset
    A dataset is a container for data, documentation, code and the metadata describing it. A DOI is assigned to each dataset.
    [diagram: structure of a dataverse dataset]
    Datasets have three levels of metadata:

    • Citation Metadata: any metadata that would be needed for generating a data citation and other general metadata that could be applied to any dataset;
    • Domain Specific Metadata: with specific support currently for Social Science, Life Science, Geospatial, and Astronomy datasets; and
    • File-level Metadata: varies depending on the type of data file

Datasets are created inside of dataverse collections. Users need to make sure to create datasets only in dataverse collections they have permissions to create datasets in. Data upload supports several methods (HTTP, Dropbox upload, rsync + ssh, command line DVUploader), but not all of those are supported by each Dataverse installation, and only one method can be used for each dataset:

If there are multiple upload options available, then you must choose which one to use for your dataset. A dataset may only use one upload method. Once you upload a file using one of the available upload methods, that method is locked in for that dataset. If you need to switch upload methods for a dataset that already contains files, then please contact Support by clicking on the Support link at the top of the application.

  • Dataverse collection
    A Dataverse collection is a container for datasets and other Dataverse collections. Users can create new Dataverse collections using "Add Data" -> "New dataverse" (by default they become administrator of that Dataverse collection) and can manage its settings.
    [diagram: nesting of dataverse collections and datasets]
    Dataverse collections can be created for a variety of purposes; the granularity the dataverse project uses to describe them is "Researcher", "Organization", or "Institution". Jülich Data has 28 dataverses, which are institutional/project-level dataverses. Those dataverses can only be created with confirming permissions from Jülich Data, but once they exist, they can be organized as the institutes see fit.

  • Dataset linking: A dataverse owner can “link” their dataverse to a dataset that exists outside of that dataverse, so it appears in the dataverse’s list of contents without actually being in that dataverse. One can link other users’ datasets to your dataverse, but that does not transfer editing or other special permissions to you. The linked dataset will still be under the original user’s control.

  • Dataverse linking: Dataverse linking allows a dataverse owner to “link” their dataverse to another dataverse, so the dataverse being linked will appear in the linking dataverse’s list of contents without actually being in that dataverse. Currently, the ability to link a dataverse to another dataverse is a superuser only feature.

  • Publishing a dataverse: Making a created dataverse public, or making datasets public. At least on JuelichData, the very first version of a dataset is called "DRAFT", the first published version is "1.0"

Declare software dependencies

This extension will require token handling. It makes little sense to me (@mih) to fiddle with the implementation in datalad-core. Instead we should use the latest features in datalad-next (which means requiring datalad_next >= 0.2.0; the latest release is 0.3.0).

For the implementation of tests, it makes sense to me (@mih) to use pytest right from the start. However, a datalad version with the necessary utilities (0.17.0) has not been released yet. This implies an additional dependency on datalad-core's master branch -- beyond the dependency link via datalad-next.

Additional dependencies will possibly come via #7.

Create a project/package description for the index page of the docs

It would be cool if the index of the docs could be stripped of the generic "datalad extension" content, and be amended with some general overview of the project. Stealing content from the brainhack project pitch description is completely fine, and it would be great if it could mention that the project originates from this hackathon.

JuelichData quirks

As the integration with JülichData (see #15) is a driving motivation behind this project, let's assemble a list of supported or unsupported features and other peculiarities we might need to keep in mind.

Jülich Data's mission statement is:

Jülich DATA primary use case is serving as a registry* for scholarly data.

  • The DOIs of published datasets registered with Jülich DATA may point to external repositories, or point to the landing page on Jülich DATA.
  • The landing page on Jülich DATA gives free and open access to all bibliographic and/or scientific metadata of the respective dataset. All terms of licencing still have to be fulfilled.

* A special type of data repository, storing bibliographic and scientific metadata, but not the data itself. Instead, the data is only linked to from metadata fields. Hybrid forms are possible. Jülich DATA allows for depositing data, but focus is on registry.

Beyond this, their organization docs state:

Every staff member of Research Centre Jülich is (automatically) allowed to add datasets to the Campus Collection, but cannot publish it standalone. Entries are curated by the central library.
Please note: a director might decide you may not add data to the Campus Collection. Rouge entries will be either moved or deleted by its curators.

The consequences arising from this are:

  • Data upload seems possible via HTTP (limited to 2GB per file), but publication of the uploaded files requires review by the central library team.
  • Institutional or project-level dataverses have to be created with confirming permissions. Trying to create a new dataverse results in an XML parsing error for normal users. The docs say "decentralized data managers" are allowed to do that.
  • Datasets at the root level of the dataverse installation are not allowed. Trying to click "Add data" under your user account without selecting a collection causes an XML parsing error for normal users.
    [screenshot: XML parsing error shown to normal users]
  • New datasets can be created in the "Campus" collection https://data.fz-juelich.de/dataverse/campus; they are in draft mode and publication requires review by the central library
  • Metadata-wise, there is an additional "FZJ Metadata" block with drop-down menus "institute" and "PoV IV topic"

Set up a Zenodo.json or CITATION.cff file

Once we have made a release to PyPI, we will be archiving the release on Zenodo for preservation and citability. This requires one of two files:

a) Either a zenodo.json file. A minimal example is already in this repo. It would need to contain the names of all contributors as well as some minimal metadata (a useful example could be the zenodo.json file from datalad-osf).
b) Or a CITATION.cff file. This is a new feature from GitHub, and just a few days ago Zenodo announced support for it. An example is in the datalad repository. If we go for CITATION.cff, we need to remove the zenodo.json file from this repository, else Zenodo will pick that one up instead.

Documenting experience with Dataverse v4 issues

In case this could prove useful, I am writing down a few notes from my previous experience with creating a collection + dataset using an installation of Dataverse 4 in 2020/2021, on dataverse.nl. This is the dataset in question: https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/R1TNL8

Issue 1

One main issue that I encountered was with uploading the full BIDS dataset via the web portal's HTTP upload interface. I had received a vague error message like "This file already exists" with no reference to the duplicate files. I had to inspect this manually in the file list (and also via API). I found the files in question to be missing, and when trying to upload them again, received the same issue. This was then found to be a known issue in dataverse. Here's just a quick selection of closed issues that highlight the progression of these challenges:

The workaround at that point was to upload the files in compressed format. I don't know why this worked, but it did (after many tries).

Issue 2

The compressed-files upload approach also "solved" the other issue that I had, which was that the HTTP uploader couldn't handle more than ~1000 files at a time. However, to be sure that I could make use of the tree view of files in the dataset, I had to compress my subset of files into the same directory structure that they would have in the final tree. I didn't have to explicitly specify the relative path of a file with extra metadata. I suspect that once dataverse processed the compressed file after upload, there was some internal mapping to the path field in the file-specific metadata.

Here is a link to information on the dataverse user docs regarding compressed files: https://guides.dataverse.org/en/5.10.1/user/dataset-management.html#compressed-files

Issue 3

Another issue was bulk download of the whole dataset from the web portal, which resulted in a 414 Request-URI Too Long error. Here's a related issue:

The suggestion then was to wait for the upgrade to dataverse v5 to address this issue. I waited for a long time, but perhaps they have (hopefully) already addressed this issue and completed the upgrade.

Issue 4

Lastly, I also encountered an issue with the guestbook functionality, which is meant to get people to agree to some data use terms digitally before download: it was possible to circumvent it via the API. I had to restrict all files in the dataset as a workaround, and this required manually granting access whenever someone requested a download. This is of course the opposite of what one wants to achieve with the guestbook functionality. The guestbook functionality and issue are described here:

Determine how a datalad-clone from dataverse would work (user perspective)

The URL for a dataset shown in the browser is something like

http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/CHHQWH&version=DRAFT

The same page is pulled with the URL

http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/CHHQWH

This seems relatively specific (enough for dataverse) and pretty easy to map to the actual datalad-annex URL we could immediately clone from:

datalad-annex::?type=external&externaltype=dataverse&url=http%3A//localhost%3A8080&doi=doi%3A10.5072/FK2/CHHQWH&encryption=none

DataLad has a mapping mechanism for that. Here is an example from the -next extension

register_config(
    'datalad.clone.url-substitute.webdav',
    'webdav(s):// clone URL substitution',
    description="Convenience conversion of custom WebDAV URLs to "
    "git-cloneable 'datalad-annex::'-type URLs. The 'webdav://' "
    "prefix implies a remote sibling in 'filetree' or 'export' mode "
    "See https://docs.datalad.org/design/url_substitution.html for details",
    dialog='question',
    scope='global',
    default=(
        r',^webdav([s]*)://([^?]+)$,datalad-annex::http\1://\2?type=webdav&encryption=none&exporttree=yes&url={noquery}',
    ),
)

Here are the full docs for this feature: http://docs.datalad.org/en/latest/design/url_substitution.html
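An analogous rule for dataverse could look roughly like the following, mirroring the webdav example above. The config key, pattern, and replacement are illustrative assumptions only, and URL-encoding of the DOI (as shown in the datalad-annex URL above) is glossed over:

register_config(
    'datalad.clone.url-substitute.dataverse',
    'dataverse dataset landing page clone URL substitution (sketch)',
    description="Hypothetical convenience conversion of dataverse "
    "dataset landing page URLs to git-cloneable 'datalad-annex::'-type "
    "URLs; pattern and replacement are illustrative only",
    dialog='question',
    scope='global',
    default=(
        # assumes input like
        # https://<host>/dataset.xhtml?persistentId=doi:<doistring>
        r',^https://([^/]+)/dataset\.xhtml\?persistentId=doi:(.*)$'
        r',datalad-annex::?type=external&externaltype=dataverse'
        r'&url=https://\1&doi=doi:\2&encryption=none',
    ),
)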

Programmatically determine list of dataset "subjects"

Calling pyDataverse.api.NativeApi.create_dataset() requires specifying "subjects". This is not optional, and it cannot be an arbitrary string.

The WebUI exposes a predefined list of identifiers

[screenshot: the predefined subject list in the WebUI]

We have to be able to query this list programmatically in order to give meaningful advice for composing this mandatory dataset metadata.
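A hedged sketch of how this could be queried via the native API's metadata block listing; the endpoint path is documented in the API guide, but the exact response layout (and whether controlled vocabulary values are included) should be verified for the targeted Dataverse version:

import requests

def list_subjects(base_url):
    """Try to obtain the controlled vocabulary for the 'subject' field
    of the 'citation' metadata block."""
    resp = requests.get(f'{base_url}/api/metadatablocks/citation')
    resp.raise_for_status()
    fields = resp.json().get('data', {}).get('fields', {})
    subject = fields.get('subject', {})
    # assumption: vocabulary values are exposed under this key
    return subject.get('controlledVocabularyValues', [])

# e.g. list_subjects('https://demo.dataverse.org')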

Improve error with missing dataset DOI

Current state:

% DATAVERSE_API_TOKEN=5367f732-36bd-46ed-975a-9eeaeb98fe74  git annex initremote dv1 encryption=none type=external externaltype=dataverse url=http://localhost:8080
initremote dv1 
git-annex: external special remote error: ERROR: GET HTTP 404 - http://localhost:8080/api/v1/datasets/:persistentId/?persistentId=. MSG: {"status":"ERROR","message":"Dataset with Persistent ID  not found."}
failed
initremote: 1 failed
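A trivial up-front check, performed before any API call, would already turn this into a readable message; a sketch (function name and wording are made up, and how the configured value reaches it is left open):

def check_doi_config(doi):
    """Fail early with an understandable message when no DOI was given."""
    if not doi:
        raise ValueError(
            "No dataset DOI configured for this special remote; "
            "re-run initremote/enableremote with doi=<doi>.")
    return doi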

What is Dataverse's data concept?

We need a good grasp on what dataverse's notion of dataverses, collections, datasets, and files is, and on what data deposition looks like, in order to have an idea of what could be done with respect to a datalad export of sorts.

Create CONTRIBUTING.md

It would be good to have a CONTRIBUTING.md file added to this repository, similar to the one in the datalad-osf repo.
We should migrate the instructions into it - from cloning and Python environment setup to how to set up a dataverse instance from scratch with Docker.

Draft a create-sibling command

We need to figure out how this should look in principle.
For example: creating a collection and the dataset (and its subdatasets) within it vs. using an existing one. I guess both should be possible. However, using an existing one may imply that we can't put anything into the collection's metadata. It may need a different mode of operation rather than just a "create or not".

Action items from Juelich's research data management challenge

One of the main motivations for this project was a commitment to an internal "Research data management challenge" of the Research Center Juelich, in which we aim to provide integration between the imaging core facility data acquisition site and the dataverse instance of the Jülich central library, "Jülich Data" (data.fz-juelich.de).
Therefore, the work packages in the proposal are a source of project aims for this hackathon. Here are the relevant excerpts from the introduction and work package 3, with emphasis added by me:

"and INM data output will become more findable, accessible, interoperable, and reusable (FAIR) via standardization and integration with Jülich DATA"
WP3: [...] We will partner with the FZJ Central Library to develop a metadata schema and workflow to enable INM-ICF users to programmatically register a dataset with the Jülich Data portal. [...] Based on previous work on the interoperability of this solution with key services, such as the OpenScience Framework (http://docs.datalad.org/projects/osf), we will provide INM-ICF users with software that can create and populate a dataset record on Jülich Data. [...] As a use case for demonstrating the to-be-developed metadata schema and workflow applied to a heterogeneous dataset, we will register the Jülich Dual-Tasking & Aging study, comprising a host of different types of MRI, behavioral and self-report data, with the Jülich DATA portal. Within this project metadata reporting will be restricted to anonymous information, such as acquisition parameters, QC metrics, general study descriptions.

I extract the following out of it:

Software features

  • Create a dataverse dataset inside of a dataverse collection (see #14 for a delineation of the terms)
  • Perform a dataset export to the dataverse dataset via their supported upload protocols. There are several supported protocols, but not all of them might be enabled for a given dataverse installation.

Metadata

  • Check the metadata schemas in dataverse. There are three levels of metadata: Citation Metadata (any metadata that would be needed for generating a data citation and other general metadata that could be applied to any dataset); Domain Specific Metadata (with specific support currently for Social Science, Life Science, Geospatial, and Astronomy datasets); and File-level Metadata (varies depending on the type of data file)
  • Check for feasibility of metadata extractors for the available schemas

Usecase/Examples

  • Take a look at the DTA dataset and check how easy the required anonymous information can be extracted
  • Figure out how this metadata can best be mapped to available dataverse metadata standards
  • Find out which version of dataverse is run at the central library, and what the supported features are (who can create what, which data upload protocols are enabled, ...).
  • Make a list of features the Jülich dataverse has or hasn't, check that concepts in #14 apply to it

Docker hub is unhappy with our test frequency

ERROR: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
Build exited with code 1

We should limit the runs (maybe disable the branch CI run and just keep the PR one). Or follow the advice.

Implement test for (export) of identical files under different names

This should be a non-issue for a normal (non-export) upload -- deduplication is inherent with the annex-key setup. But #18 suggests that complications could arise from a dataset/verse containing multiple redundant copies of the same file under different names.

We should have a test for that.
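A minimal pytest sketch of such a test; the 'dataverse_sibling' fixture (assumed to set up an export-enabled sibling named 'dataverse' on a test instance) is hypothetical, while the datalad Python API calls used are real:

from datalad.api import Dataset

def test_export_identical_files(tmp_path, dataverse_sibling):
    ds = Dataset(tmp_path / 'ds').create()
    (ds.pathobj / 'copy1.txt').write_text('identical content')
    (ds.pathobj / 'copy2.txt').write_text('identical content')
    ds.save(message='add two identical files')
    # push both the export and the annex content to the sibling
    res = ds.push(to='dataverse')
    assert all(r.get('status') in ('ok', 'notneeded') for r in res)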

Not specific to dataverse.org

create-sibling-dataverse docs state:

Create a dataset sibling(-tandem) on dataverse.org.

This is misleading. It should work for any dataverse deployment.

Determine Dataverse API equivalents for git-annex special remote operations

We need to explore what types of operations can be mapped onto the dataverse API in order to see what functionality can be supported at this most basic level. It would be very attractive to use this layer, because it would automatically enable interoperability with dataverse for file deposition, file retrieval, Git history deposition, datalad-clone from dataverse DOIs, and even usage outside datalad with git-annex directly.

Note that git-annex operates strictly in a per-file-version mode, so things like bulk-uploads of ZIP file etc. are not part of the possibilities.

Here is the list of operations:

Must have

  • TRANSFER STORE: file upload to deposit under a name matching the annex key (git-annex content identifier)
  • TRANSFER RETRIEVE: file download based on an annex key
  • CHECKPRESENT: report whether a file matching a particular annex key is available. It is important that the implementation does not report presence when in fact an upload is still ongoing and not complete.
  • REMOVE: Delete a file matching a particular annex key

Nice to have

  • SETURLPRESENT: determine a (persistent) URL where file content can be downloaded for a file deposited by TRANSFER STORE
  • TRANSFEREXPORT STORE: Like TRANSFER STORE but store a file under a specific filename on dataverse
  • TRANSFEREXPORT RETRIEVE: Like TRANSFER RETRIEVE, but retrieve file content by a file name given to TRANSFEREXPORT STORE rather than an annex key
  • CHECKPRESENTEXPORT: Like CHECKPRESENT, but check for a filename given to TRANSFEREXPORT STORE
  • REMOVEEXPORT: Like REMOVE, but remove a file by filename given to TRANSFEREXPORT STORE

To make things more efficient/convenient

  • REMOVEEXPORTDIRECTORY: remove all files with "logical" paths (as given to TRANSFEREXPORT STORE) inside a particular directory
  • RENAMEEXPORT: Change the "logical" path of a deposited file

Extremely nice to have

Ability to report a list of files in a dataset with the following information for each file

  • filename (logical/tree path)
  • size in bytes
  • content identifier: an identifier provided by (or computable from information provided by) dataverse that is
    • stable, so when a file has not changed, its content identifier remains the same
    • changes when a file is modified
    • be as unique as possible, but not necessarily fully unique. A hash of the content would be ideal, but a (size, mtime, inode) tuple is better than nothing
    • be reasonably short (needs to be tracked)

and possibly the same even for historical versions.
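To make the mapping concrete, here is a bare skeleton of how the "must have" operations line up with an annexremote-based special remote (annexremote is the Python helper library for git-annex external special remotes). Everything inside the method bodies is a placeholder, not the actual dataverse implementation:

from annexremote import Master, SpecialRemote, RemoteError

class DataverseRemote(SpecialRemote):
    def initremote(self):
        # record/validate the configuration given to `git annex initremote`
        self.url = self.annex.getconfig('url')
        self.doi = self.annex.getconfig('doi')
        if not self.url or not self.doi:
            raise RemoteError('url= and doi= are required')

    def prepare(self):
        # runs on every startup: re-read config, set up the API client
        self.url = self.annex.getconfig('url')
        self.doi = self.annex.getconfig('doi')

    def transfer_store(self, key, filename):
        # TRANSFER STORE: upload `filename` under a name matching `key`
        raise RemoteError('upload not implemented in this sketch')

    def transfer_retrieve(self, key, filename):
        # TRANSFER RETRIEVE: download the content for `key` into `filename`
        raise RemoteError('download not implemented in this sketch')

    def checkpresent(self, key):
        # CHECKPRESENT: must only report True for fully completed uploads
        return False

    def remove(self, key):
        # REMOVE: delete the deposit matching `key`
        raise RemoteError('removal not implemented in this sketch')

def main():
    master = Master()
    master.LinkRemote(DataverseRemote(master))
    master.Listen()

if __name__ == '__main__':
    main()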

check if filename validation is needed and if so, write helper

During manual dataset creation, "upload via the web interface" allows specifying a "file path", a "hierarchical directory structure path used to display file organization and support reproducibility". An uploaded file will be placed under the given path. E.g., uploading "myfile.txt" and supplying "this/is/a/path/" results in the downloaded dataset being a zip file with the directory tree this/is/a/path/myfile.txt. (It seems this doesn't affect the visualization on dataverse, which remains a flat list of files.) This file path has character restrictions:

Directory Name cannot contain invalid characters. Valid characters are a-Z, 0-9, '_', '-', '.', '\', '/' and ' ' (white space).

If we make use of these paths, we need a helper to ensure only valid characters are used. File names themselves seem to be fine with otherwise-invalid characters; e.g., I was able to publish a file with an "ü" in its name on the demo instance.
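A sketch of such a helper for the directory-path restriction quoted above (the allowed-character set, including the backslash, should be double-checked against the targeted dataverse version):

import re

# assumption: letters, digits, '_', '-', '.', '\', '/' and space are allowed
_VALID_DIRPATH = re.compile(r'^[a-zA-Z0-9_\-.\\/ ]*$')

def is_valid_directory_path(path):
    """Return True if `path` only contains characters dataverse accepts
    for a file's directory path."""
    return bool(_VALID_DIRPATH.match(path))

# e.g.
# is_valid_directory_path('this/is/a/path')   -> True
# is_valid_directory_path('päth/with/umlaut') -> False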
