j535d165 / datahugger

One downloader for many scientific data and code repositories!

Home Page: https://J535D165.github.io/datahugger/

License: MIT License


datahugger's Introduction

Datahugger - Where DOI 👐 Data

Datahugger is a tool to download scientific datasets, software, and code from a large number of repositories based on their DOI or URL. With Datahugger, you can automate the downloading of data and improve the reproducibility of your research. Datahugger provides a straightforward Python interface as well as an intuitive command-line interface (CLI).

Supported repositories

Datahugger offers support for more than 377 generic and specific (scientific) repositories (and more to come!).

Datahugger supports Zenodo, Dataverse, DataONE, GitHub, Figshare, Hugging Face, Mendeley Data, Dryad, OSF, and many more.

We are still expanding Datahugger with support for more repositories. You can help by requesting support for a repository in the issue tracker. Pull Requests are very welcome as well.

Installation

PyPI

Datahugger requires Python 3.6 or later.

pip install datahugger

Getting started

Datahugger with Python

Load a dataset (or any digital asset) from a repository with the datahugger.get() function. The first argument is the DOI or URL, and the second is the folder name to store the dataset (it will be created if it does not exist).

The following code loads dataset 10.5061/dryad.mj8m0 into the folder data.

import datahugger

# download the dataset to the folder "data"
datahugger.get("10.5061/dryad.mj8m0", "data")

For an example of how this can integrate with your work, see the example workflow notebook (also available on Google Colab).

Datahugger with command line

The command-line tool datahugger provides an easy interface to download data. The first argument is the DOI or URL, and the second argument is the name of the folder to store the dataset (it will be created if it does not exist).

% datahugger 10.5061/dryad.mj8m0 data
Collecting...
NestTemperatureData.csv            : 100%|████████████████████████████████████████| 607k/607k
README_for_NestTemperatureData.txt : 100%|██████████████████████████████████████| 2.82k/2.82k
ExternalTemps.csv                  : 100%|██████████████████████████████████████| 1.06k/1.06k
README_for_ExternalTemps.txt       : 100%|██████████████████████████████████████| 2.82k/2.82k
InternalEggTempData.csv            : 100%|██████████████████████████████████████████| 664/664
README_for_InternalEggTempData.txt : 100%|██████████████████████████████████████| 2.82k/2.82k
SoilSimulation_Output.csv          : 100%|████████████████████████████████████████| 229M/229M
README_for_SoilSimulation_[...].txt: 100%|██████████████████████████████████████| 2.82k/2.82k
Dataset successfully downloaded.

Tips and tricks

Tip: On some systems, you have to quote the DOI or URL. For example: datahugger "10.5061/dryad.mj8m0" data.

License

MIT

Contact

Please feel free to reach out with questions, comments, and suggestions. The issue tracker is a good starting point. You can also email me at [email protected].

datahugger's People

Contributors

davetromp, j535d165, jteijema, kianmeng, micafer, peterlombaers, pre-commit-ci[bot], senui


datahugger's Issues

Remove timing from progress bar to avoid needless changes

For the sake of reproducibility and traceability, it might be better to remove the timing and download speed from the final result. This prevents git from tracking spurious changes.

% datahugger 10.5061/dryad.x3ffbg7m8 data
README_Pfaller_Robinson_20[...].txt: 100%|█████████████████████████████████| 17.1k/17.1k [00:00<00:00, 2.62MB/s]
Pfaller_Robinson_2022_Glob[...].csv: 100%|█████████████████████████████████████| 709k/709k [00:00<00:00, 904kB/s]
Repository content successfully downloaded.
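
One way to implement this (a minimal sketch, assuming the progress bars come from tqdm; the file name and sizes below are illustrative): pass tqdm a bar_format that omits the default elapsed/remaining/rate suffix, so repeated runs of the same download produce identical output.

from tqdm import tqdm

# A bar_format without tqdm's default "[elapsed<remaining, rate]" suffix.
STABLE_BAR = "{desc}: {percentage:3.0f}%|{bar}| {n_fmt}/{total_fmt}"

data = b"x" * 607_000  # stand-in for a downloaded file
with tqdm(total=len(data), desc="NestTemperatureData.csv",
          unit="B", unit_scale=True, bar_format=STABLE_BAR) as pbar:
    for i in range(0, len(data), 4096):
        pbar.update(min(4096, len(data) - i))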

Implement pagination for max results per page

Thank you very much for this very useful resource.

I'm encountering a weird case where the downloader misses many files on osf.io.

files = datahugger.get('https://osf.io/3jhtb/', 'macgregor_2019')

Most of the .wav files in the Stimuli directory are neither listed nor downloaded.

Any idea?

Thank you!
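
The OSF API paginates file listings (10 results per page by default), which would explain the missing files: only the first page is being read. A minimal sketch of a paginated listing against the public OSF v2 API (not Datahugger's internal code; it also only walks the top level, so folders would need recursion):

import requests

def list_osf_files(node_id, provider="osfstorage"):
    """Yield all top-level file entries for an OSF node, following pagination."""
    url = f"https://api.osf.io/v2/nodes/{node_id}/files/{provider}/"
    while url:
        page = requests.get(url).json()
        yield from page["data"]
        url = page["links"].get("next")  # None on the last page

# e.g., for the project in this report:
# files = list(list_osf_files("3jhtb"))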

[NEW REPOSITORY REQUEST] Kaggle

What's the name and URL of the repository?
Kaggle. kaggle.com

Does the repository support DOIs? If so, please provide one or more DOIs
No.

Are you interested in implementing support for this repository in Datahugger?
Waiting for the community.

Support for external storage at OSF

This is such a great package, thank you :-)
It seems that support for OSF works with OSF's native storage but not with external storage mounted at OSF.
Why is it useful to include that?
OSF supports external storage: for example, any WebDAV service (except for Yoda, which for some reason doesn't work) can be added as an add-on to an OSF project. There are many reasons to do this; for example, one can write data from software directly to osf.io by mounting the external storage as a volume on their system. I use SURF Research Drive for that.
But downloading from osf.io using Datahugger doesn't seem to include the files in external storage. When I moved them to osf.io storage, it worked.

You can see my use case here:
https://osf.io/ews27/
https://github.com/MindTheGap-ERC/LMA_utils/blob/main/Track_U_at_bottom.ipynb
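
Supporting this would likely mean enumerating all storage providers mounted on a node instead of assuming osfstorage. A sketch against the public OSF v2 API (illustrative, not Datahugger's current internals):

import requests

def list_osf_providers(node_id):
    """Return the names of all storage providers on an OSF node,
    i.e. osfstorage plus any mounted add-ons such as WebDAV storage."""
    r = requests.get(f"https://api.osf.io/v2/nodes/{node_id}/files/")
    r.raise_for_status()
    return [entry["attributes"]["name"] for entry in r.json()["data"]]

# e.g., list_osf_providers("ews27") for the project above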

Failed to download: not well-formed (invalid token): line 27, column 80

% datahugger https://hdl.handle.net/10622/NHJZUD tmp --log-level DEBUG
Collecting...
DEBUG:root:Resolve service: search netloc 'hdl.handle.net'
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): doi.org:443
DEBUG:urllib3.connectionpool:https://doi.org:443 "GET /None HTTP/1.1" 404 None
Traceback (most recent call last):
  File "/Users/.../versions/sra-dev/bin/datahugger", line 8, in <module>
    sys.exit(main())
  File "/Users/.../datahugger/datahugger/__main__.py", line 104, in main
    raise err
  File "/Users/.../datahugger/datahugger/__main__.py", line 83, in main
    get(
  File "/Users/.../datahugger/datahugger/api.py", line 267, in get
    return _base_request(
  File "/Users/.../datahugger/datahugger/api.py", line 206, in _base_request
    service_class = _resolve_service(url, doi)
  File "/Users/.../datahugger/datahugger/api.py", line 318, in _resolve_service
    service_class = _resolve_service_with_re3data(doi)
  File "/Users/.../datahugger/datahugger/api.py", line 343, in _resolve_service_with_re3data
    publisher = get_datapublisher_from_doi(doi)
  File "/Users/.../datahugger/datahugger/utils.py", line 100, in get_datapublisher_from_doi
    tree = ET.fromstring(r.content)
  File "/Users/.../versions/3.9.7/lib/python3.9/xml/etree/ElementTree.py", line 1347, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 27, column 80
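
The "GET /None" line in the log shows the root cause: the handle.net URL did not resolve to a DOI, and the resulting 404 error page was fed straight into the XML parser. A hypothetical hardened version of get_datapublisher_from_doi (names mirror the traceback; the request details are assumptions, not Datahugger's actual code):

import xml.etree.ElementTree as ET
import requests

def get_datapublisher_from_doi(doi):
    if doi is None:
        # Fail early with a clear message instead of requesting "/None".
        raise ValueError("No DOI available to resolve a data publisher.")
    r = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.datacite.datacite+xml"},
    )
    r.raise_for_status()  # surface a 404 instead of parsing its body
    if "xml" not in r.headers.get("Content-Type", ""):
        raise ValueError(f"Expected XML metadata for DOI {doi}.")
    tree = ET.fromstring(r.content)
    # ... extract the publisher from the tree as before ...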

auto_unzip option is not available

From the documentation I understand the following:

Some services like Zenodo don't offer an option to preserve folder structures. Therefore, the content is often zipped before being uploaded to the service. In this case, Datahugger will unzip the file to the output folder by default.

Disable auto unzip function

datahugger.get("10.5061/dryad.x3ffbg7m8", "data", auto_unzip=False)

The auto_unzip parameter is no longer in the code. I do see that the unzip parameter can be set, but it is not used to skip the unzip process.

So:
datahugger.get("10.5061/dryad.x3ffbg7m8", "data", auto_unzip=False)
also does not skip the unzip process.

We would need to change the code so that it takes the unzip parameter into account, as in the sketch below.
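
A minimal sketch of the intended behavior (function and parameter names are illustrative, not Datahugger's current internals): the per-file download routine would consult the unzip flag before extracting.

import io
import os
import zipfile
import requests

def download_file(url, output_folder, unzip=True):
    """Download one file; extract zip archives only when unzip is True."""
    os.makedirs(output_folder, exist_ok=True)
    r = requests.get(url)
    r.raise_for_status()
    if unzip and url.endswith(".zip"):
        # Restore the archived folder structure in the output folder.
        with zipfile.ZipFile(io.BytesIO(r.content)) as z:
            z.extractall(output_folder)
    else:
        # unzip=False (or a non-zip file): store the file as-is.
        name = url.rsplit("/", 1)[-1]
        with open(os.path.join(output_folder, name), "wb") as f:
            f.write(r.content)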

[NEW REPOSITORY REQUEST] Yoda

What's the name and URL of the repository?
Yoda. Self-hosted service like Dataverse. Example instance https://public.yoda.uu.nl/.

Does the repository support DOIs? If so, please provide one or more DOIs
Yes, DOIs are supported. For example 10.24416/UU01-8TX6RL

Are you interested in implementing support for this repository in Datahugger?
Waiting for the community.

fairly toolset

Hi @J535D165, @c-martinez informed us about your software. We have also been working for a while on providing efficient access methods to research datasets, and have developed the https://github.com/ITC-CRIB/fairly toolset, including a Python library, command-line interface, and JupyterLab extension. The toolset allows local dataset creation, metadata management, and smart upload/download with differential push and pull mechanisms. I see that you focus on downloading files only, but support more platforms. Would you be interested in collaborating? Please have a look at our repository and let us know if you are interested.

Missing dep

I installed the package using pip in a new env and ran the example given in the README.

First, I got an error that pandas is not installed. It seems that pandas is a hard dep, but it's included as an optional dep in pyproject.toml. Then, I installed pandas, ran the README example (with and without quotes) and got this error:

$ datahugger 10.5061/dryad.mj8m0 data1  
Error: 'stash:file-download'

Do you think this is a data repository that needs to be supported?
Please request support in the issue tracker:

	https://github.com/J535D165/datahugger/issues/new/choose

Support datasets from the BioImage archive

Hi @J535D165 ,

I'm trying to download a dataset from the BioImage Archive, for example this one, but it raises a not-supported error:

โฏ datahugger 10.6019/S-BIAD1232 test-bioimage                                           
Error: Data protocol for 10.6019/S-BIAD1232 not found.

Do you think this is a data repository that needs to be supported?
Please request support in the issue tracker:

	https://github.com/J535D165/datahugger/issues/new/choose

However, it looks like the BioImage Archive is actually integrated with Dataverse, which is in turn supported by Datahugger.
So my question really is: how can I download a dataset from a Dataverse-supported repository?

Thanks. :)
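
For reference, Datahugger accepts plain URLs as well as DOIs, so a dataset on a Dataverse-backed repository can usually be fetched by passing its landing-page URL or DOI directly (the identifier below is a hypothetical placeholder, not a real dataset):

import datahugger

# Replace with the real DOI or dataset landing-page URL.
datahugger.get("https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XXXXXX", "data")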

implement checksum checking as feature of datahugger

Currently I use Datahugger in a project to retrieve dataset files. I also use Datahugger to get the hash and hash type corresponding to each file. I have written some Python code that hashes the downloaded files using the correct hash type and compares the results with the hashes given by the service.
This way I can check whether the files' checksums are OK before continuing to process them.

The idea is to integrate this checksum checking as a feature of datahugger, maybe as part of the get method.

For example:

datahugger.get("10.5061/dryad.x3ffbg7m8", "data", checksum=True)

This would calculate the hashes of the downloaded files, compare them to the given hashes, and report back to the user which files do or do not have matching hashes.
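
A minimal sketch of the verification step (the helper and the expected-hashes format are illustrative, not an existing Datahugger API):

import hashlib
from pathlib import Path

def verify_checksums(folder, expected):
    """Compare downloaded files against service-provided hashes.

    expected maps filename -> (algorithm, hex digest), e.g.
    {"data.csv": ("md5", "9e107d9d372bb6826bd81d3542a419d6")}.
    """
    results = {}
    for name, (algorithm, digest) in expected.items():
        h = hashlib.new(algorithm)
        h.update(Path(folder, name).read_bytes())
        results[name] = h.hexdigest() == digest
    return results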
