
emg-toolkit's Introduction


The Metagenomics toolkit (mg-toolkit) enables scientists to download all of the sample metadata for a given study or sequence into a single CSV file.

Install metagenomics toolkit

Through pip

pip install -U mg-toolkit

Or using conda

conda install -c bioconda mg-toolkit

Usage

$ mg-toolkit -h
usage: mg-toolkit [-h] [-V] [-d]
                  {original_metadata,sequence_search,bulk_download} ...

Metagenomics toolkit
--------------------

positional arguments:
  {original_metadata,sequence_search,bulk_download}
    original_metadata   Download original metadata.
    sequence_search     Search non-redundant protein database using HMMER
    bulk_download       Download result files in bulks for an entire study.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         print version information
  -d, --debug           print debugging information

Examples

Download metadata:

$ mg-toolkit original_metadata -a ERP001736

Search non-redundant protein database using HMMER and fetch metadata:

$ mg-toolkit sequence_search -seq test.fasta -out test.csv -db full -incE 0.02

Databases:
- full - Full length sequences (default)
- all - All sequences
- partial - Partial sequences

How to bulk download result files for an entire study?

usage: mg-toolkit bulk_download [-h] -a ACCESSION [-o OUTPUT_PATH]
                                [-p {1.0,2.0,3.0,4.0,4.1,5.0}]
                                [-g {statistics,sequence_data,functional_analysis,taxonomic_analysis,taxonomic_analysis_ssu_rrna,taxonomic_analysis_lsu_rrna,non-coding_rnas,taxonomic_analysis_itsonedb,taxonomic_analysis_unite,taxonomic_analysis_motu,pathways_and_systems}]

optional arguments:
-h, --help            show this help message and exit
-a ACCESSION, --accession ACCESSION
                        Provide the study/project accession of your interest, e.g. ERP001736, SRP000319. The study must be publicly available in MGnify.
-o OUTPUT_PATH, --output_path OUTPUT_PATH
                        Location of the output directory, where the downloadable files are written to.
                        DEFAULT: CWD
-p {1.0,2.0,3.0,4.0,4.1,5.0}, --pipeline {1.0,2.0,3.0,4.0,4.1,5.0}
                        Specify the version of the pipeline you are interested in.
                        If your study of interest has been analysed with
                        multiple pipeline versions but you only want a
                        particular one, use this option to filter the results
                        down to that version.
                        DEFAULT: Downloads all versions
-g {statistics,sequence_data,functional_analysis,taxonomic_analysis,taxonomic_analysis_ssu_rrna,taxonomic_analysis_lsu_rrna,non-coding_rnas,taxonomic_analysis_itsonedb,taxonomic_analysis_unite,taxonomic_analysis_motu,pathways_and_systems}, --result_group {statistics,sequence_data,functional_analysis,taxonomic_analysis,taxonomic_analysis_ssu_rrna,taxonomic_analysis_lsu_rrna,non-coding_rnas,taxonomic_analysis_itsonedb,taxonomic_analysis_unite,taxonomic_analysis_motu,pathways_and_systems}
                        Provide a single result group if needed.
                        Supported result groups are:
                        - statistics
                        - sequence_data (all versions)
                        - functional_analysis (all versions)
                        - taxonomic_analysis (1.0-3.0)
                        - taxonomic_analysis_ssu_rrna (>=4.0)
                        - taxonomic_analysis_lsu_rrna (>=4.0)
                        - non-coding_rnas (>=4.0)
                        - taxonomic_analysis_itsonedb (>= 5.0)
                        - taxonomic_analysis_unite (>= 5.0)
                        - taxonomic_analysis_motu  (>= 5.0)
                        - pathways_and_systems (>= 5.0)
                        DEFAULT: Downloads all result groups if not provided.
                        (default: None).

How to download all files for a given study accession?

$ mg-toolkit -d bulk_download -a ERP009703

How to download results of a specific pipeline version for a given study accession?

$ mg-toolkit -d bulk_download -a ERP009703 -p 4.0

How to download specific result file groups (e.g. functional analysis only) for a given study accession?

$ mg-toolkit -d bulk_download -a ERP009703 -g functional_analysis

The bulk downloader will store a .tsv file with all the metadata for each downloaded file.
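As a sketch of how that metadata file might be consumed afterwards (the column names here are illustrative assumptions, not the tool's actual schema), a .tsv parses with the standard csv module:

```python
import csv
import io

# Illustrative TSV content; the real columns written by bulk_download
# depend on the mg-toolkit version in use.
tsv_text = (
    "analysis_accession\tfile_name\n"
    "MGYA00005084\tERP009703_taxonomy_abundances_SSU_v5.0.tsv\n"
)

# Tab-separated files are plain CSV with a different delimiter
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(rows[0]["analysis_accession"])  # → MGYA00005084
```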

Usage as a python package

⚠️ Liable to change ⚠️

Whilst mg_toolkit is designed as a command-line tool, it is also a set of Python modules with helper classes that could be useful in your own Python scripts. These internal APIs and call signatures may change over time. See main() for default arguments.

Example

from mg_toolkit.metadata import OriginalMetadata

# fetch the per-sample metadata for the study
erp001736 = OriginalMetadata('ERP001736')
# write it to a CSV file, as the CLI does internally
erp001736.save_to_csv(erp001736.fetch_metadata())

Development setup

Install the package in edit mode, and additional dev requirements (pre-commit hooks and version bumper).

pip install -e . -r requirements-dev.txt
pre-commit install

You can bump the version with e.g. bump2version patch.

Contributors

Thanks goes to these wonderful people (emoji key):


Ola Tarkowska

💻📖

Maxim Scheremetjew

💻📖

Martin Beracochea

💻

Emil Hägglund

💻

Sandy Rogers

💻

This project follows the all-contributors specification. Contributions of any kind welcome!

Contact

If the documentation does not answer your questions, please contact us.

emg-toolkit's People

Contributors

atarkowska, caballero, emilhaegglund, mberacochea, sandyrogers, tomblaze


emg-toolkit's Issues

Bulk download issue with result pagination

I've been trying to download all functional analyses for a study (MGYS00000410). For some analyses which are accessible via the MGnify site, no files were being downloaded; one example is MGYA00005084. From the console logging, it doesn't look like the tool attempts to download files and fails; rather, it moves on after the request to the API without attempting any downloads.

For the analyses where no files were downloaded, I noticed that in the API results with the default page_size, items with the group-type 'Functional analysis' don't seem to appear until after the first page. I'm not sure that's the cause, though it held for the cases I checked.
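If the root cause is indeed pagination, the general fix is to keep following the API's next links until they run out, rather than processing only the first page. A minimal sketch (the page shape below mimics a JSON:API-style payload and is an assumption for illustration, not MGnify's exact schema):

```python
def iter_all_results(fetch_page):
    """Yield results from every page of a paginated API, not just the first.

    `fetch_page(url)` is assumed to return a dict shaped like
    {"data": [...], "links": {"next": url_or_None}} (JSON:API style).
    """
    url = "page-1"  # placeholder for the first-page URL
    while url is not None:
        page = fetch_page(url)
        yield from page["data"]
        url = page["links"]["next"]  # None on the last page

# toy stand-in for the API, to show the traversal
pages = {
    "page-1": {"data": [1, 2], "links": {"next": "page-2"}},
    "page-2": {"data": [3], "links": {"next": None}},
}
print(list(iter_all_results(pages.get)))  # → [1, 2, 3]
```

With this shape, result items that only appear on later pages (such as the 'Functional analysis' group above) are still reached.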

Add a "--resume" option for the bulk_downloader

MGnify has some very large studies, downloading those is problematic. With the current implementation if there is a network issue there is no way to restart the download process using the files already downloaded.

This feature will require (this is just a brain dump):

  • Store the tool progress status in a .sqlite db or a text file (pages and the download status for each page, how many pages...)
  • Add a "--resume" flag or sniff at the results folder before starting downloading data
  • Use the state to start downloading from that point
  • Check the downloaded files' checksums
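The skip-or-download decision sketched in the list above could look roughly like this (the function name and the use of MD5 are assumptions for illustration; the actual checksum mechanism would need verifying against what MGnify publishes):

```python
import hashlib
from pathlib import Path

def needs_download(path, expected_md5=None):
    """Resume-helper sketch: download only if the file is missing or,
    when a checksum is known, if the existing copy fails to match it."""
    path = Path(path)
    if not path.exists():
        return True
    if expected_md5 is None:
        return False  # file exists and there is nothing to verify against
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    return digest != expected_md5
```

A `--resume` flag could then route every planned download through a check like this before fetching, whether the state lives in a .sqlite db or is sniffed from the results folder.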

emg-toolkit as a python method

Hi,

Awesome work here!

I would like to know how I can use emg-toolkit as a python method, such as:

from mg_toolkit import original_metadata
original_metadata('ERP001736')

Sorry if I missed this in the documentation.

Best regards,
Matin N.

Overwrites already downloaded data.

My bulk_download stopped because of an HTTP Error 500.
urllib.error.HTTPError: HTTP Error 500: Internal Server Error

But when I restart the download with the same command, it overwrites the already existing data. The download had already taken 30 minutes and wasn't even halfway.

Is it possible that the code first checks whether there are already data from a failed download and skips these while downloading?

Command I used to download:
mg-toolkit bulk_download -a MGYS00001225 -g taxonomic_annotations

Versions:

  • python:3.6.7 conda-forge
  • mg-toolkit:0.6.4

Failed to download metadata

Hello,
I've been trying to download metadata for the project ERP005534 / PRJEB6070 (I tried both) but it fails to download any metadata. I waited for more than 10 minutes but nothing happened. I tried the -d option and I got this:

$ mg-toolkit -d original_metadata -a ERP005534
DEBUG: Accession ERP005534
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
DEBUG: https://www.ebi.ac.uk:443 "GET /ena/portal/api/search?result=read_run&query=study_accession%3DERP005534+OR+secondary_study_accession%3DERP005534&fields=run_accession%2Csecondary_sample_accession%2Csample_accession%2Cdepth&format=json HTTP/1.1" 200 None
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
DEBUG: https://www.ebi.ac.uk:443 "GET /ena/browser/api/xml/ERS433375 HTTP/1.1" 200 None
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
DEBUG: https://www.ebi.ac.uk:443 "GET /ena/browser/api/xml/ERS433376 HTTP/1.1" 200 None
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
DEBUG: https://www.ebi.ac.uk:443 "GET /ena/browser/api/xml/ERS433377 HTTP/1.1" 200 None
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
...

Then I tried the example in the README and also got the same behavior:

$ mg-toolkit -d original_metadata -a ERP001736
DEBUG: Accession ERP001736
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
DEBUG: https://www.ebi.ac.uk:443 "GET /ena/portal/api/search?result=read_run&query=study_accession%3DERP001736+OR+secondary_study_accession%3DERP001736&fields=run_accession%2Csecondary_sample_accession%2Csample_accession%2Cdepth&format=json HTTP/1.1" 200 None
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
DEBUG: https://www.ebi.ac.uk:443 "GET /ena/browser/api/xml/ERS478017 HTTP/1.1" 200 None
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
DEBUG: https://www.ebi.ac.uk:443 "GET /ena/browser/api/xml/ERS477998 HTTP/1.1" 200 None
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
DEBUG: https://www.ebi.ac.uk:443 "GET /ena/browser/api/xml/ERS477979 HTTP/1.1" 200 None
....

Does this just mean that there isn't metadata available for these projects (None)?

Regards

mg-toolkit for confidential MGnify study

Dear Ola,

I would like to try out the mg-toolkit for retrieving the metadata etc. for our team's study, ERP116156 (I am following the tutorials we learnt at EMBL metagenomics bioinformatics 2018), but I receive a traceback error (please see attached metadata_error.txt) when trying the command:

$ mg-toolkit original_metadata -a ERP116156

With the bulk download command the result is no retrieved files, but otherwise a standard-looking output (please see bulk.txt attached).

I suspect that this might be due to my study being listed as confidential at present (the command appears to work fine on a public study). If this is the case, might there be a workaround (aside from making it public just yet)?

Any advice would be most appreciated.

Best wishes,

James

P.S. I am not sure if this constitutes an issue with the package per se, and I apologize if this is not the forum for such a post (I am new to GitHub).

metadata_error.txt
bulk.txt

bulk download metadata

I see that the emg-toolkit can be used to download metadata for an individual study. Is there a way to bulk download metadata for metagenomes from all studies?

error running example

Trouble running the listed example:


$ conda create -n py3.6 python=3.6

$ conda activate py3.6

$ pip install -U mg-toolkit

$ mg-toolkit original_metadata -a ERP001736
Traceback (most recent call last):
  File "/anaconda2/envs/py3.6/bin/mg-toolkit", line 8, in <module>
    sys.exit(main())
  File "/anaconda2/envs/py3.6/lib/python3.6/site-packages/mg_toolkit/__init__.py", line 198, in main
    return getattr(mg_toolkit, args.tool)(args)
  File "/anaconda2/envs/py3.6/lib/python3.6/site-packages/mg_toolkit/metadata.py", line 46, in original_metadata
    om.save_to_csv(om.fetch_metadata())
  File "/anaconda2/envs/py3.6/lib/python3.6/site-packages/mg_toolkit/metadata.py", line 106, in fetch_metadata
    _meta = self.get_metadata(sample['sample_accession'])
  File "/anaconda2/envs/py3.6/lib/python3.6/site-packages/mg_toolkit/metadata.py", line 71, in get_metadata
    for m in x['ROOT']['SAMPLE']['SAMPLE_ATTRIBUTES']['SAMPLE_ATTRIBUTE']:
KeyError: 'ROOT'
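The KeyError comes from indexing nested keys that some ENA records lack. A defensive variant (a sketch of the bug class and one way around it, not the project's actual fix) returns an empty list instead of raising:

```python
def sample_attributes(parsed_xml):
    """Pull SAMPLE_ATTRIBUTE entries out of a parsed ENA XML document,
    tolerating records where any of the expected keys are missing."""
    sample = parsed_xml.get("ROOT", {}).get("SAMPLE", {})
    attrs = sample.get("SAMPLE_ATTRIBUTES", {}).get("SAMPLE_ATTRIBUTE", [])
    # XML-to-dict converters often return a bare dict when only one
    # attribute exists, so normalise to a list
    return attrs if isinstance(attrs, list) else [attrs]

print(sample_attributes({}))  # → []
```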

Data not being found

I'm having the following problem when requesting download of certain accession ids for bulk download:

  0%|          | 0/9 [00:00<?, ?it/s]
ERROR: HTTP Error 404: Not Found
Traceback (most recent call last):
  File "/nfs/sw/ebi-metagenomics/ebi-metagenomics-0.6.5/python/bin/mg-toolkit", line 8, in <module>
    sys.exit(main())
  File "/nfs/sw/ebi-metagenomics/ebi-metagenomics-0.6.5/python/lib/python3.8/site-packages/mg_toolkit/__init__.py", line 198, in main
    return getattr(mg_toolkit, args.tool)(args)
  File "/nfs/sw/ebi-metagenomics/ebi-metagenomics-0.6.5/python/lib/python3.8/site-packages/mg_toolkit/bulk_download.py", line 44, in bulk_download
    program.run()
  File "/nfs/sw/ebi-metagenomics/ebi-metagenomics-0.6.5/python/lib/python3.8/site-packages/mg_toolkit/bulk_download.py", line 213, in run
    num_results_processed = self._process_page(res, progress_bar)
  File "/nfs/sw/ebi-metagenomics/ebi-metagenomics-0.6.5/python/lib/python3.8/site-packages/mg_toolkit/bulk_download.py", line 253, in _process_page
    self.download_file(
  File "/nfs/sw/ebi-metagenomics/ebi-metagenomics-0.6.5/python/lib/python3.8/site-packages/mg_toolkit/bulk_download.py", line 163, in download_file
    BulkDownloader.download_resource_by_url(
  File "/nfs/sw/ebi-metagenomics/ebi-metagenomics-0.6.5/python/lib/python3.8/site-packages/mg_toolkit/bulk_download.py", line 125, in download_resource_by_url
    urlretrieve(url, output_file_name)
  File "/nfs/sw/python/python-3.8.3/lib/python3.8/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/nfs/sw/python/python-3.8.3/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/nfs/sw/python/python-3.8.3/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/nfs/sw/python/python-3.8.3/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/nfs/sw/python/python-3.8.3/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/nfs/sw/python/python-3.8.3/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/nfs/sw/python/python-3.8.3/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Bulk download failing to download large numbers of files

bulk_download.py currently doesn't process all available files due to relatively simple count-iteration bug / typo. I made a pull request #14 that I think should fix the issue.

Example of run that failed: mg-toolkit -d bulk_download -p 5.0 -a MGYS00002401 -g taxonomic_analysis_ssu_rrna

Stops after downloading 150 files, as num_results_processed is incremented cumulatively in each loop and 25 + 50 + 75 + 100 + 125 + 150 = 525, which is greater than the total number of files to download.
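The arithmetic in the report can be reproduced with a toy version of the loop (the variable name follows the issue; this is a sketch of the bug class, not the project's exact code):

```python
def total_processed_buggy(page_sizes):
    """Per-page counter is never reset, so the running total adds the
    cumulative count each iteration: 25, 50, 75, ... summing to 525."""
    total = 0
    num_results_processed = 0
    for size in page_sizes:
        num_results_processed += size  # cumulative, never reset per page
        total += num_results_processed
    return total

def total_processed_fixed(page_sizes):
    """Counting each page's results exactly once gives the true total."""
    total = 0
    for size in page_sizes:
        total += size
    return total

print(total_processed_buggy([25] * 6))  # → 525, overshooting the real total
print(total_processed_fixed([25] * 6))  # → 150
```

Because the buggy total races past the expected number of files, the loop's stop condition fires early and the remaining files are never downloaded.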

Can't download metadata, 500 error

I am trying to download study metadata and have a list of all the study secondary IDs in the form ERP###### or SRP######.
When I execute the following command to get data from study ERP001736, I see the following error.

mg-toolkit -d original_metadata -a ERP001736
DEBUG: Accession ERP001736
DEBUG: Starting new HTTPS connection (1): www.ebi.ac.uk:443
DEBUG: https://www.ebi.ac.uk:443 "GET /ena/portal/api/search?result=read_run&query=study_accession%3DERP001736+OR+secondary_study_accession%3DERP001736&fields=run_accession%2Csecondary_sample_accession%2Csample_accession%2Cdepth&format=json HTTP/1.1" 500 10633
ERROR: Error decoding ENA sample_metadata response for accession: ERP001736

I was able to execute the same command a few days ago and it seemed to work, generating a .csv file of useful metadata. I have replicated that workflow step by step and have no idea what the problem is now. Feel free to let me know if this is a temporary server-side issue, or if there is any other command I can try.
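If this is a transient server-side 500, one common workaround is retrying with backoff rather than failing on the first error. A minimal sketch (the helper name is hypothetical, not part of mg-toolkit):

```python
import time

def fetch_with_retries(fetch, attempts=3, delay=1.0):
    """Call `fetch()` up to `attempts` times, sleeping with exponential
    backoff between tries; transient 500s often succeed on a retry."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the original error
            time.sleep(delay * (2 ** attempt))
```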

Download metadata error

Hello,
I'm trying to use mg-toolkit (version 0.10.0) to fetch metadata from the large project PRJEB11419. After hours of execution I get the following error:
mg-toolkit_error

I have tried with other projects, and I have only been able to reproduce this error with this particular dataset.
Thanks in advance for any help that you can provide!
