Coder Social home page Coder Social logo

ncbi_tutorial's Introduction

ncbi_tutorial

How to use the NCBI ressources from the command line and not be lost in the NCBI huge documentation

Get data from NCBI

Sequencing data are usually in a SRA format and can be fetched using the run number (usually starting with SRR). If you have the run number, go directly to the downloading step. If not, you need to have something which refer to a NCBI archive (BioSample number, Experiment number, project number, ...). The best is to use whatever you have and to search for it at https://www.ncbi.nlm.nih.gov/Traces/study/. If you need to download all SRA files of one project, use the project number (starts with PRJNA), the website will lead you to a page with all the samples within the project and you can from there download the SraRunTable.txt file by clicking on 'RunInfo Table'. This file contains all informations about each sample, including the run number which allows you to download the data in sra format. You can also get only the run number by clicking on 'Accession List', giving you the SRR_Acc_List. You can directly use the later in a while loop to download your data.

The 3 important fields for the search and the downloading of the SraRunTable.txt and SRR_Acc_List.txt are highlighted in yellow.

Download the data

There are different ways to download a sra file, you need in all case sratoolkit (should work on Linux and Mac OS X)

Using prefetch

Get directly the SRA file uploaded locally (per defaul in /home/user/ncbi/public/sra). Check https://www.biostars.org/p/159950/ if you have trouble with disk space quota or go directly to the method using wget.

while read name; do
	prefetch $name
done < SRR_Acc_List.txt

You still need to convert sra into fastq format.

For single-end or paired-end data

for i in *sra; do
	fastq-dump --split-3 $i
done

Note that you can directly get compressed fastq files by adding the argument --gzip to the fastq-dump command.

For paired-end data, 2 fastq files will be generated, each having 1 mate.

Using fastq-dump

Get the fastq files directly but it goes anyway through a sra cache file in /home/user/ncbi/public/sra so it remains problematic if you have limited space in your home directory. In case your home directory is limited in space, a quick fix to change the location of the cache files is described in https://standage.github.io/that-darn-cache-configuring-the-sra-toolkit.html

For single-end data

while read name; do
	fastq-dump --split-spot $name
done < SRR_Acc_List.txt

*NB: If --split-spot argument is used for paired-end data, the sequences of the mate reads will be grouped together in the single output fastq file.

For paired-end data

while read name; do
	fastq-dump --split-files $name
done < SRR_Acc_List.txt

For paired-end data, 2 fastq files will be generated, each having 1 mate.

Using wget

In case the home disk space is a problem. Wget is a solution (from https://www.biostars.org/p/159950/). One can directly download a SRA file using this ftp path: /sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra

Assuming you deal with SRR files:

while read i; do
  first_6_chars=$(echo $i | cut -c1-6)
  accession="${i%.*}"
  wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/${first_6_chars}/${accession}/${accession}.sra
done < SRR_Acc_List.txt

Then, you need to convert your sra files into fastq files:

for i in *sra; do
	fastq-dump --split-3 $i
done

Authors

  • Johan Zicola - Initial work - johanzi

License

This project is licensed under the MIT License - see the LICENSE file for details

ncbi_tutorial's People

Contributors

johanzi avatar

Stargazers

 avatar Suhaib Shekfeh avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.