gbfetch

gbfetch makes it easy to download DNA sequences from GenBank directly into R.

You can provide a query string exactly as you would for a GenBank search, and it will return the corresponding sequences or metadata in a tidy format.

Installation

gbfetch is only available on GitHub, and must be installed with devtools or remotes.

devtools::install_github("joelnitta/gbfetch")

Dependencies

Most R packages gbfetch depends on will be automatically detected and installed from CRAN if needed. There are two exceptions.

jntools is only available on GitHub, and must be installed first.

devtools::install_github("joelnitta/jntools")

biofiles is not required to install gbfetch, but is a needed for assemble_gene_set(), so install it if you want to use this function. biofiles is only availble from BioConductor, and itself has dependencies that are only available from BioConductor.

BiocManager::install("biofiles")

GenBank API key

The fetch_sequences() and fetch_metadata() functions in this package use taxize and rentrez functions under the hood to fetch taxonomic data and GenBank sequences. Although not required, enabling the NCBI Entrez API key will increase the number of requests you are allowed to make per second. This should be set up using taxize::use_entrez() prior to running fetch_ functions or your query might get rejected and no sequences returned.

Large downloads

This package is meant for downloading moderate amounts of data, not whole swaths of GenBank. If you want to access large amounts of data, the restez or biomartr packages might be better options.

I am not aware of an upper limit on the number of sequences that can be downloaded, but I haven’t tried more than ca. 10,000 at a time (which takes just a few minutes with fetch_sequences()). fetch_metadata() takes significantly longer, because it uses taxize to get species names, which is rather slow. Obtaining the metadata for ca. 10,000 sequences might take up to an hour or so.

Trying to fetch too many sequenes might result in your IP address getting rejected by the API (at least temporarily), so use with caution!

Data format

Sequence data are stored using the DNAbin class from the ape package. If you work with BioConductor packages that use the DNAString class, they will need to be converted. I’m not aware of a as.DNAString() function that can do this, so I wrote one.

Similar work

rentrez (which gbfetch uses under the hood) has much more sophisticated capabilities for querying NCBI databases.

The ape package has a function, read.GenBank() that can read in a DNA sequence from GenBank given its accession number.

restez allows you to download entire chunks of GenBank for querying locally. This is almost certainly the way to go if you are interested in downloading lots of data.

biomartr is designed for downloading and working with whole genomes on GenBank.

phylotaR queries GenBank for a taxonomic group of interest and automatically assembles a set of orthologous loci for phylogenetic analysis.

PyPHLAWD is a Python pipeline for assembling a set of loci for phylogenetic analysis.

Examples

(See notes above about setting your API key before running these)

Fetch sequences and metadata

Download all rbcL sequences for Crepidomanes minutum. The query string is formatted exactly as if we were searching on GenBank.

Though you don’t need to do this in practice, I will use tictoc to give an idea of how long this query takes.

library(gbfetch)
library(tictoc) # Optional; for timing the example

query_string <- "Crepidomanes minutum[ORGN] AND rbcl[Gene]"

tic() # Set a timer

fetch_sequences(query_string)

## 99 DNA sequences in binary format stored in a list.
## 
## Mean sequence length: 1191.525 
##    Shortest sequence: 358 
##     Longest sequence: 1309 
## 
## Labels:
## KY099784
## EU338465
## EU122974
## EU122973
## EU122972
## AB574710
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.272 0.191 0.243 0.294 
## (Total: 117.96 kb)

toc() # See how long it took

## 2.193 sec elapsed

Download associated metadata for the sequences.

tic() # Set a timer

fetch_metadata(query_string)

## Registered S3 method overwritten by 'crul':
##   method                 from
##   as.character.form_file httr

## # A tibble: 99 x 8
##         gi accession taxid title       slen subtype     subname     species
##      <int> <chr>     <int> <chr>      <int> <chr>       <chr>       <chr>  
##  1  1.28e9 KY099784  32127 Crepidoma…  1309 isolate|sp… JNG0671|Ni… Crepid…
##  2  1.70e8 EU338465  32127 Crepidoma…  1238 specimen_v… J.Nitta041… Crepid…
##  3  1.62e8 EU122974  32127 Crepidoma…  1238 specimen_v… J.Nitta076… Crepid…
##  4  1.62e8 EU122973  32127 Crepidoma…  1238 specimen_v… J.Nitta028… Crepid…
##  5  1.62e8 EU122972  32127 Crepidoma…  1238 specimen_v… J.Nitta008… Crepid…
##  6  3.13e8 AB574710  32127 Crepidoma…  1205 specimen_v… TNS:762567… Crepid…
##  7  5.46e8 AB479140  32127 Crepidoma…  1206 specimen_v… personal:T… Crepid…
##  8  5.46e8 AB479139  32127 Crepidoma…  1206 specimen_v… personal:S… Crepid…
##  9  5.46e8 AB479138  32127 Crepidoma…  1206 specimen_v… personal:A… Crepid…
## 10  5.46e8 AB479137  32127 Crepidoma…  1206 specimen_v… personal:S… Crepid…
## # … with 89 more rows

toc()

## 25.967 sec elapsed

Assemble a set of genes from genomes

The number of whole or partial genomes in GenBank is increasing rapidly. Although fetch_sequnces() is useful for downloading small (e.g., single gene) sequences, we may also want to download multiple genes from a single genome or genomes. That is where fetch_gene_from_genome() comes in.

Let’s download three genes of interest from the Diplazium striatum plastome, which has GenBank accession number KY427346.

# KY427346 is the GenBank accession no. for the Diplazium striatum plastome

tic() # Set a timer

genes_to_get <- c("rbcL", "matK", "psbA")
fetch_gene_from_genome(genes_to_get, "KY427346")

## $rbcL
## 1 DNA sequence in binary format stored in a list.
## 
## Sequence length: 1428 
## 
## Label:
## KY427346-rbcL
## 
## Base composition:
##     a     c     g     t 
## 0.271 0.230 0.251 0.248 
## (Total: 1.43 kb)
## 
## $matK
## 1 DNA sequence in binary format stored in a list.
## 
## Sequence length: 1503 
## 
## Label:
## KY427346-matK
## 
## Base composition:
##     a     c     g     t 
## 0.310 0.187 0.192 0.311 
## (Total: 1.5 kb)
## 
## $psbA
## [1] "Multiple genes detected"

toc()

## 7.626 sec elapsed

Scaling up, assemble_gene_set() can assemble multiple genes of interest from multiple genomes into a list. Let’s get the same three genes for Cystopteris protrusa (GenBank accession no. KP136830) and Diplazium striatum.

# KY427346 is the GenBank accession no. for the Diplazium striatum plastome

tic() # Set a timer

assemble_gene_set(
  accessions = c("KP136830", "KY427346"), 
  genes = genes_to_get)

## [1] "These genes were duplicated within at least one accession and dropped: psbA"

## $rbcL
## 2 DNA sequences in binary format stored in a list.
## 
## All sequences of same length: 1428 
## 
## Labels:
## KP136830-rbcL
## KY427346-rbcL
## 
## Base composition:
##     a     c     g     t 
## 0.270 0.227 0.252 0.251 
## (Total: 2.86 kb)
## 
## $matK
## 2 DNA sequences in binary format stored in a list.
## 
## Mean sequence length: 1626 
##    Shortest sequence: 1503 
##     Longest sequence: 1749 
## 
## Labels:
## KP136830-matK
## KY427346-matK
## 
## Base composition:
##     a     c     g     t 
## 0.310 0.183 0.194 0.313 
## (Total: 3.25 kb)

toc()

## 6.044 sec elapsed

joelnitta / gbfetch Goto Github PK

gbfetch's Introduction

gbfetch

Installation

Dependencies

GenBank API key

Large downloads

Data format

Similar work

Examples

Fetch sequences and metadata

Assemble a set of genes from genomes

gbfetch's People

Contributors

Stargazers

Watchers

Forkers

gbfetch's Issues

Recommend Projects

Recommend Topics

Recommend Org