Coder Social home page Coder Social logo

seqseek's Introduction

SeqSeek Build Status

Easy access to human reference genome sequences.

This package calls open(file).seek(range) on FASTA files of ASCII characters to provide ranges of sequence strings. It is exactly as fast as your disk, for better or worse.

Requirements

  • Python 2.7+

Install

pip

$ pip install seqseek

Download Utilities

$ download_build_37 
$ download_build_38 

These commands check to see which chromosomes need to be downloaded, fetch any missing files, remove newline characters, and run build-specific integrity tests. The sequence files are downloaded from our Amazon S3 bucket which contains FASTA-formatted sequence files obtained from NCBI's nucleotide database (e.g. NC_000001.11).

Test Utilities

$ test_build_37
$ test_build_38

These commands run build specific tests to ensure the chromosome files have been downloaded correctly. These tests read sequences from each chromosome file and compare the extracted sequence with sequences pulled from https://genome.ucsc.edu.

Using the seqseek package

from seqseek import Chromosome

Import the chromosome class from the seqseek package.

Chromosome(17).sequence(141224, 141244) #=> TTTCCTGAGAGTTCCAGTGA

The command above will return a string of 20 nucleotides found between interbase positions 141224-141244 on chromosome 17. SeqSeek currently defaults to build 37 to match the coordinates used by the 23andMe website and raw data downloads.


from seqseek import Chromosome, BUILD37, BUILD38
Chromosome(17, assembly=BUILD38).sequence(141224, 141244) #=> ACCTGGTGAGGGGACATGGG

You can explicitly specify either build 37 or build 38 using the BUILD37 and BUILD38 constants and the assembly keyword argument.


Chromosome('NC_000017.11'').sequence(141224, 141244)  #=> ACCTGGTGAGGGGACATGGG

You can also load a chromosome directly by an accession name instead of specifying both the common name and the genome assembly.

The Mitochondria

The mitochondria is a circular piece of DNA and it is sometimes useful to retrieve sequences that extend beyond the min or max coordinates of the contig and loop back to the beginning or end. This is mainly useful for pulling flanking sequences for designing oligonucleotide probes near the extreme 3' and 5' regions of the mitochondria but there may be other applications as well.

We never return sequences that are longer than the length of the contig. Attempts to load such a sequence raise a TooManyLoops exception

This behavior can be requested by passing loop=True when loading the mitochondria by name. These two invocations return the same sequence:

Chromosome('MT', loop=True).sequence(-5, 5)         # negative start coordinate  
Chromosome('MT', loop=True).sequence(16564, 16574)  # out of bounds end coordinate

SeqSeek uses the revised Cambridge Reference Sequence (rCRS) for the mitochondria on both build 37 and 38. If you need access to the out-of-date RSRS sequence for backward-compatibility then you may load it directly by accession (NC_001807.4).

The rCRS mitochondria sequence contains an 'N' base at position 3106-3107 to preserve legacy nucleotide numbering. This can be useful for using legacy coordinates but but is impractical when working with sequences that are expected to align to observed human mitochondrial sequences. SeqSeek removes this N:

Chromosome('MT').sequence(3106, 3107)  # => ''
Chromosome('MT').sequence(3106, 3108)  # => 'T'

Supported chromosome names and accessions

SeqSeek uses the following common chromosome names: 1, 2, ..., 22, X, Y, and MT.

The full list of supported accessions is as follows:

  • NC_000001.10
  • NC_000001.11
  • NC_000002.11
  • NC_000002.12
  • NC_000003.11
  • NC_000003.12
  • NC_000004.11
  • NC_000004.12
  • NC_000005.9
  • NC_000005.10
  • NC_000006.11
  • NC_000006.12
  • NC_000007.13
  • NC_000007.14
  • NC_000008.10
  • NC_000008.11
  • NC_000009.11
  • NC_000009.12
  • NC_000010.10
  • NC_000010.11
  • NC_000011.9
  • NC_000011.10
  • NC_000012.11
  • NC_000012.12
  • NC_000013.10
  • NC_000013.11
  • NC_000014.8
  • NC_000014.9
  • NC_000015.9
  • NC_000015.10
  • NC_000016.9
  • NC_000016.10
  • NC_000017.10
  • NC_000017.11
  • NC_000018.9
  • NC_000018.10
  • NC_000019.9
  • NC_000019.10
  • NC_000020.10
  • NC_000020.11
  • NC_000021.8
  • NC_000021.9
  • NC_000022.10
  • NC_000022.11
  • NC_000023.10
  • NC_000023.11
  • NC_000024.10
  • NC_000024.9
  • NC_001807.4
  • NC_012920.1
  • NT_113891.2
  • NT_167244.1
  • NT_167245.1
  • NT_167246.1
  • NT_167247.1
  • NT_167248.1
  • NT_167249.1
  • NT_167250.1
  • NT_167251.1

seqseek's People

Contributors

matstrand avatar jon-elofson avatar

Watchers

Jesslyn Janssen avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.