Coder Social home page Coder Social logo

varcode's Introduction

Tests Coverage Status PyPI PyPI downloads

Varcode

Varcode is a library for working with genomic variant data in Python and predicting the impact of those variants on protein sequences.

Installation

You can install varcode using pip:

pip install varcode

You can install required reference genome data through PyEnsembl as follows:

# Downloads and installs the Ensembl releases (75 and 76)
pyensembl install --release 75 76

Example

import varcode

# Load TCGA MAF containing variants from their
variants = varcode.load_maf("tcga-ovarian-cancer-variants.maf")

print(variants)
### <VariantCollection from 'tcga-ovarian-cancer-variants.maf' with 6428 elements>
###  -- Variant(contig=1, start=69538, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=881892, ref=T, alt=G, genome=GRCh37)
###  -- Variant(contig=1, start=3389714, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=3624325, ref=G, alt=T, genome=GRCh37)
###  ...

# you can index into a VariantCollection and get back a Variant object
variant = variants[0]

# groupby_gene_name returns a dictionary whose keys are gene names
# and whose values are themselves VariantCollections
gene_groups = variants.groupby_gene_name()

# get variants which affect the TP53 gene
TP53_variants = gene_groups["TP53"]

# predict protein coding effect of every TP53 variant on
# each transcript of the TP53 gene
TP53_effects = TP53_variants.effects()

print(TP53_effects)
### <EffectCollection with 789 elements>
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R342*)
### -- ThreePrimeUTR(variant=chr17 g.7574003G>A, transcript_name=TP53-005, transcript_id=ENST00000420246)
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-002, transcript_id=ENST00000445888, effect_description=p.R342*)
### -- FrameShift(variant=chr17 g.7574030_7574030delG, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R333fs)
### ...

premature_stop_effect = TP53_effects[0]

print(str(premature_stop_effect.mutant_protein_sequence))
### 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMF'

print(premature_stop_effect.aa_mutation_start_offset)
### 341

print(premature_stop_effect.transcript)
### Transcript(id=ENST00000269305, name=TP53-001, gene_name=TP53, biotype=protein_coding, location=17:7571720-7590856)

print(premature_stop_effect.gene.name)
### 'TP53'

If you are looking for a quick start guide, you can check out this iPython book that demonstrates simple use cases of Varcode

Effect Types

Effect type Description
AlternateStartCodon Replace annotated start codon with alternative start codon (e.g. "ATG>CAG").
ComplexSubstitution Insertion and deletion of multiple amino acids.
Deletion Coding mutation which causes deletion of amino acid(s).
ExonLoss Deletion of entire exon, significantly disrupts protein.
ExonicSpliceSite Mutation at the beginning or end of an exon, may affect splicing.
FivePrimeUTR Variant affects 5' untranslated region before start codon.
FrameShiftTruncation A frameshift which leads immediately to a stop codon (no novel amino acids created).
FrameShift Out-of-frame insertion or deletion of nucleotides, causes novel protein sequence and often premature stop codon.
IncompleteTranscript Can't determine effect since transcript annotation is incomplete (often missing either the start or stop codon).
Insertion Coding mutation which causes insertion of amino acid(s).
Intergenic Occurs outside of any annotated gene.
Intragenic Within the annotated boundaries of a gene but not in a region that's transcribed into pre-mRNA.
IntronicSpliceSite Mutation near the beginning or end of an intron but less likely to affect splicing than donor/acceptor mutations.
Intronic Variant occurs between exons and is unlikely to affect splicing.
NoncodingTranscript Transcript doesn't code for a protein.
PrematureStop Insertion of stop codon, truncates protein.
Silent Mutation in coding sequence which does not change the amino acid sequence of the translated protein.
SpliceAcceptor Mutation in the last two nucleotides of an intron, likely to affect splicing.
SpliceDonor Mutation in the first two nucleotides of an intron, likely to affect splicing.
StartLoss Mutation causes loss of start codon, likely result is that an alternate start codon will be used down-stream (possibly in a different frame).
StopLoss Loss of stop codon, causes extension of protein by translation of nucleotides from 3' UTR.
Substitution Coding mutation which causes simple substitution of one amino acid for another.
ThreePrimeUTR Variant affects 3' untranslated region after stop codon of mRNA.

Coordinate System

Varcode currently uses a "base counted, one start" genomic coordinate system, to match the Ensembl annotation database. We are planning to switch over to "space counted, zero start" (interbase) coordinates, since that system allows for more uniform logic (no special cases for insertions). To learn more about genomic coordinate systems, read this blog post.

varcode's People

Contributors

arahuja avatar armish avatar e5c avatar iskandr avatar jburos avatar jlkravitz avatar julia326 avatar leekaiinthesky avatar tavinathanson avatar timodonnell avatar tuomastik avatar y9c avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

varcode's Issues

varcode fails when it sees strange sequences in a vcf

I've been working with 1000 genomes vcfs and they have the following unusual alternate alleles scattered throughout, which understandably anger varcode:

22 51179178 BI_GS_DEL1_B4_P2897_43 A 100 PASS
22 51068654 ALU_umary_ALU_12538 G INS:ME:ALU 100 PASS

gives:
AttributeError: '_SV' object has no attribute 'sequence'

I removed these lines to get it working (~1000 in chr22) but it would be nice if varcode just skipped these rather than dying.

Do FrameShift (or StopGain) mutations affect splicing?

Naively, a frameshift can be fully translated into an amino acid sequence using the original exon boundaries of a transcript. However, since an early stop codon may trigger alternative splicing (http://www.sciencedirect.com/science/article/pii/S0968000408001436) we might not be able to rely on the given exon boundaries. Similar problem for StopGain mutations (but without the extra complexity of producing a novel amino acid sequence from a shifted ORF).

What to do with mutations that span the 5' UTR / CDS boundary?

What happens if we have a multi-nucleotide substitution which starts in the 5' UTR and bleeds over into the coding sequence? We've lost our start codon but might have another nearby...should we try to translate this using whatever strategy works for a StartLoss mutation? Do we categorize it differently since it also affected (at least) some of the Kozak sequence? What if in addition to changing the start codon it also changes some of the downstream codons?

vcf.load_vcf should provide an option to load all variants, regardless of whether filter is PASS

This is on the variants_know_about_reference branch. My vcf has thousands of variants but varcode loads only 567 of them.

>> print("varcode says: %d" %
      len(varcode.vcf.load_vcf("/Users/tim/Box Sync/pt189/variants/strelka-somatic-final-illumina-exome/all.somatic.snvs.vcf")))
>> ! wc -l /Users/tim/Box\ Sync/pt189/variants/strelka-somatic-final-illumina-exome/all.somatic.snvs.vcf

varcode says: 567
   19929 /Users/tim/Box Sync/pt189/variants/strelka-somatic-final-illumina-exome/all.somatic.snvs.vcf

Add unit tests from Golden Helix blog post

One of the authors of VarSeq wrote a blog post evaluating other annotation tools on a set of interesting edge case variants. We should figure out the ideal annotation in Varcode for each case and add them as test cases.

An argument for using == and not >= for requirements?

~/drive/work/repos/cancer/varcode $ pip install varcode
Requirement already satisfied (use --upgrade to upgrade): varcode in /Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7 in /Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages (from varcode)
Requirement already satisfied (use --upgrade to upgrade): pandas>=0.13.1 in /Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages (from varcode)
Downloading/unpacking pyensembl>=0.5.11 (from varcode)
  Downloading pyensembl-0.6.1.tar.gz (53kB): 53kB downloaded

So, even though varcode specifies 0.5.11, it downloads 0.6.1, which isn't actually compatible with varcode.

Namely:

/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/varcode/variant.pyc in reference_name(self)
    128     @property
    129     def reference_name(self):
--> 130         return self.ensembl.reference.reference_name
    131 
    132     def __str__(self):

AttributeError: 'EnsemblRelease' object has no attribute 'reference'

AssertionError: aa_ref and aa_alt can't both be empty string

v = Variant(contig=1, start=56962223, ref='C', alt='T', ensembl=ensembl)
v.effects()

throws that error.

  1. I don't know if that AssertionError is that useful from a user point of view, it's complaining about function parameters fairly deep in the code.
  2. I believe the error occurs much earlier, but I can't figure out a lot of the logic in in_frame_coding_effect

in_frame_coding_effect is called with

ref= G
alt= A
cds_offset= 935
sequence_from_start_codon= ATGCAAAACTACAAGTACGACAAAGCGATCGTCCCGGAGAGCAAGAACGGCGGCAGCCCGGCGCTCAACAACAACCCGAGGAGGAGCGGCAGCAAGCGGGTGCTGCTCATCTGCCTCGACCTCTTCTGCCTCTTCATGGCGGGCCTCCCCTTCCTCATCATCGAGACAAGCACCATCAAGCCTTACCACCGAGGGTTTTACTGCAATGATGAGAGCATCAAGTACCCACTGAAAACTGGTGAGACAATAAATGACGCTGTGCTCTGTGCCGTGGGGATCGTCATTGCCATCCTCGCGATCATCACGGGGGAATTCTACCGGATCTATTACCTGAAGAAGTCGCGGTCGACGATTCAGAACCCCTACGTGGCAGCACTCTATAAGCAAGTGGGCTGCTTCCTCTTTGGCTGTGCCATCAGCCAGTCTTTCACAGACATTGCCAAAGTGTCCATAGGGCGCCTGCGTCCTCACTTCTTGAGTGTCTGCAACCCTGATTTCAGCCAGATCAACTGCTCTGAAGGCTACATTCAGAACTACAGATGCAGAGGTGATGACAGCAAAGTCCAGGAAGCCAGGAAGTCCTTCTTCTCTGGCCATGCCTCCTTCTCCATGTACACTATGCTGTATTTGGTGCTATACCTGCAGGCCCGCTTCACTTGGCGAGGAGCCCGCCTGCTCCGGCCCCTCCTGCAGTTCACCTTGATCATGATGGCCTTCTACACGGGACTGTCTCGCGTATCAGACCACAAGCACCATCCCAGTGATGTTCTGGCAGGATTTGCTCAAGGAGCCCTGGTGGCCTGCTGCATAGTTTTCTTCGTGTCTGACCTCTTCAAGACTAAGACGACGCTCTCCCTGCCTGCCCCTGCTATCCGGAAGGAAATCCTTTCACCTGTGGACATTATTGACAGGAACAATCACCACAACATGATGTAGGTGCCACCCACCTCCTGAGCTGTTTTTGTAAAATGACTGCTGACAGCAAGTTCTTGCTGCTCTCCAATCTCATCAGACAGTAGAATGTAGGGAAAAACTTTTGCCCGACTGATTTTTAAAAAGGAAAAAAAAAATGTTTTACTATGTGGCCTTCCAAAATAGGTAGTGTTTGCCTATGTGGAAACAACAGCAAACTAACACCAAGTGCCAGAGTCCTGGATGTGCAATTGGTTTAAGTGTTCATGTTCTAGTGAACACAGCTTGTTCAGGAACAAACACTAAAACGATTTGAGTAGAGGCTGCTGGTCACCTTTTGTGACCTGAGGAATCCCAGGCCTGTGAGAAAAGCAAAAATTCACATTGCAGCACATGATGCCAGAAATAGCACTGAATCAAGAAAATAGCCATTGCGGAGCTGCCCTCTTGAGTCTTTCTGTCCATCCCATTCTATTCTGTACTGTGACTTTAGTTCAGGAAGTTTTGTTTTGTGTTTTAAATAAAAGGAAAGAGCAAGTTTGCTCAGTCAAGTGATCAGATCCCGAATCTAGATTCCCAGTTCTAAGGCCTTATCACCTCCCCTGCCCATAGGCCAACAACCATAGTTCCTCACATTAGTGATTAGCAGACTCTTTGTGGACGAGTGAATTTCACAAACAGCACAATTTCAGAAGAAATCGAGGGCACAGGCCCTCACTTTTCTTCTTTTGACGCATCATCCTGTGATGTTGAAATGTCAATTGCAGGATGCTGATGTTGTGCACGTCAATCACCGGGCACTTGCATACTCTTAGAAACAGTTCGACTTCGGTTACTGCCTTCTCCCTTGAAATCCTTGCTGCGTGCCCACCAGGATTTCCTGTGAGGGCCCAGGAATGAGCAAGGCATGGTCTGCCACCAGCTGACGGAAAGCAGCCTTCTGTACAACAGATGGGAGGGTGAAGGGGGCAGAATGAAAATCGAACCAACCTTTTAGCTGTTGCAAATCAGAAGGAGCCAGAGAAGCAGGCAGTCTCATGCATGAGAGGTTACCCTTCAGGATGACAGAGCTGAGGGTCTTTGTAGGAGTTGCTCTTGCTGTGTAAAGCACTATTGTCTTGGGGTTGAGCCCTAGGGCAGTTCTTGGTAGGTTCTGCTGGGCAGAACATATGGGTTAAATCTCGGTAGAGAGTTTCCCTCATCCTCTATCCGTAAGTGTCCTTCCATGCAAGGTCCCACTCTAGGTGATAGACAGGGACCCCTTCTACTGAACCTTTGAGGAAAGGAGGAAGGAAGAAATGCGTTTAGATCTTGGATGCAGACCTTTCAAAGGGTTAAATGTAACCATATGGATCAACCACATGCACATCCTTACTACAGAATCCGTCCTTTCATTTCAACTTATAGCAAGCTATGATTTTTATATATAAATATTATATAAATAATGTATAAAACATTAAAAGTTAACTATGTAAGATATTATTTCTGAAACAATTTAGCTATATCCACTATGATTATAAACTGTGTCTCGACCTGTGTTATTTACATTAGCTGCTTAAAAAAGCATTGAGTTAATTTTTTTAAATATCAACTAAAATATCATAGTTCTGTGGTAGACATTGTTTTATAATGAAATAACTGCAACTAGAGAAAACTGTATAAAAACATTAAATTGTCAGTATTTTTGTAAGGTTCCATTTTGTAAAGAGAATAATATTCAAAGACTTTTGTAGCATACAAAGTGAAAACTTGTATCTGCGAAACTATACTTGTATTAAATGTGCTTTTTAAATAAAAGCTCGTAACACAACTAATTAAGGACTTGCAA
transcript= Transcript(id=ENST00000371250, name=PPAP2B-001, gene_name=PPAP2B, location=1:56960419-57045241)
variant= Variant(contig=1, start=56962223, ref=C, alt=T, genome=GRCh37)

On L249:

 original_protein_subsequence = transcript.protein_sequence[
        first_ref_codon_index:last_ref_codon_index + 1]

is called as

 original_protein_subsequence = transcript.protein_sequence[
        311:311 + 1]

but transcript.protein_sequence[311] gives an IndexError, but the slice gives an empty string, so from that point forward original_protein_subsequence is an empty string.

Varcode version 0.3.10 cannot be imported when installed through pip

Looks like the current released version, for some reason, has a problem preventing it to be imported properly. Here is my setup that makes use of a clean virtualenv:

[arman@mac varcode_pg]$ pip list
distribute (0.7.3)
pip (7.0.3)
setuptools (17.1.1)
virtualenv (13.0.3)
[arman@mac varcode_pg]$ virtualenv venv_varcode
New python executable in venv_varcode/bin/python2.7
Also creating executable in venv_varcode/bin/python
Installing setuptools, pip, wheel...done.
[arman@mac varcode_pg]$ source venv_varcode/bin/activate
(venv_varcode)[arman@mac varcode_pg]$ pip install varcode
Collecting varcode
Collecting biopython (from varcode)
  Using cached biopython-1.65.tar.gz
Collecting pandas>=0.13.1 (from varcode)
  Using cached pandas-0.16.2-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting pysam>=0.8.3 (from varcode)
Collecting numpy<2.0,>=1.7 (from varcode)
  Using cached numpy-1.9.2-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting pyensembl<0.7.0,>=0.6.5 (from varcode)
Collecting memoized-property (from varcode)
Collecting pyvcf (from varcode)
Collecting pytz>=2011k (from pandas>=0.13.1->varcode)
  Using cached pytz-2015.4-py2.py3-none-any.whl
Collecting python-dateutil (from pandas>=0.13.1->varcode)
  Using cached python_dateutil-2.4.2-py2.py3-none-any.whl
Collecting datacache>=0.4.14 (from pyensembl<0.7.0,>=0.6.5->varcode)
Collecting mock>=1.0.1 (from pyensembl<0.7.0,>=0.6.5->varcode)
Collecting typechecks>=0.0.2 (from pyensembl<0.7.0,>=0.6.5->varcode)
Collecting distribute (from pyvcf->varcode)
Collecting six>=1.5 (from python-dateutil->pandas>=0.13.1->varcode)
  Using cached six-1.9.0-py2.py3-none-any.whl
Collecting requests>=2.5.1 (from datacache>=0.4.14->pyensembl<0.7.0,>=0.6.5->varcode)
  Using cached requests-2.7.0-py2.py3-none-any.whl
Collecting appdirs>=1.4.0 (from datacache>=0.4.14->pyensembl<0.7.0,>=0.6.5->varcode)
  Using cached appdirs-1.4.0-py2.py3-none-any.whl
Collecting progressbar33>=2.4 (from datacache>=0.4.14->pyensembl<0.7.0,>=0.6.5->varcode)
Requirement already satisfied (use --upgrade to upgrade): setuptools>=0.7 in ./venv_varcode/lib/python2.7/site-packages (from distribute->pyvcf->varcode)
Building wheels for collected packages: biopython
  Running setup.py bdist_wheel for biopython
  Complete output from command /Users/pinarman/Desktop/varcode_pg/venv_varcode/bin/python2.7 -c "import setuptools;__file__='/private/var/folders/vg/yw71g5v13153zr7g558fcllw0000gn/T/pip-build-1ufnEI/biopython/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /var/folders/vg/yw71g5v13153zr7g558fcllw0000gn/T/tmpcR4tm7pip-wheel-:
  running bdist_wheel
  running build
  running build_py

  Numerical Python (NumPy) is not installed.

  This package is required for many Biopython features.  Please install
  it before you install Biopython. You can install Biopython anyway, but
  anything dependent on NumPy will not work. If you do this, and later
  install NumPy, you should then re-install Biopython.

  You can find NumPy at http://www.numpy.org


  ----------------------------------------
  Failed building wheel for biopython
Failed to build biopython
Installing collected packages: biopython, pytz, six, python-dateutil, numpy, pandas, pysam, memoized-property, requests, appdirs, typechecks, progressbar33, datacache, mock, pyensembl, distribute, pyvcf, varcode
  Running setup.py install for biopython
Successfully installed appdirs-1.4.0 biopython-1.65 datacache-0.4.14 distribute-0.7.3 memoized-property-1.0.2 mock-1.0.1 numpy-1.9.2 pandas-0.16.2 progressbar33-2.4 pyensembl-0.6.9 pysam-0.8.3 python-dateutil-2.4.2 pytz-2015.4 pyvcf-0.6.7 requests-2.7.0 six-1.9.0 typechecks-0.0.2 varcode-0.3.10
(venv_varcode)[arman@mac varcode_pg]$ python
Python 2.7.10 (default, Jun 10 2015, 19:42:47) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import varcode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named varcode
>>> 

but when you clone the latest code from the repository and install it locally, it works fine:

(venv_varcode)[arman@mac varcode_pg]$ cd src/varcode/
.git/             CONTRIBUTING.md   README.md         dist/             test/             varcode.egg-info/ 
.gitignore        LICENSE           build/            setup.py          varcode/          
(venv_varcode)[arman@mac varcode_pg]$ cd src/varcode/
(venv_varcode)[arman@mac varcode]$ python setup.py 
(venv_varcode)[arman@mac varcode]$ git pull
Already up-to-date.
(venv_varcode)[arman@mac varcode]$ python setup.py install
No module named pypandoc
Failed to convert README.md from Markdown to reStructuredText
running install
running bdist_egg
running egg_info
writing requirements to varcode.egg-info/requires.txt
writing varcode.egg-info/PKG-INFO
writing top-level names to varcode.egg-info/top_level.txt
writing dependency_links to varcode.egg-info/dependency_links.txt
reading manifest file 'varcode.egg-info/SOURCES.txt'
writing manifest file 'varcode.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.10-x86_64/egg
running install_lib
running build_py
creating build/bdist.macosx-10.10-x86_64/egg
creating build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/__init__.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/common.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/data.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/read_evidence_performance.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_collection_base_class.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_collection_filtering.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_common.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_cosmic_mutations.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_dbnsfp_validation.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_effect_classes.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_maf.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_mutate.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_no_duplicate_variants.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_problematic_variants.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_read_evidence.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_string_helpers.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_timings.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_variant.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_variant_collection.py -> build/bdist.macosx-10.10-x86_64/egg/test
copying build/lib/test/test_vcf.py -> build/bdist.macosx-10.10-x86_64/egg/test
creating build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/__init__.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/coding_effect.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/collection.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/common.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/cufflinks.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/effect_collection.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/effect_helpers.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/effect_ordering.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/effects.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/frameshift_coding_effect.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/in_frame_coding_effect.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/locus.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/maf.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/mutate.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/nucleotides.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
creating build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence
copying build/lib/varcode/read_evidence/__init__.py -> build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence
copying build/lib/varcode/read_evidence/pileup.py -> build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence
copying build/lib/varcode/read_evidence/pileup_collection.py -> build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence
copying build/lib/varcode/read_evidence/pileup_element.py -> build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence
copying build/lib/varcode/read_evidence/util.py -> build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence
copying build/lib/varcode/reference_name.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/string_helpers.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/transcript_helpers.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/translate.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/util.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/variant.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/variant_collection.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
copying build/lib/varcode/vcf.py -> build/bdist.macosx-10.10-x86_64/egg/varcode
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/common.py to common.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/data.py to data.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/read_evidence_performance.py to read_evidence_performance.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_collection_base_class.py to test_collection_base_class.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_collection_filtering.py to test_collection_filtering.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_common.py to test_common.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_cosmic_mutations.py to test_cosmic_mutations.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_dbnsfp_validation.py to test_dbnsfp_validation.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_effect_classes.py to test_effect_classes.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_maf.py to test_maf.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_mutate.py to test_mutate.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_no_duplicate_variants.py to test_no_duplicate_variants.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_problematic_variants.py to test_problematic_variants.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_read_evidence.py to test_read_evidence.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_string_helpers.py to test_string_helpers.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_timings.py to test_timings.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_variant.py to test_variant.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_variant_collection.py to test_variant_collection.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/test/test_vcf.py to test_vcf.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/coding_effect.py to coding_effect.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/collection.py to collection.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/common.py to common.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/cufflinks.py to cufflinks.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/effect_collection.py to effect_collection.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/effect_helpers.py to effect_helpers.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/effect_ordering.py to effect_ordering.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/effects.py to effects.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/frameshift_coding_effect.py to frameshift_coding_effect.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/in_frame_coding_effect.py to in_frame_coding_effect.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/locus.py to locus.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/maf.py to maf.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/mutate.py to mutate.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/nucleotides.py to nucleotides.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence/pileup.py to pileup.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence/pileup_collection.py to pileup_collection.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence/pileup_element.py to pileup_element.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/read_evidence/util.py to util.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/reference_name.py to reference_name.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/string_helpers.py to string_helpers.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/transcript_helpers.py to transcript_helpers.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/translate.py to translate.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/util.py to util.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/variant.py to variant.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/variant_collection.py to variant_collection.pyc
byte-compiling build/bdist.macosx-10.10-x86_64/egg/varcode/vcf.py to vcf.pyc
creating build/bdist.macosx-10.10-x86_64/egg/EGG-INFO
copying varcode.egg-info/PKG-INFO -> build/bdist.macosx-10.10-x86_64/egg/EGG-INFO
copying varcode.egg-info/SOURCES.txt -> build/bdist.macosx-10.10-x86_64/egg/EGG-INFO
copying varcode.egg-info/dependency_links.txt -> build/bdist.macosx-10.10-x86_64/egg/EGG-INFO
copying varcode.egg-info/requires.txt -> build/bdist.macosx-10.10-x86_64/egg/EGG-INFO
copying varcode.egg-info/top_level.txt -> build/bdist.macosx-10.10-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
test.__init__: module references __file__
test.data: module references __file__
creating 'dist/varcode-0.3.11-py2.7.egg' and adding 'build/bdist.macosx-10.10-x86_64/egg' to it
removing 'build/bdist.macosx-10.10-x86_64/egg' (and everything under it)
Processing varcode-0.3.11-py2.7.egg
creating /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages/varcode-0.3.11-py2.7.egg
Extracting varcode-0.3.11-py2.7.egg to /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Adding varcode 0.3.11 to easy-install.pth file

Installed /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages/varcode-0.3.11-py2.7.egg
Processing dependencies for varcode==0.3.11
Searching for pysam==0.8.3
Best match: pysam 0.8.3
Adding pysam 0.8.3 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for memoized-property==1.0.2
Best match: memoized-property 1.0.2
Adding memoized-property 1.0.2 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for PyVCF==0.6.7
Best match: PyVCF 0.6.7
Adding PyVCF 0.6.7 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for biopython==1.65
Best match: biopython 1.65
Adding biopython 1.65 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for pyensembl==0.6.9
Best match: pyensembl 0.6.9
Adding pyensembl 0.6.9 to easy-install.pth file
Installing pyensembl script to /Users/pinarman/Desktop/varcode_pg/venv_varcode/bin

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for pandas==0.16.2
Best match: pandas 0.16.2
Adding pandas 0.16.2 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for numpy==1.9.2
Best match: numpy 1.9.2
Adding numpy 1.9.2 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for distribute==0.7.3
Best match: distribute 0.7.3
Adding distribute 0.7.3 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for typechecks==0.0.2
Best match: typechecks 0.0.2
Adding typechecks 0.0.2 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for mock==1.0.1
Best match: mock 1.0.1
Adding mock 1.0.1 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for datacache==0.4.14
Best match: datacache 0.4.14
Adding datacache 0.4.14 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for python-dateutil==2.4.2
Best match: python-dateutil 2.4.2
Adding python-dateutil 2.4.2 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for pytz==2015.4
Best match: pytz 2015.4
Adding pytz 2015.4 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for setuptools==17.0
Best match: setuptools 17.0
Adding setuptools 17.0 to easy-install.pth file
Installing easy_install-3.4 script to /Users/pinarman/Desktop/varcode_pg/venv_varcode/bin
Installing easy_install script to /Users/pinarman/Desktop/varcode_pg/venv_varcode/bin

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for progressbar33==2.4
Best match: progressbar33 2.4
Adding progressbar33 2.4 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for appdirs==1.4.0
Best match: appdirs 1.4.0
Adding appdirs 1.4.0 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for requests==2.7.0
Best match: requests 2.7.0
Adding requests 2.7.0 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Searching for six==1.9.0
Best match: six 1.9.0
Adding six 1.9.0 to easy-install.pth file

Using /Users/pinarman/Desktop/varcode_pg/venv_varcode/lib/python2.7/site-packages
Finished processing dependencies for varcode==0.3.11
(venv_varcode)[arman@mac varcode]$ python
Python 2.7.10 (default, Jun 10 2015, 19:42:47) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import varcode
>>> 

Any plans for a public release of v0.3.11?

KeyError: 'reference' in load_vcf

varcode.vcf.load_vcf gives an error:

     61     if not ensembl_release:
     62         reference_path = vcf_reader.metadata[reference_path_field]
---> 63         reference_name = infer_reference_name(reference_path)
     64         ensembl_release = ensembl_release_number_for_reference_name(
     65                 reference_name)

KeyError: 'reference'

This occurs when ensembl_release is not specified and the VCF does not list the reference. Maybe its easier to require the reference or ensembl release?

If not, this can be wrapped in a more informative error as well

Let me know if you are interested in me adding either change.

Add unit tests for "Splicing analysis of 14 BRCA1 missense variants" by Ahlborn et al.

Results from a validated mini-gene splicing assay indicated that nine BRCA1 variants resulted in splicing aberrations leading to truncated transcripts and thus can be considered pathogenic (c.4987A>T/p.Met1663Leu, c.4988T>A/p.Met1663Lys, c.5072C>T/p.Thr1691Ile, c.5074G>C/p.Asp1692His, c.5074G>A/p.Asp1692Asn, c.5074G>T/p.Asp1692Tyr, c.5332G>A/p.Asp1778Asn, c.5332G>T/p.Asp1778Tyr, and c.5408G>C/p.Gly1803Ala), whereas five BRCA1 variants had no effect on splicing (c.4985T>C/p.Phe1662Ser, c.5072C>A/p.Thr1691Lys, c.5153G>C/p.Trp1718Ser, c.5154G>T/p.Trp1718Cys, and c.5333A>G/p.Asp1778Gly). Eight of the variants having an effect on splicing (c.4987A>T/p.Met1663Leu, c.4988T>A/p.Met1663Lys, c.5074G>C/p.Asp1692His, c.5074G>A/p.Asp1692Asn, c.5074G>T/p.Asp1692Tyr, c.5332G>A/p.Asp1778Asn, c.5332G>T/p.Asp1778Tyr, and c.5408G>C/p.Gly1803Ala) were previously determined to have no or an uncertain effect on the protein level, whereas one variant (c.5072C>T/p.Thr1691Ile) were shown to have a strong effect on the protein level as well.

http://link.springer.com/article/10.1007/s10549-015-3313-7

handle single-sample VCFs with INFO fields containing list values of size > 1

VariantCollection (in the flatten_info_dictionary function) assumes that all INFO fields are lists of size 1. However, GATK HaplotypeCaller, running on a single sample, appears to output VCFs that have INFO values with > 1 elements. For example, the "MLEAC" INFO element is per-allele, not per-sample: "Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed".

Currently, when loading a HaplotypeCaller VCF, we get a assertion error in flatten_info_dictionary. Probably the fix is to not attempt to flatten the dictionary.

Incorrect handling of variants which run past the beginning/end of an exon's boundary

Exon ENSE00002258257 of EVI5-201 runs from 92,979,093 to 92,974,260 on the reverse strand of chr1. A deletion which starts before the end of the exon causes an exception from spliced_offsets in PyEnsembl. Need to check if start_pos, end_pos in core_logic overlap any exon, not just the start/end of the whole transcript.

WARNING:root:Encountered error annotating Variant(contig=1, pos=92979092, ref=ATATATATATATATATATATATATATATATATG, alt=A) for Transcript(id=ENST00000540033, name=EVI5-201, gene_name=EVI5): Couldn't find position 92979124 on any exon of ENST00000540033

support determining the evidence for a variant in a bam

This is basically what I'm working on now, but I thought I'd file a ticket since I'm not sure whether people will think it belongs in varcode.

I need functionality for taking a varcode Variant and a BAM (e.g. a pysam alignment object), and giving me some summary of what reads support / don't support that variant in the BAM. Note that my BAM might be a different BAM than what the original variant caller was run on.

Currently I'm imagining a varcode module called something like VariantEvidence that has a routine for taking a Variant and a pysam alignment object and gives me back some info (yet to be determined exactly what). I'd be fine to put this in a different project though if people have other suggestions.

Longer indels in random variants

As of #45 util.random_variants will only generate insertions of length 1 or 2, and deletions of length 1. Would be good to random choose a length up to e.g. 1..10

Make `priority` a computed property of effects

Currently all the effects have to statically sorted by their class in EffectOrdering, which leads to an artificial proliferation of overlapping effect classes like Substitution, Insertion, Deletion, ComplexSubstitution. Really, all of those should be specific run-time instantiations of a single Substitution effect. To give single residue substitutions a lower priority then e.g. deletions, we should have a dynamic effect priority like:

def priority(self):
    flags = [
        self.changes_protein_length(),
        self.predictable_coding_effect()   
        self.overlaps_transcript()
        self.overlaps_gene(),
    ]
    total = 0
    for i, flag in enumerate(flags):
        total += 2 ** i * flag
    return total

investigate porting read evidence module to use impala

CC'ing @ihodes and @ryan-williams

My pysam-based varcode.read_evidence module seems to cap out at making pileups for about 100 loci / second. This is true for both NFS access and local disk access. My current guess is that pysam is rate limited by stepping through all the reads that don't match my locus of interest, but do fall within the same index bucket as my locus. If that's true, then this is essentially the speed limit for random access into BAM files.

I would be very interested to see if it's possible to get impala to generate pileups much faster than 100 pileups / sec. If it is, it would be great to port my read_evidence module to support an impala backend.

Reference amino acid sequence sometimes empty for coding variants

Encountered the following errors while processing a large VCF file:

WARNING:root:Encountered error annotating Variant(contig=1, pos=145296372, ref=A, alt=G) for Transcript(id=ENST00000448873, name=NBPF10-002, gene_name=NBPF10): len(aa_ref) = 0 for variant Variant(contig=1, pos=145296372, ref=A, alt=G) on transcript Transcript(id=ENST00000448873, name=NBPF10-002, gene_name=NBPF10) (aa_pos=63:63)

WARNING:root:Encountered error annotating Variant(contig=1, pos=145296372, ref=A, alt=G) for Transcript(id=ENST00000464433, name=NBPF10-005, gene_name=NBPF10): len(aa_ref) = 0 for variant Variant(contig=1, pos=145296372, ref=A, alt=G) on transcript Transcript(id=ENST00000464433, name=NBPF10-005, gene_name=NBPF10) (aa_pos=63:63)

WARNING:root:Encountered error annotating Variant(contig=1, pos=167385324, ref=TAA, alt=T) for Transcript(id=ENST00000420254, name=POU2F1-201, gene_name=POU2F1): len(aa_ref) = 0 for variant Variant(contig=1, pos=167385324, ref=TAA, alt=T) on transcript Transcript(id=ENST00000420254, name=POU2F1-201, gene_name=POU2F1) (aa_pos=653:653)

Add unit tests for known BRCA1 exonic splice site mutations from Wappenschmidt et. al.

We investigated 30 distinct BRCA1 variants, both intronic and exonic, regarding their spliceogenic potential by commonly used in silico prediction algorithms (HSF, MaxEntScan) along with in vitro transcript analyses. A total of 25 variants were identified spliceogenic, either causing/enhancing exon skipping or activation of cryptic splice sites, or both. Except from a single intronic variant causing minor effects on BRCA1 pre-mRNA processing in our analyses, 23 out of 24 intronic variants were correctly predicted by MaxEntScan, while HSF was less accurate in this cohort. Among the 6 exonic variants analyzed, 4 severely impair correct pre-mRNA processing, while the remaining two have partial effects. In contrast to the intronic alterations investigated, only half of the spliceogenic exonic variants were correctly predicted by HSF and/or MaxEntScan.

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0050800

add a memoized "highest_priority_effect" property to Variant

I'm often working with the highest priority effect of a variant, which is:

  • somewhat verbose to access: variant.effects().highest_priority_effect
  • does not appear to be cached, and is noticeably slow even when only working with a few hundred variants

support filtering a VariantCollection according to an intervals list file

For some of our analyses it would be helpful to be able to filter a VariantCollection to only those variants that fall within the intended capture targets.

Example intervals list:

[odonnt02@minerva4 ~]$ head /sc/orga/projects/ngs/resources/captures/2.3/Human_All_Exon_V5.hg19.interval_list
@HD     VN:1.4  SO:unsorted
@SQ     SN:chrM LN:16571        UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ     SN:chr1 LN:249250621    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ     SN:chr2 LN:243199373    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ     SN:chr3 LN:198022430    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:641e4338fa8d52a5b781bd2a2c08d3c3
@SQ     SN:chr4 LN:191154276    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:23dccd106897542ad87d2765d28a19a1
@SQ     SN:chr5 LN:180915260    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:0740173db9ffd264d728f32784845cd7
@SQ     SN:chr6 LN:171115067    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:1d3a93a248d92a729ee764823acbbc6b
@SQ     SN:chr7 LN:159138663    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:618366e953d6aaad97dbe4777c29375e
@SQ     SN:chr8 LN:146364022    UR:file:/gs01/projects/ngs/resources/gatk/2.3/ucsc.hg19.parmasked.fasta M5:96f514a9929e410c6651697bded59aec

GATK also supports a few other formats (probably not needed here though): https://www.broadinstitute.org/gatk/guide/article?id=1319

assertion error in infer_coding_effect

I can update this later with better debug info if it's hard to reproduce.

This is coming from one of these VCFs on box:

[u'/demeter/users/odonnt02/box//pt189/variants/germline-haplotypecaller-no-interval/output.raw.snps.indels.vcf',
 u'/demeter/users/odonnt02/box//pt189/variants/strelka-somatic-primary-illumina-exome/all.somatic.snvs.vcf',
 u'/demeter/users/odonnt02/box//pt189/variants/mutect-somatic-primary-illumina-exome/PT189_Left_Ovary_Exome.PT189_Left_Ovary_Exome.mutect.vcf',
 u'/demeter/users/odonnt02/box//pt189/variants/strelka-somatic-final-illumina-exome/all.somatic.snvs.vcf',
 u'/demeter/users/odonnt02/box//pt189/variants/mutect-somatic-final-illumina-exome/PT189_6_14_Exome.PT189_6_14_Exome.mutect.vcf']

I suspect it's the germline one (the first in that list).

I'm getting the following error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-89-0b949991dffb> in <module>()
     12 variants_pass = variant_to_resources.keys()
     13 variants_pass = [v for v in variants_pass if not v.info["FILTER"]]
---> 14 variants_pass = [v for (v,r) in variant_to_resources.items() if len(v.ref) == 1 and len(v.alt) == 1 and v.effects(only_coding_transcripts=True)]
     15 print("%d -> %d" % (len(variant_to_resources), len(variants_pass)))
     16 variants_pass = [v for (v,r) in variant_to_resources.items() if r != ['vcf_germline_haplotypecaller'] or any(v.gene_names() in interesting_genes)]

/demeter/users/odonnt02/sinai/git/varcode/varcode/variant.pyc in effects(self, only_coding_transcripts, raise_on_error)
    212             for transcript in transcripts:
    213                 try:
--> 214                     effects.append(self.transcript_effect(transcript))
    215                 except (AssertionError, ValueError) as error:
    216                     if raise_on_error:

/demeter/users/odonnt02/sinai/git/varcode/varcode/variant.pyc in transcript_effect(self, transcript)
    266 
    267         # TODO: exonic splice site mutations
--> 268         return self._exonic_transcript_effect(transcript)
    269 
    270 

/demeter/users/odonnt02/sinai/git/varcode/varcode/variant.pyc in _exonic_transcript_effect(self, transcript)
    384             cds_offset,
    385             variant=self,
--> 386             transcript=transcript)
    387 

/demeter/users/odonnt02/sinai/git/varcode/varcode/coding_effect.pyc in infer_coding_effect(ref, alt, cds_offset, transcript, variant)
    175         assert len(aa_ref) > 0, \
    176             "len(aa_ref) = 0 for variant %s on transcript %s (aa_pos=%d:%d)" % (
--> 177                 variant, transcript, aa_pos, last_aa_ref_pos)
    178 
    179     # is this a premature stop codon?

AssertionError: len(aa_ref) = 0 for variant Variant(contig=1, pos=145296372, ref=A, alt=G, genome=GRCh37) on transcript Transcript(id=ENST00000448873, name=NBPF10-002, gene_name=NBPF10) (aa_pos=63:63)

modifies_coding_sequence is always false

Is

    @property
    def modifies_coding_sequence(self):
        """It's convenient to have a property which tells us:
            1) is this a variant effect overlapping a transcript?
            2) does that transcript have a coding sequence?
            3) does the variant affect the coding sequence?
        """
        return False

ever overridden? I can't find it, so it seems like it is always false?

Example - substitution should return True, correct?

print effect
print effect.modifies_coding_sequence

output

Substitution(variant=chr1 g.69532C>T, transcript=OR4F5-001, effect_description=p.H148Y)
False

Too many open files on error on getting top effect

In [91]:
len(variant_collection)
Out[91]:
82655
vc_effect = [v.top_effect() for v in variant_collection]

gives

Traceback (most recent call last):
  File "/Users/arahuja/anaconda/lib/python2.7/site-packages/IPython/core/ultratb.py", line 970, in get_records
  File "/Users/arahuja/anaconda/lib/python2.7/site-packages/IPython/core/ultratb.py", line 233, in wrapped
  File "/Users/arahuja/anaconda/lib/python2.7/site-packages/IPython/core/ultratb.py", line 267, in _fixed_getinnerframes
  File "/Users/arahuja/anaconda/lib/python2.7/inspect.py", line 1044, in getinnerframes
  File "/Users/arahuja/anaconda/lib/python2.7/inspect.py", line 1004, in getframeinfo
  File "/Users/arahuja/anaconda/lib/python2.7/inspect.py", line 454, in getsourcefile
  File "/Users/arahuja/anaconda/lib/python2.7/inspect.py", line 483, in getmodule
  File "/Users/arahuja/anaconda/lib/python2.7/inspect.py", line 467, in getabsfile
  File "/Users/arahuja/anaconda/lib/python2.7/posixpath.py", line 371, in abspath
OSError: [Errno 24] Too many open files
INFO:root:Cached file Homo_sapiens.GRCh37.75.gtf from URL ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
INFO:root:Cached file Homo_sapiens.GRCh37.75.cdna.all.fa from URL ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz

messages printed ~2000 times.

I'm pretty sure this is due to how I constructed my VariantCollection

variant_collection = VariantCollection([Variant(contig = row['Chrom'], 
                                                start=row['Pos'], 
                                                ref=row['Ref'],
                                                alt = row['Alt'].split(',')[0],
                                                ensembl=EnsemblRelease(75))
                                            for i, row in snv_data.iterrows()])

vs

ensembl = EnsemblRelease(75)
variant_collection = VariantCollection([Variant(contig = row['Chrom'], 
                                                start=row['Pos'], 
                                                ref=row['Ref'],
                                                alt = row['Alt'].split(',')[0],
                                                ensembl=ensembl)
                                            for i, row in snv_data.iterrows()])

This is mostly just user error, but perhaps there is a way to make this simpler.

AttributeError: 'FrameShiftTruncation' object has no attribute 'aa_alt'

Issue with multiple inheritance of FrameShiftTruncation, where mutant_protein_sequence uses the definition in PrematureStop?

/Users/tavi/drive/work/repos/cancer/topiary/topiary/mutant_epitope_predictor.pyc in predict(self, variant_collection, mhc_alleles, gene_expression_dict, min_gene_expresion_value, transcript_expression_dict, min_transcript_expresion_value, raise_on_variant_effect_error)
    145         fasta_dict = create_fasta_dict(
    146             effects=variant_effects.values(),
--> 147             padding_around_mutation=self.padding_around_mutation)
    148         return mhc_model.predict(fasta_dict)

/Users/tavi/drive/work/repos/cancer/topiary/topiary/mutant_epitope_predictor.pyc in create_fasta_dict(effects, padding_around_mutation)
     20         # TODO: will mhctools take an object key?
     21         key = effect.variant
---> 22         seq = effect.mutant_protein_sequence
     23         # some effects will lack a mutant protein sequence since
     24         # they are either silent or unpredictable

/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/memoized_property.pyc in fget_memoized(self)
     38     def fget_memoized(self):
     39         if not hasattr(self, attr_name):
---> 40             setattr(self, attr_name, fget(self))
     41         return getattr(self, attr_name)
     42 

/Users/tavi/drive/work/repos/cancer/varcode/varcode/effects.py in mutant_protein_sequence(self)
    546     def mutant_protein_sequence(self):
    547         prefix = self.original_protein_sequence[:self.aa_mutation_start_offset]
--> 548         return prefix + self.aa_alt
    549 
    550 

AttributeError: 'FrameShiftTruncation' object has no attribute 'aa_alt'

compare variants that use different references

Not sure if this is really possible. It would great to be able to compare a variant (or locus perhaps) aligned against hg18 with one aligned with grch37 and have the result be True if the variants are in logically the same place in the genome.

Just hit an issue now where I'm realizing the TCGA MAF files use all sorts of different references so considering only genomic position for the variants in them results in a set of variants that appear different but are really referring to the same event. Having some way to work coherently with a set of variants aligned to different references would be really useful.

support deep reloading varcode module

It's a small but constant annoyance to not be able to deep reload the varcode module (e.g. in a long running IPython session), due some issue with pyfaidx. Instead, when modifying varcode, I have to restart the IPython kernel with each update.

Perhaps there is some way to isolate the faidx module in a way that allows the varcode module to be reloaded.

>> dreload(varcode)

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-10-9d47e937d853> in <module>()

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in reload(module, exclude)
    339     try:
    340         with replace_import_hook(deep_import_hook):
--> 341             return deep_reload_hook(module)
    342     finally:
    343         found_now = {}

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in deep_reload_hook(m)
    309 
    310     try:
--> 311         newm = imp.load_module(name, fp, filename, stuff)
    312     except:
    313          # load_module probably removed name from modules because of

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_module(name, file, filename, details)
    233         raise ValueError(msg)
    234     elif type_ == PY_SOURCE:
--> 235         return load_source(name, filename, file)
    236     elif type_ == PY_COMPILED:
    237         return load_compiled(name, filename, file)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_source(name, pathname, file)
    167     methods = _SpecMethods(spec)
    168     if name in sys.modules:
--> 169         module = methods.exec(sys.modules[name])
    170     else:
    171         module = methods.load()

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec_module(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

/Users/tim/sinai/git/varcode/varcode/variant_collection.py in <module>()
     19 from .effects import Substitution
     20 from .effect_ordering import effect_priority, transcript_effect_priority_dict
---> 21 from .variant import Variant
     22 from . import type_checks
     23 

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in deep_import_hook(name, globals, locals, fromlist, level)
    250     parent, buf = get_parent(globals, level)
    251 
--> 252     head, name, buf = load_next(parent, None if level < 0 else parent, name, buf)
    253 
    254     tail = head

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in load_next(mod, altmod, name, buf)
    154     buf += subname
    155 
--> 156     result = import_submodule(mod, subname, buf)
    157     if result is None and mod != altmod:
    158         result = import_submodule(altmod, subname, subname)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in import_submodule(mod, subname, fullname)
    199 
    200         try:
--> 201             m = imp.load_module(fullname, fp, filename, stuff)
    202         except:
    203             # load_module probably removed name from modules because of

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_module(name, file, filename, details)
    233         raise ValueError(msg)
    234     elif type_ == PY_SOURCE:
--> 235         return load_source(name, filename, file)
    236     elif type_ == PY_COMPILED:
    237         return load_compiled(name, filename, file)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_source(name, pathname, file)
    167     methods = _SpecMethods(spec)
    168     if name in sys.modules:
--> 169         module = methods.exec(sys.modules[name])
    170     else:
    171         module = methods.load()

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec_module(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

/Users/tim/sinai/git/varcode/varcode/variant.py in <module>()
     16 
     17 from . import type_checks
---> 18 from .coding_effect import infer_coding_effect
     19 from .common import group_by
     20 from .nucleotides import normalize_nucleotide_string

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in deep_import_hook(name, globals, locals, fromlist, level)
    250     parent, buf = get_parent(globals, level)
    251 
--> 252     head, name, buf = load_next(parent, None if level < 0 else parent, name, buf)
    253 
    254     tail = head

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in load_next(mod, altmod, name, buf)
    154     buf += subname
    155 
--> 156     result = import_submodule(mod, subname, buf)
    157     if result is None and mod != altmod:
    158         result = import_submodule(altmod, subname, subname)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in import_submodule(mod, subname, fullname)
    199 
    200         try:
--> 201             m = imp.load_module(fullname, fp, filename, stuff)
    202         except:
    203             # load_module probably removed name from modules because of

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_module(name, file, filename, details)
    233         raise ValueError(msg)
    234     elif type_ == PY_SOURCE:
--> 235         return load_source(name, filename, file)
    236     elif type_ == PY_COMPILED:
    237         return load_compiled(name, filename, file)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_source(name, pathname, file)
    167     methods = _SpecMethods(spec)
    168     if name in sys.modules:
--> 169         module = methods.exec(sys.modules[name])
    170     else:
    171         module = methods.load()

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec_module(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

/Users/tim/sinai/git/varcode/varcode/coding_effect.py in <module>()
     28     FrameShiftTruncation
     29 )
---> 30 from .string_helpers import trim_shared_flanking_strings
     31 
     32 from Bio.Seq import Seq, CodonTable

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in deep_import_hook(name, globals, locals, fromlist, level)
    250     parent, buf = get_parent(globals, level)
    251 
--> 252     head, name, buf = load_next(parent, None if level < 0 else parent, name, buf)
    253 
    254     tail = head

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in load_next(mod, altmod, name, buf)
    154     buf += subname
    155 
--> 156     result = import_submodule(mod, subname, buf)
    157     if result is None and mod != altmod:
    158         result = import_submodule(altmod, subname, subname)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in import_submodule(mod, subname, fullname)
    199 
    200         try:
--> 201             m = imp.load_module(fullname, fp, filename, stuff)
    202         except:
    203             # load_module probably removed name from modules because of

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_module(name, file, filename, details)
    233         raise ValueError(msg)
    234     elif type_ == PY_SOURCE:
--> 235         return load_source(name, filename, file)
    236     elif type_ == PY_COMPILED:
    237         return load_compiled(name, filename, file)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_source(name, pathname, file)
    167     methods = _SpecMethods(spec)
    168     if name in sys.modules:
--> 169         module = methods.exec(sys.modules[name])
    170     else:
    171         module = methods.load()

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec_module(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

/Users/tim/sinai/git/varcode/varcode/string_helpers.py in <module>()
     15 from __future__ import print_function, division, absolute_import
     16 
---> 17 import pyfaidx
     18 
     19 def reverse_complement(x):

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in deep_import_hook(name, globals, locals, fromlist, level)
    250     parent, buf = get_parent(globals, level)
    251 
--> 252     head, name, buf = load_next(parent, None if level < 0 else parent, name, buf)
    253 
    254     tail = head

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in load_next(mod, altmod, name, buf)
    154     buf += subname
    155 
--> 156     result = import_submodule(mod, subname, buf)
    157     if result is None and mod != altmod:
    158         result = import_submodule(altmod, subname, subname)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in import_submodule(mod, subname, fullname)
    199 
    200         try:
--> 201             m = imp.load_module(fullname, fp, filename, stuff)
    202         except:
    203             # load_module probably removed name from modules because of

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_module(name, file, filename, details)
    243             return load_dynamic(name, filename, file)
    244     elif type_ == PKG_DIRECTORY:
--> 245         return load_package(name, filename)
    246     elif type_ == C_BUILTIN:
    247         return init_builtin(name)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/imp.py in load_package(name, path)
    213     methods = _SpecMethods(spec)
    214     if name in sys.modules:
--> 215         return methods.exec(sys.modules[name])
    216     else:
    217         return methods.load()

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _exec(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in exec_module(self, module)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/pyfaidx/__init__.py in <module>()
     10 import warnings
     11 from six import PY2, PY3, string_types
---> 12 from six.moves import zip_longest
     13 try:
     14     from collections import OrderedDict

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in deep_import_hook(name, globals, locals, fromlist, level)
    254     tail = head
    255     while name:
--> 256         tail, name, buf = load_next(tail, tail, name, buf)
    257 
    258     # If tail is None, both get_parent and load_next found

/Users/tim/sinai/git/datacache/venv/lib/python3.4/site-packages/IPython/lib/deepreload.py in load_next(mod, altmod, name, buf)
    161 
    162     if result is None:
--> 163         raise ImportError("No module named %.200s" % name)
    164 
    165     return result, next, buf

ImportError: No module named moves

empty variant collection when loading strelka vcf

Loading the below strelka shared strelka VCF, I seem to get an empty variant collection back.

>> vc = varcode.VariantCollection("/Users/tim/Box Sync/pt189/variants/strelka-somatic-primary-illumina-exome/passed.somatic.snvs.vcf")
>> len(vc)
0

Add unit tests for alternate translation start site for neurexin-1β (Camacho-Garcia et. al.)

Recently, two point mutations altering the translation initiation site of NRXN1β (c.−3G>T and c.3G>T) have been described in patients with autism and mental retardation. In this study, we analyzed the NRXN1β gene in a sample of 153 patients with autism. We report the identification of a novel mutation, c.3G>A (p.Met1), affecting the translation initiation site. Expression analysis showed that the c.3G>A mutation switches the translation start site of NRXN1β to an in-frame downstream methionine and decreases synaptic levels of the mutant protein in cultured neurons. These data reinforce a role for synaptic defects of NRXN1β in neurodevelopmental disorders.

http://journals.lww.com/psychgenetics/Abstract/2013/12000/Rare_variants_analysis_of_neurexin_1__in_autism.8.aspx

Issue with n_skip?

I'm trying to debug this variant:

/Users/tavi/drive/work/repos/cancer/varcode/varcode/coding_effect.py in coding_effect(ref, alt, transcript_offset, variant, transcript)
    125             sequence_from_start_codon=sequence_from_start_codon,
    126             variant=variant,
--> 127             transcript=transcript)

/Users/tavi/drive/work/repos/cancer/varcode/varcode/frameshift_coding_effect.py in frameshift_coding_effect(ref, alt, cds_offset, sequence_from_start_codon, variant, transcript)
    216         sequence_from_mutated_codon=sequence_from_mutated_codon,
    217         variant=variant,
--> 218         transcript=transcript)

/Users/tavi/drive/work/repos/cancer/varcode/varcode/frameshift_coding_effect.py in _frameshift(mutated_codon_index, sequence_from_mutated_codon, variant, transcript)
     82         print "hello"
     83         import pdb; pdb.set_trace()
---> 84     aa_ref = original_protein_sequence[aa_pos]
     85 
     86     # TODO: what if all the shifted amino acids were the same and the protein

/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/Bio/Seq.pyc in __getitem__(self, index)
    231         if isinstance(index, int):
    232             # Return a single letter as a string
--> 233             return self._data[index]
    234         else:
    235             # Return the (sub)sequence as another Seq object

IndexError: string index out of range

Variant(contig=11, start=63676705, ref=T, alt=., genome=GRCh37)

Issue: aa_pos is 779, but the length of original_protein_sequence is also 779.

More details:

cds_offset = 2335
sequence_from_start_codon = length 4122
mutated_codon_index = cds_offset / 3 = 778
n_skip = 1
sequence_after_mutated_codon = length 1788

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.