tariqdaouda / pygeno

Personalized Genomics and Proteomics. Main diet: Ensembl, side dishes: SNPs

Home Page: http://pygeno.iric.ca

License: Apache License 2.0

Python 92.70% Makefile 3.60% Batchfile 3.70%

bioinformatics biology genomics proteomics genome genome-annotation genome-browser genome-sequencing genomes medical

pygeno's Issues

GenomicLink (edges) does not work

Linking between different objects is not happening.

Translate mitochondrial chromosome with Vertebrate Mitochondrial Code

Current sequence:
("gene:Gene, name: MT-ND1, id: ENSG00000198888, strand: '+' > Chromosome: number MT > <Raba obj: ('Genome_Raba', 0.11207103328668244), raba_id: 1>", '\n')
('seq:IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYITAPTLALTIALLLTPLPIPNPLVNLNLGLLFILATSSLAVYSILSGASNSNYALIGALRAVAQTISYEVTLAIILLSTLLISGSFNLSTLITTQEHLLLLPSPLAIIFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLIRTAYPRFRYDQLIHLLKNFLPLTLALLI*YVSIPITISSIPPQT', '\n')

Expected sequence:
MPMANLLLLIVPILIAMAFLMLTERKILGYMQLRKGPNVVGPYGLLQPFADAMKLFTKEP
LKPATSTITLYITAPTLALTIALLLWTPLPMPNPLVNLNLGLLFILATSSLAVYSILWSG
WASNSNYALIGALRAVAQTISYEVTLAIILLSTLLMSGSFNLSTLITTQEHLWLLLPSWP
LAMMWFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFMAEYTNIIMMNTLTTT
IFLGTTYDALSPELYTTYFVTKTLLLTSLFLWIRTAYPRFRYDQLMHLLWKNFLPLTLAL
LMWYVSMPITISSIPPQT
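The differences between the two sequences above (I vs. M, and a stop vs. W) are what one would expect if the standard genetic code is applied where the vertebrate mitochondrial code is needed. The following is a minimal sketch of that difference, independent of pyGeno: the tables follow NCBI translation tables 1 and 2, while the `translate` helper and the toy sequence are hypothetical and only serve the illustration.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 1 (standard code), codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(codon) for codon in product(BASES, repeat=3)]
STANDARD = dict(zip(CODONS, STANDARD_AAS))

# NCBI translation table 2 (vertebrate mitochondrial) differs in four codons.
VERTEBRATE_MITO = dict(STANDARD)
VERTEBRATE_MITO.update({"ATA": "M", "TGA": "W", "AGA": "*", "AGG": "*"})


def translate(dna, table):
    """Translate complete codons of `dna` in frame 1 using `table`."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


toy = "ATACCCATGTGA"  # hypothetical toy sequence, not copied from MT-ND1
print(translate(toy, STANDARD))         # IPM*
print(translate(toy, VERTEBRATE_MITO))  # MPMW
```

Selecting the table per chromosome (for example, table 2 when the chromosome is MT) would be one way to support this; how pyGeno picks its translation table is not shown in this issue.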

Cannot install pyGeno successfully

I tried to install pyGeno with Python 2.7 on Ubuntu 16.04.10, but the installation does not succeed. The error message is as follows:
Traceback (most recent call last):
File "", line 1, in
File "/home/dongl/.local/lib/python2.7/site-packages/pyGeno/__init__.py", line 3, in
from .configuration import pyGeno_init
File "/home/dongl/.local/lib/python2.7/site-packages/pyGeno/configuration.py", line 3, in
import rabaDB.rabaSetup
File "/home/dongl/.local/lib/python2.7/site-packages/rabaDB/rabaSetup.py", line 24
class RabaConfiguration(object, metaclass=RabaNameSpaceSingleton) :
^
SyntaxError: invalid syntax

How can I solve this problem? Please help me.
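The caret in the traceback points at `metaclass=` inside a class statement. That keyword form is Python 3 syntax, which a Python 2.7 interpreter cannot parse, so the installed rabaDB appears to be a Python 3 release. A small sketch that checks whether the running interpreter accepts this syntax (the helper name and the snippet are hypothetical):

```python
# Python 3 style class statement, similar in shape to the line in the traceback.
PY3_ONLY_CLASS = (
    "class Meta(type):\n"
    "    pass\n"
    "class Configured(object, metaclass=Meta):\n"
    "    pass\n"
)


def parses_on_this_interpreter(source):
    """Return True if `source` compiles under the interpreter running this code."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# True on Python 3; a Python 2.7 interpreter reports False here.
print(parses_on_this_interpreter(PY3_ONLY_CLASS))
```

If this prints False, the likely options are running pyGeno under Python 3 or installing a rabaDB release that still supports Python 2; which rabaDB versions those are is not stated in this issue.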

mutant (SNV/indel) protein sequence generation

Hi Tariq,

Thank you for helping me on loading genome yesterday.

I further tested pyGeno on generating a mutant protein sequence.

I uploaded two ERBB2 variants with known amino-acid change annotations (ENST00000445658:c.T1888G:p.W630G and ENST00000445658:exon16:c.A1879G:p.S627G) and two EGFR indels (ENST00000455089:c.2100_2114del:p.700_705del and ENST00000455089:c.2161_2162insTGGCCAGCG:p.M721delinsMASV).

However, the mutant protein sequence generated by pyGeno is exactly the same as the corresponding reference protein sequence. It seems the filter doesn't work. Could you point out what is wrong with my pyGeno commands? (The test_var file is enclosed as an attachment.) Also, is there any way to handle both indels and SNPs efficiently in one filter?

Thank you.

Best

Hao

from pyGeno.importation.Genomes import *
from pyGeno.importation.SNPs import *
from pyGeno.Genome import *
from pyGeno.Transcript import Transcript
from pyGeno.SNPFiltering import SNPFilter
from pyGeno.SNPFiltering import SequenceSNP
from pyGeno.SNPFiltering import SequenceInsert
from pyGeno.SNPFiltering import SequenceDel

importSNPs('/test_snp_path/test_var')

class QMax_gt_filter(SNPFilter) :
    def __init__(self, threshold) :
        self.threshold = threshold

    def filter(self, chromosome, test_var = None) :
        if test_var.quality > self.threshold :
            #other possibilities of return are SequenceInsert(), SequenceDel()
            if test_var.alt[0] == '-':
                return SequenceDel(len(test_var.ref))
            if test_var.ref[0] == '-':
                return SequenceInsert(test_var.alt)
            elif test_var.alt[0] != '-' and test_var.ref[0] != '-':
                return SequenceSNP(test_var.alt)
        return None

mut_G = Genome(name = 'GRCh37.75', SNPs = 'test_var', SNPFilter = QMax_gt_filter(8))
mut_trans = mut_G.get(Transcript, id ='ENST00000445658')
mut_prot = mut_trans[0].protein
mut_prot.sequence

ref_G = Genome(name = 'GRCh37.75')
ref_trans = ref_G.get(Transcript, id ='ENST00000445658')
ref_prot = ref_trans[0].protein
ref_prot.sequence

mut_prot sequence output:
'MELAALCRWGLLLALLPPGAASTQDNYLSTDVGSCTLVCPLHNQEVTAEDGTQRCEKCSKPCARVCYGLGMEHLREVRAVTSANIQEFAGCKKIFGSLAFLPESFDGDPASNTAPLQPEQLQVFETLEEITGYLYISAWPDSLPDLSVFQNLQVIRGRILHNGAYSLTLQGLGISWLGLRSLRELGSGLALIHHNTHLCFVHTVPWDQLFRNPHQALLHTANRPEDECVGEGLACHQLCARGHCWGPGPTQCVNCSQFLRGQECVEECRVLQGLPREYVNARHCLPCHPECQPQNGSVTCFGPEADQCVACAHYKDPPFCVARCPSGVKPDLSYMPIWKFPDEEGACQPCPINCTHSCVDLDDKGCPAEQRASPLTSIISAVVGILLVVVLGVVFGILIKRRQQKIRKYTMRRLLQETELVEPLTPSGAMPNQAQMRILKETELRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAGVGSPYVSRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFVVIQNEDLGPASPLDSTFYRSLLEDDDMGDLVDAEEYLVPQQGFFCPDPAPGAGGMVHHRHRSSSTRSGGGDLTLGLEPSEEEAPRSPLAPSEGAGSDVFDGDLGMGAAKGLQSLPTHDPSPLQRYSEDPTVPLPSETDGYVAPLTCSPQPEYVNQPDVRPQPPSPREGPLPAARPAGATLERPKTLSPGKNGVVKDVFAFGGAVENPEYLTPQGGAAPQPHPPPAFSPAFDNLYYWDQDPPERGAPPSTFKGTPTAENPEYLGLDVPV`

ref_prot.sequence output:
'MELAALCRWGLLLALLPPGAASTQDNYLSTDVGSCTLVCPLHNQEVTAEDGTQRCEKCSKPCARVCYGLGMEHLREVRAVTSANIQEFAGCKKIFGSLAFLPESFDGDPASNTAPLQPEQLQVFETLEEITGYLYISAWPDSLPDLSVFQNLQVIRGRILHNGAYSLTLQGLGISWLGLRSLRELGSGLALIHHNTHLCFVHTVPWDQLFRNPHQALLHTANRPEDECVGEGLACHQLCARGHCWGPGPTQCVNCSQFLRGQECVEECRVLQGLPREYVNARHCLPCHPECQPQNGSVTCFGPEADQCVACAHYKDPPFCVARCPSGVKPDLSYMPIWKFPDEEGACQPCPINCTHSCVDLDDKGCPAEQRASPLTSIISAVVGILLVVVLGVVFGILIKRRQQKIRKYTMRRLLQETELVEPLTPSGAMPNQAQMRILKETELRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAGVGSPYVSRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFVVIQNEDLGPASPLDSTFYRSLLEDDDMGDLVDAEEYLVPQQGFFCPDPAPGAGGMVHHRHRSSSTRSGGGDLTLGLEPSEEEAPRSPLAPSEGAGSDVFDGDLGMGAAKGLQSLPTHDPSPLQRYSEDPTVPLPSETDGYVAPLTCSPQPEYVNQPDVRPQPPSPREGPLPAARPAGATLERPKTLSPGKNGVVKDVFAFGGAVENPEYLTPQGGAAPQPHPPPAFSPAFDNLYYWDQDPPERGAPPSTFKGTPTAENPEYLGLDVPV`

manifest.ini

[package_infos]
description = mutant peptide generation
maintainer = Tariq Daouda
maintainer_contact = [email protected]
version = 1

[set_infos]
species = human
name = test_var
type = Agnostic
source = TCGA variants

[snps]
filename = test_var.txt

test_var.txt

chromosomeNumber uniqueId start end ref alt quality caller
17 1 37881637 37881637 A G 255 GATK
17 2 37881646 37881646 T G 255 GATK
7 3 55242465 55242479 GGAATTAAGAGAAGC - 255 GATK
7 4 55242465 55242479 - TGGCCAGCG 255 GATK
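The filter in this issue decides between a substitution, an insertion and a deletion from the `ref` and `alt` columns, using '-' as the gap symbol. A minimal sketch of that decision on its own, without pyGeno, so it can be checked against the four rows above (`classify_variant` is a hypothetical helper, not part of pyGeno):

```python
def classify_variant(ref, alt, gap="-"):
    """Classify a variant from its ref/alt columns, with `gap` marking an empty allele."""
    if ref == gap and alt == gap:
        raise ValueError("ref and alt cannot both be empty")
    if ref == gap:
        return "insertion"     # nothing in the reference, bases added
    if alt == gap:
        return "deletion"      # reference bases removed
    return "substitution"      # both alleles present


# The four rows of test_var.txt as (ref, alt) pairs.
rows = [("A", "G"), ("T", "G"), ("GGAATTAAGAGAAGC", "-"), ("-", "TGGCCAGCG")]
for ref, alt in rows:
    print(ref, alt, classify_variant(ref, alt))
```

In a filter like the one above, the three outcomes would map to SequenceSNP, SequenceDel and SequenceInsert respectively. Comparing the whole string with '-' rather than only its first character avoids misreading alleles that merely start with a dash.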

printDatawraps fails when using setup.py install

The directory containing pyGeno's datawraps (bootstrap_data) is not copied automatically when installing with 'python setup.py install' (as opposed to 'develop') or with pip.

Python 3 support

Make pyGeno compatible with python 3.

Error in importing my_SNP.tar.gz

Hi, I was trying to import my SNP file and I followed your instructions here. However, I got the following error message:
Importing polymorphism set: my/path/to/my_SNP.tar.gz... (This may take a while)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyGeno/importation/SNPs.py", line 65, in importSNPs
return _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile)
File "pyGeno/importation/SNPs.py", line 192, in _importSNPs_dbSNPSNP
snpData = VCFFile(snpsFile, gziped = True, stream = True)
File "pyGeno/tools/parsers/VCFTools.py", line 89, in __init__
self.parse(filename, gziped, stream)
File "pyGeno/tools/parsers/VCFTools.py", line 106, in parse
ll = self.f.readline()
File "/usr/lib/python2.7/gzip.py", line 464, in readline
c = self.read(readsize)
File "/usr/lib/python2.7/gzip.py", line 268, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 303, in _read
self._read_gzip_header()
File "/usr/lib/python2.7/gzip.py", line 197, in _read_gzip_header
raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file

I zipped both manifest.ini and snps.txt to my_SNP.tar.gz using tar -cvzf.

Do you have any idea why this issue comes up and how I could fix this?

Thanks
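The traceback shows the importer opening the file named in the manifest with `gziped = True`, so a plain-text snps.txt inside the archive would fail exactly like this. A quick way to check a file before packaging it is to look at its first two bytes. This is a generic sketch; the file names are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Demonstration on two throwaway files.
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "snps.txt")
gz_path = os.path.join(workdir, "snps.vcf.gz")
with open(plain_path, "w") as handle:
    handle.write("chromosomeNumber\tuniqueId\n")
with gzip.open(gz_path, "wb") as handle:
    handle.write(b"##fileformat=VCFv4.1\n")

print(is_gzipped(plain_path))  # False
print(is_gzipped(gz_path))     # True
```

Whether the right fix here is to gzip the SNP file or to declare a different set type in manifest.ini is not settled in this issue.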

pyGeno does not

The following issue has arisen

Using the pip install pyGeno command, pyGeno was installed as seen below.

"C:\Windows\system32>pip install pyGeno
Requirement already satisfied (use --upgrade to upgrade): pyGeno in c:\python27\lib\site-packages
Requirement already satisfied (use --upgrade to upgrade): rabaDB>=1.0.2 in c:\python27\lib\site-packages (from pyGeno)
You are using pip version 8.1.1, however version 8.1.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command."

However, the genome import did not work with suggested command "import pyGeno.bootstrap as B "

It has given the following error:

"Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\site-packages\pyGeno\__init__.py", line 4, in
pyGeno_init()
File "C:\Python27\lib\site-packages\pyGeno\configuration.py", line 103, in pyGeno_init
db = rabaDB.rabaSetup.RabaConnection(pyGeno_RABA_NAMESPACE)
File "C:\Python27\lib\site-packages\rabaDB\rabaSetup.py", line 21, in __call__
cls._instances[key] = type.__call__(cls, *args, **kwargs)
File "C:\Python27\lib\site-packages\rabaDB\rabaSetup.py", line 48, in __init__
self.connection = sq.connect(RabaConfiguration(namespace).dbFile)
sqlite3.OperationalError: unable to open database file"

When we checked C:\Python27\Lib\site-packages, both pyGeno and rabaDB were found in that folder. We then uninstalled pyGeno and Python 2.7.12 and reinstalled them, but the problem persists. Unfortunately, we could not figure out a way to resolve it.

An incompatibility between pyGeno and a non-English Windows 10 is the only thing we could think of, as there are non-English characters. Any idea what to do?
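SQLite raises "unable to open database file" when it cannot create or open the file at the given path, for instance because a folder in the path does not exist or is not writable. The sketch below reproduces that condition with the standard library only; the path pyGeno actually uses for its database is not shown in this issue, so the paths here are hypothetical.

```python
import os
import sqlite3
import tempfile


def can_open_database(db_path):
    """Try to open (and create, if absent) an SQLite file at `db_path`."""
    try:
        connection = sqlite3.connect(db_path)
    except sqlite3.OperationalError:
        return False
    connection.close()
    return True


workdir = tempfile.mkdtemp()
good_path = os.path.join(workdir, "demo.db")
bad_path = os.path.join(workdir, "missing_folder", "demo.db")

print(can_open_database(good_path))  # True: the folder exists and is writable
print(can_open_database(bad_path))   # False: the parent folder does not exist
```

Running the same check against the folder where pyGeno stores its settings would show whether the location, rather than the Windows language, is the problem.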

Quick example on the home page does not work

The quick example on the homepage contains several typos and omissions (six in total!) that make the code unrunnable.

In general, and especially for a quick example, code should be run through Python to make sure it works before it is pasted into the docs. It should also be self-sufficient, needing no other information to run.

Since we are here, I will note that this page does not do a good job of demonstrating what the library actually does. For me as a Python programmer, the interesting part is not that the library can extract the sequences of Ensembl proteins; that is a job I can already do with many tools.

What interests me, as the next level, is combining and querying the Ensembl genes and the SNPs at the same time.

But the quick start stops at the most interesting line:

g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter()

OK, there is promise here, but what can I do once I have this construct? What is MyFilter(), and what does it do?

Installation not working on Mac OS Yosemite 10.10.5

Hi Tariq,

Following the recommended installation, I tried to run this simple script:

#! /usr/local/bin/python
import sys
from pyGeno.Genome import *
g = Genome(name = "GRCh37.75")
sys.exit()

But I get this error:

Traceback (most recent call last):
  File "./script.py", line 8, in <module>
    g = Genome(name = "GRCh37.75")
  File "/usr/local/pyGeno/pyGeno/Genome.py", line 67, in __init__
    pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
  File "/usr/local/pyGeno/pyGeno/pyGenoObjectBases.py", line 83, in __init__
    self.wrapped_object = self._wrapped_class(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/rabaDB/Raba.py", line 301, in __call__
    raise KeyError("Couldn't find any object that fit the arguments you've prodided to the constructor")
KeyError: "Couldn't find any object that fit the arguments you've prodided to the constructor"

Any ideas?
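The KeyError comes from the database lookup, which suggests that no genome named GRCh37.75 has been imported yet. One way to handle this is to try the constructor first and import only when the lookup fails. The helper below is a hypothetical sketch that takes both steps as callables, so it does not depend on pyGeno itself.

```python
def load_or_import(make_genome, import_genome):
    """Return make_genome(); if it raises KeyError, run import_genome() and retry once."""
    try:
        return make_genome()
    except KeyError:
        import_genome()
        return make_genome()


# With pyGeno this could be wired up as follows (calls as they appear
# elsewhere on this page, not verified here):
#   import pyGeno.bootstrap as B
#   from pyGeno.Genome import Genome
#   g = load_or_import(lambda: Genome(name="GRCh37.75"),
#                      lambda: B.importGenome("Human.GRCh37.75.tar.gz"))
```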

no module named configuration

when calling
from pyGeno.Genome import *

ImportError: No module named 'configuration'

Unsupported translation of selenocysteine

Proteins containing a selenocysteine are translated with a stop codon instead of U. The GTF file stores the codon position of each selenocysteine:

2 ensembl_havana Selenocysteine 84670381 84670383 . - . gene_id "ENSMUSG00000076437"; gene_version "10"; transcript_id "ENSMUST00000117299"; transcript_version "8"; gene_name "Selenoh"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000013495"; havana_gene_version "5"; transcript_name "Selenoh-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS16187"; havana_transcript "OTTMUST00000032615"; havana_transcript_version "2"; tag "seleno"; tag "basic"; transcript_support_level "1";

Any suggestions on how to support this feature?
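One possible approach is to translate as usual and then replace the premature stop with U at each annotated position. The sketch below assumes the Selenocysteine coordinates from the GTF have already been mapped to 0-based residue indexes in the protein; that mapping is not shown, and `mark_selenocysteines` is a hypothetical helper.

```python
def mark_selenocysteines(protein, sec_positions):
    """Replace '*' with 'U' at the given 0-based residue indexes."""
    residues = list(protein)
    for position in sec_positions:
        if residues[position] != "*":
            # Guard against a wrong coordinate mapping.
            raise ValueError("residue %d is %r, not a stop" % (position, residues[position]))
        residues[position] = "U"
    return "".join(residues)


# Toy protein with an internal stop at index 2 and the real stop at the end.
print(mark_selenocysteines("MA*GK*", [2]))  # MAUGK*
```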

Write tests

We need more tests

Error: Genome object instantiation

Hi,
When I tried to create a Genome object, I received the following error:

from pyGeno.Genome import *
g = Genome(name = "GRCh37.75")

KeyError: "Couldn't find any object that fit the arguments you've prodided to the constructor".

I installed pyGeno with pip.

Thanks.

Multiprocessing problem with sqlite

Problems arise when pyGeno is imported in the main thread and the DB is then accessed from spawned processes.

Example error: pygeno: DatabaseError: file is encrypted or is not a database
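SQLite connections should not be carried from a parent process into its children. A common pattern is to open a fresh connection inside each worker. The sketch below shows this with the standard library only; the table and function names are hypothetical, and whether pyGeno or rabaDB can be made to reconnect in this way is not shown in this issue.

```python
import multiprocessing
import sqlite3


def make_demo_db(db_path, n_rows):
    """Create a small table so the workers have something to read."""
    connection = sqlite3.connect(db_path)
    connection.execute("CREATE TABLE items (value INTEGER)")
    connection.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(n_rows)])
    connection.commit()
    connection.close()


def count_rows(db_path):
    """Worker: open its own connection instead of reusing one from the parent."""
    connection = sqlite3.connect(db_path)
    try:
        return connection.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    finally:
        connection.close()


def count_in_workers(db_path, n_workers=2):
    """Run count_rows in separate processes; each one connects by itself."""
    pool = multiprocessing.Pool(n_workers)
    try:
        return pool.map(count_rows, [db_path] * n_workers)
    finally:
        pool.close()
        pool.join()
```

When used from a script, `count_in_workers` should be called under an `if __name__ == "__main__":` guard so that worker processes do not re-run the setup code.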

Asking for SNPs through get() does not work

Asking for SNPs through get() does not work. Please use the raba interface for retrieving SNPs:

from rabaDB.filters import *

f = RabaQuery('dbSNPSNP')
f.addFilter({"chromosomeNumber =" : 22, "start >":  x1, "end <": x2})
snps = f.run()

This will be fixed in the next issue.

Connecting data and logic

Connecting the data from arangodb to the logic in pyGeno.

pip installation does not work

pip install pyGeno

then

>>> import pyGeno.bootstrap as B
>>> B.printDatawraps()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ialbert/.virtualenvs/work/lib/python2.7/site-packages/pyGeno/bootstrap.py", line 24, in printDatawraps
    l = listDatawraps()
  File "/Users/ialbert/.virtualenvs/work/lib/python2.7/site-packages/pyGeno/bootstrap.py", line 12, in listDatawraps
    for f in os.listdir(os.path.join(this_dir, "bootstrap_data/genomes")) :
OSError: [Errno 2] No such file or directory: '/Users/ialbert/.virtualenvs/work/lib/python2.7/site-packages/pyGeno/bootstrap_data/genomes'
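The OSError indicates that the bootstrap_data folder was not installed alongside the package. A small check like the one below can confirm which data folders are missing from an installation. The two sub-folder names are taken from paths that appear on this page; the helper itself is hypothetical.

```python
import os

# Sub-folders that bootstrap.py reads, according to the tracebacks on this page.
EXPECTED_SUBDIRS = ("bootstrap_data/genomes", "bootstrap_data/SNPs")


def missing_data_dirs(package_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected sub-folders that do not exist under `package_dir`."""
    return [sub for sub in expected
            if not os.path.isdir(os.path.join(package_dir, *sub.split("/")))]


# Typical use, assuming pyGeno can at least be imported:
#   import pyGeno
#   print(missing_data_dirs(os.path.dirname(pyGeno.__file__)))
```

An empty list means the data folders were installed; anything else points to a packaging problem rather than a user error.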

Importation of SNPs

Untouched, should remove Casava and TopHat

Remote datawraps

The following example doesn't work.

B.printRemoteDatawraps()
Traceback (most recent call last):
File "", line 1, in
File "/usr/src/app/pyGeno/bootstrap.py", line 45, in printRemoteDatawraps
l = listRemoteDatawraps(location)
File "/usr/src/app/pyGeno/bootstrap.py", line 15, in listRemoteDatawraps
js = json.loads(response.read())
File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 6 column 3 (char 113)
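"Expecting property name" is raised when the JSON parser expects an object key and finds something else; a trailing comma before a closing brace is one common cause. The remote payload itself is not shown in this issue, so the input below is only a hypothetical example of that class of error, and `parse_json_or_explain` is an illustrative helper.

```python
import json


def parse_json_or_explain(text):
    """Return (data, None) on success or (None, error message) on malformed JSON."""
    try:
        return json.loads(text), None
    except ValueError as error:  # json.JSONDecodeError is a ValueError subclass
        return None, str(error)


valid = '{"name": "Human.GRCh37.75"}'
trailing_comma = '{"name": "Human.GRCh37.75",}'

print(parse_json_or_explain(valid))
print(parse_json_or_explain(trailing_comma))
```

Reporting the parser's message together with the source URL would make failures of this kind easier to diagnose than a bare traceback.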

checkPythonVersion is not ready for python 3

checkPythonVersion fails for some versions of Python 3.

It is not yet documented which versions of Python 3 are supported by pyGeno 2.0.0.
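How checkPythonVersion is implemented is not shown here. As a general sketch, comparing version tuples is more robust than comparing strings or floats, which misorder versions such as 3.10. The function name and the minimum version below are hypothetical.

```python
import sys


def version_at_least(version_info, minimum):
    """Compare (major, minor) as integer tuples."""
    return tuple(version_info[:2]) >= tuple(minimum)


print(version_at_least(sys.version_info, (2, 7)))

# Why floats are a poor fit: "3.10" parses as 3.1, which sorts below 3.5.
print(float("3.10") < float("3.5"))            # True
print(version_at_least((3, 10, 0), (3, 5)))    # True
```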

AgnosticSNP quality is a string and should be float

This is an issue for comparison because in python 2.7:

'0.001' > 20
True

For SNP filter this will silently fail:

    def filter(self, chromosome, **kwargs):
        for snp_set, snp in kwargs.iteritems():
            if snp.quality > self.threshold:
                return SequenceSNP(snp.alt)

snp.quality must be cast to float to obtain the expected result
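A minimal sketch of the cast, written as a stand-alone helper (`passes_quality` is a hypothetical name). Converting the stored string to float before comparing gives the numeric result regardless of how Python orders strings and numbers.

```python
def passes_quality(quality, threshold):
    """Compare a quality value stored as text against a numeric threshold."""
    return float(quality) > threshold


print(passes_quality("0.001", 20))  # False
print(passes_quality("255", 8))     # True
```

Inside a filter, the comparison would then read `float(snp.quality) > self.threshold`.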

Problem with the documented syntax for advanced queries (e.g., getting a list of genes within start and end coordinates)

Good afternoon, this question is related to making advanced queries on pyGeno with .get

Based on this documentation, we can query for specific details as such:

#even complex stuff
exons = myChromosome.get(Exons, {'start >=' : x1, 'stop <' : x2})
hlaGenes = myGenome.get(Gene, {'name like' : 'HLA'})

sry = myGenome.get(Transcript, { "gene.name" : 'SRY' })

Unfortunately, none of these commands seem to work, while basic commands for getting specific genes based on their ids work:

#in this case both queries will yield the same result
myGene.get(Protein, id = "ENSID...")
myGenome.get(Protein, id = "ENSID...")

In this situation, I am attempting to call a list of genes within a particular set of coordinates on a particular chromosome. To illustrate the problem, I use .get to call p53 (Chr17 in humans):

# getting the gene based on id
gene_example = g.get(Gene, id = 'ENSG00000141510')

# confirming the gene based on chromosome - note that I give the index [0] because for some reason, .get seems to generate a single-index list of the Raba object
print(gene_example[0].chromosome.number)
>17

# now, I get the start and end coords
x1 = gene_example[0].start
x2 = gene_example[0].end

# finally, I test getting the gene using the coords
gene_test = g.get(Gene, {'start >=': x1, 'end <=': x2, 'chromosome.number': 17})

Ultimately, gene_test is not assigned to any value because g.get can't find anything within those coordinates. Even when I tested by replacing x1 and x2 with nearly the entire chromosomal length, no genes were identified.

Would anyone happen to know the correct syntax for this? Perhaps it has changed in recent updates. Thank you!

urllib error during B.importGenome in pip version

The pip version of the package generates an FTP error,
"IOError: [Errno ftp error] 200 Switching to Binary mode.", during bootstrap import of at least Human.GRCh37.75.tar.gz.

The traceback indicates line 46 of importation/Genomes.py may be causing the issue.
_getFile function:
line 46: urllib.urlretrieve (fil, finalFile)

The GitHub bloody branch replaces line 46 of importation/Genomes.py with an iterator and seems to resolve this issue.

You might want to update the pip version of pyGeno. Thanks for making this package available under Apache 2.0!

TRACEBACK:

IOError Traceback (most recent call last)
in ()
----> 1 get_ipython().magic(u'time B.importGenome("Human.GRCh37.75.tar.gz")')

/opt/conda/envs/python2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
2156 magic_name, _, magic_arg_s = arg_s.partition(' ')
2157 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2158 return self.run_line_magic(magic_name, magic_arg_s)
2159
2160 #-------------------------------------------------------------------------

/opt/conda/envs/python2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2077 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2078 with self.builtin_trap:
-> 2079 result = fn(*args,**kwargs)
2080 return result
2081

in time(self, line, cell, local_ns)

/opt/conda/envs/python2/lib/python2.7/site-packages/IPython/core/magic.pyc in (f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):

/opt/conda/envs/python2/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
1179 if mode=='eval':
1180 st = clock2()
-> 1181 out = eval(code, glob, local_ns)
1182 end = clock2()
1183 else:

in ()

/opt/conda/envs/python2/lib/python2.7/site-packages/pyGeno/bootstrap.pyc in importGenome(name, batchSize)
100 """Import a genome shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties."""
101 path = os.path.join(this_dir, "bootstrap_data", "genomes/" + name)
--> 102 PG.importGenome(path, batchSize)
103
104 def importSNPs(name) :

/opt/conda/envs/python2/lib/python2.7/site-packages/pyGeno/importation/Genomes.pyc in importGenome(packageFile, batchSize, verbose)
149 raise KeyError("The directory %s already exists, Please call deleteGenome() first if you want to reinstall" % seqTargetDir)
150
--> 151 gtfFile = _getFile(parser.get('gene_set', 'gtf'), packageDir)
152
153 chromosomesFiles = {}

/opt/conda/envs/python2/lib/python2.7/site-packages/pyGeno/importation/Genomes.pyc in _getFile(fil, directory)
44 printf("Downloading file: %s..." % fil)
45 finalFile = os.path.normpath('%s/%s' %(directory, fil.split('/')[-1]))
---> 46 urllib.urlretrieve (fil, finalFile)
47 printf('done.')
48 else :

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in urlretrieve(url, filename, reporthook, data, context)
96 else:
97 opener = _urlopener
---> 98 return opener.retrieve(url, filename, reporthook, data)
99 def urlcleanup():
100 if _urlopener:

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in retrieve(self, url, filename, reporthook, data)
243 except IOError:
244 pass
--> 245 fp = self.open(url, data)
246 try:
247 headers = fp.info()

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in open(self, fullurl, data)
211 try:
212 if data is None:
--> 213 return getattr(self, name)(url)
214 else:
215 return getattr(self, name)(url, data)

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in open_ftp(self, url)
556 value in ('a', 'A', 'i', 'I', 'd', 'D'):
557 type = value.upper()
--> 558 (fp, retrlen) = self.ftpcache[key].retrfile(file, type)
559 mtype = mimetypes.guess_type("ftp:" + url)[0]
560 headers = ""

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in retrfile(self, file, type)
904 try:
905 cmd = 'RETR ' + file
--> 906 conn, retrlen = self.ftp.ntransfercmd(cmd)
907 except ftplib.error_perm, reason:
908 if str(reason)[:3] != '550':

/opt/conda/envs/python2/lib/python2.7/ftplib.pyc in ntransfercmd(self, cmd, rest)
332 size = None
333 if self.passiveserver:
--> 334 host, port = self.makepasv()
335 conn = socket.create_connection((host, port), self.timeout)
336 try:

/opt/conda/envs/python2/lib/python2.7/ftplib.pyc in makepasv(self)
310 def makepasv(self):
311 if self.af == socket.AF_INET:
--> 312 host, port = parse227(self.sendcmd('PASV'))
313 else:
314 host, port = parse229(self.sendcmd('EPSV'), self.sock.getpeername())

/opt/conda/envs/python2/lib/python2.7/ftplib.pyc in parse227(resp)
828
829 if resp[:3] != '227':
--> 830 raise error_reply, resp
831 global _227_re
832 if _227_re is None:

IOError: [Errno ftp error] 200 Switching to Binary mode.
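The issue mentions that the bloody branch replaces the single urlretrieve call with an iterator. That code is not shown here, so the following is only a generic sketch of iterator-based copying from one file-like object to another; how the source is opened (for example over FTP) is left out, and `copy_in_chunks` is a hypothetical helper.

```python
import io


def copy_in_chunks(source, destination, chunk_size=64 * 1024):
    """Copy `source` to `destination` chunk by chunk; return the number of bytes copied."""
    total = 0
    # iter(callable, sentinel) keeps calling read() until it returns b"".
    for chunk in iter(lambda: source.read(chunk_size), b""):
        destination.write(chunk)
        total += len(chunk)
    return total


# Demonstration with in-memory streams standing in for a download and a file.
payload = b"ACGT" * 1000
target = io.BytesIO()
print(copy_in_chunks(io.BytesIO(payload), target, chunk_size=256))  # 4000
```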

Integration of the new query method

Queries with "ORs" have to be integrated

sqlite3:OperationalError:No such table:

Hi,

I was trying to import my genome file, following the instructions in the docs.

The start of the importation is fine: it writes "Importation begins!" and progress reaches 100%, but an error occurs at the following step:
"almost done saving chromosomes...
\ progress[ --~-?:> ] ?% (1/?) runtime: ..."
The last line of the message is:
"sqlite3.OperationalError: no such table: main.RabaList_exons_for_Transcript_Raba"

How can I solve this problem?

Thanks for your help.

Problem with insertions

I'm trying to get a translation for an insertion. Using the default SNPFiltering, I do not see the insertion in the sequence. I think it is either not added to the sequence (although it is present in the db) or treated as a SequenceSNP. Everything works great for SNPs.

I tried creating my own filter to support the insertion as explained in #33, but I get this error :

File "/u/boucherg/.virtualenvs/pyGeno_git/pyGeno/pyGeno/SNP.py", line 64, in __getattribute__
return Raba.__getattribute__(self, k)
File "build/bdist.linux-x86_64/egg/rabaDB/Raba.py", line 648, in __getattribute__
TypeError: attribute name must be string, not 'int'

I'm not sure what I'm doing wrong. Here are the code and the SNP set entry.

chromosomeNumber uniqueId start end ref alleles quality caller
5 1 170837542 170837543 - TCTT 0 custom

from pyGeno.Genome import *
from pyGeno.importation.SNPs import *
from pyGeno.SNPFiltering import SNPFilter, SequenceInsert, SequenceSNP, SequenceDel

class MyFilter(SNPFilter) :
    def __init__(self) :
        SNPFilter.__init__(self)

    def filter(self, chromosome, snp_custom) :
        for s in snp_custom:
            if s.alleles != '-' and s.ref != '-':
                return SequenceSNP(s.alleles)
            elif s.alleles == '-':
                return SequenceDel(len(s.ref))
            elif s.ref == '-':
                return SequenceInsert(s.alleles)

if 'snp_custom' in getSNPSetsList() :
    deleteSNPs('snp_custom')

importSNPs("snps_tmp")
genome = Genome(name = 'GRCh37.75', SNPs='snp_custom', SNPFilter = MyFilter())
gene = genome.get(Gene, name='NPM1')[0]
tr = gene.get(Transcript, name='NPM1-001')[0]
tr.sequence

[package_infos]
description = SNPs for testing purposes
maintainer = The Maintainer
maintainer_contact = maintainer [at] email.ca
version = 1

[set_infos]
species = human
name = snp_custom
type = agnosticsnp
source = Where do these snps come from?

[snps]
filename = snps.txt

Unknown SNP type in manifest dbSNP, for dbSNP149

The error message is below:

In [1]: import pyGeno.bootstrap as B
In [2]: B.importSNPs("GRCh38p7_dbSNP149_common_all.tar.gz")
Importing polymorphism set: /u/eaudemard/dev/pyGeno/pyGeno/bootstrap_data/SNPs/GRCh38p7_dbSNP149_common_all.tar.gz... (This may take a while)
Downloading file: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b149_GRCh38p7/VCF/common_all_20161122.vcf.gz...
done.
---------------------------------------------------------------------------
FutureWarning                             Traceback (most recent call last)
<ipython-input-2-e9741c79a81d> in <module>()
----> 1 B.importSNPs("GRCh38p7_dbSNP149_common_all.tar.gz")

/u/eaudemard/dev/pyGeno/pyGeno/bootstrap.pyc in importSNPs(name)
    108         """Import a SNP set shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties."""
    109         path = os.path.join(this_dir, "bootstrap_data", "SNPs/" + name)
--> 110         PS.importSNPs(path)

/u/eaudemard/dev/pyGeno/pyGeno/importation/SNPs.pyc in importSNPs(packageFile)
     69                         return _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile)
     70                 else :
---> 71                         raise FutureWarning('Unknown SNP type in manifest %s' % typ)
     72         else :
     73                 raise KeyError("There's already a SNP set by the name %s. Use deleteSNPs() to remove it first" %setName)

FutureWarning: Unknown SNP type in manifest dbSNP

Here is the manifest.ini:

[package_infos]
description = SNP set for dbSNP149 that contains only common SNP. For more details: http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/
maintainer = Eric Audemard
maintainer_contact = [email protected]
version = 1

[set_infos]
species = human
name = GRCh38p7_dbSNP149_common_all
type = dbSNP
source = ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b149_GRCh38p7/VCF/common_all_20161122.vcf.gz

[snps]
filename = ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b149_GRCh38p7/VCF/common_all_20161122.vcf.gz
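The importer rejects the `type` value in `[set_infos]`. The traceback in an earlier issue on this page shows an importer function named `_importSNPs_dbSNPSNP`, which hints that the expected value may be `dbSNPSNP` rather than `dbSNP`, but that is an inference, not something this issue confirms. The sketch below reads the type from a manifest and checks it against an assumed list of names; both the list and the case-insensitive comparison are assumptions.

```python
import configparser  # named ConfigParser on Python 2

# Assumption: type names inferred from importer function names seen on this page.
ASSUMED_KNOWN_TYPES = {"agnosticsnp", "dbsnpsnp"}

MANIFEST = """
[set_infos]
species = human
name = GRCh38p7_dbSNP149_common_all
type = dbSNP
"""


def manifest_snp_type(text):
    """Return the raw `type` value from the [set_infos] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser.get("set_infos", "type")


def is_known_type(snp_type, known=ASSUMED_KNOWN_TYPES):
    """Case-insensitive membership test against the assumed type names."""
    return snp_type.lower() in known


snp_type = manifest_snp_type(MANIFEST)
print(snp_type, is_known_type(snp_type))
```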

Genome importation

Make sure everything is imported the right way

0 based vs 1 based (ensembl)
Selenocysteines
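On the first point: GTF coordinates are 1-based and inclusive at both ends, while Python slices are 0-based and exclude the end. A minimal sketch of the conversion (the helper name is hypothetical, and how pyGeno stores coordinates internally is not stated here):

```python
def one_based_to_slice(start, end):
    """Convert a 1-based inclusive interval to 0-based half-open slice bounds."""
    return start - 1, end


sequence = "ACGTACGT"
begin, stop = one_based_to_slice(2, 4)  # bases 2..4 in 1-based terms
print(sequence[begin:stop])             # CGT

# The Selenocysteine feature quoted on this page spans 84670381..84670383,
# which is 3 bases, i.e. one codon.
begin, stop = one_based_to_slice(84670381, 84670383)
print(stop - begin)                     # 3
```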

Out of frame protein sequences

Problem: proteins whose translation start sites are not certain give out-of-frame sequences.
Solution: the frame of the first exon should somehow be taken into account when generating the CDS.

from pyGeno.Genome import *
# Assumption: these module paths match the installed pyGeno version
from pyGeno.Protein import Protein
from pyGeno.tools.UsefulFunctions import translateDNA, showDifferences

refGenome=Genome(name="GRCh38.80")
refProt=refGenome.get(Protein,id="ENSP00000349216")[0]
print "pyGeno"
print refProt.sequence
# GENCODE reference sequence, joined from adjacent string literals
gencode_seq=("XHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQS"
    "RCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLV"
    "SALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGL"
    "AQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGF"
    "LPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQ"
    "RRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTG"
    "ARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFP"
    "YAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDG"
    "ETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPEREL"
    "GTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC")
print "GENCODE"
print gencode_seq
# Shift the reading frame by the frame of the first exon
first_exon_frame=refProt.transcript.exons[0].frame
print first_exon_frame
new_seq= "X"+translateDNA(refProt.transcript.cDNA[0:-3],frame="f"+str(1+first_exon_frame))
print "Corrected sequence"
print new_seq
print showDifferences(gencode_seq,new_seq)