mdshw5 / pyfaidx Goto Github PK

View Code? Open in Web Editor NEW

447.0 9.0 74.0 11.99 MB

Efficient pythonic random access to fasta subsequences

Home Page: https://pypi.python.org/pypi/pyfaidx

License: Other

Python 100.00%

python bioinformatics fasta indexing dna protein genomics bgzf samtools

pyfaidx's People

Stargazers

Watchers

Forkers

brentp azalea jsilter spheregenomics eclarke sveint jorisbenschop antoniogps mattions jeffbhasin timodonnell jrderuiter juefish nvictus yfu jasper1918 snashraf zzygyx9119 fw1121 sunhuaibo terrycojones dhauge scchess biocodings ylgao1 imb-fmg edwardbetts simonvh palc wy2160640 chenatf palakpsheth hsiaoyi0504 skerker varir rbs-pli edawson prihoda iostreamatlab abdo3a cez026 c-nit hardingnj yanglq-bio folded wook2014 flystar233 structuralgenetics y9c kopardev lldelisle muntakimrafi gileshall palao manzt lry198010 ipark2021 ap-- johanneskoester aroon-color emollier siebrenf capuanob mayhemheroes chengaoxiang1985 barrymoore ghostintheshellarise biorad-jw genostack a-detiste rraadd88 asleonard jondeans marcelm

pyfaidx's Issues

Check that Faidx.fetch does not return header characters when passing a negative start coordinate

Format `faidx --stats` using presets e.g. "bed" "chromsizes" "nucleotide"...

"stats" right now just consists of the sequence names and lengths. It might be more useful to allow simple bed files, UCSC-style "chrom.sizes" files, and more advanced nucleotide (gc%, a/c/g/t, top kmer, etc...) formats to be specified like: faidx --stats bed and so on.

pyfaidx correctly installed but not working. error: unrecognized arguments

I installed pyfaidx as indicated.
Typing

faidx 
usage: faidx [-h] [-b BED] [--stats] [-c] [-r] [-n] [--split-files] [--lazy]
             [--default-seq DEFAULT_SEQ] [-d DELIMITER]
             [--mask-with-default-seq | --mask-by-case] [--version]
             fasta [regions [regions ...]]
.....

hower, when I try to run something it always tells me that there is an unrecognized arguments in my comand. The same is true if I use the example files.

faidx --stats tests/data/genes.fasta
usage: faidx [-h] [-b BED] [--stats] [-c] [-r] [-n] [--split-files] [--lazy]
             [--default-seq DEFAULT_SEQ] [-d DELIMITER]
             [--mask-with-default-seq | --mask-by-case] [--version]
             fasta [regions [regions ...]]
faidx: error: unrecognized arguments: tests/data/genes.fasta

I do not understand what I am missing here.

Add --version flag

Properly handle lowercase characters in FASTA

Specifically need to worry any time I do string replacement.

Better error message for non IUPAC nucleotide characters

I should just print the invalid sequence and line number to assist users in correcting invalid input sequence raised in #47.

FastaVariant needs tests

Reindex file if time stamp on fasta is newer than on index?

It would be great if the index were automatically recreated if the fasta file has a new time stamp than the .fai file. I just did this test as we had to rebuild a fasta we are using and reindexing did not occur.

thanks for considering!

Defer fetching sequence from file until `str()` is called on Sequence.

Read actual defline from FASTA file index "gap"

One current design limitation of pyfaidx is that it mirrors the samtools indexing behavior of truncating headers after whitespace. There is a good reason for this - any whitespace in the identifier would break the index file. A side effect of this is that frequently the "description" in a header will be lost when reading into the file using the index. It seems like an option to recover the full header line would be useful, and pretty cheap to implement.

To determine the byte offset and length of the header from the index file, we can determine the byte end of the preceding sequence by adding unprintable characters, and this should be the byte start of the real header line. We can then read from header byte start to sequence offset and save this as something like Sequence.long_name.

Is pyfaidx suited to do full genome lookups?

I am using fasta files representing the whole genome, with the only header lines being >chr1, >chr2 etc. and my indexes are huge, i.e. 37093012.

Will pyfaidx still perform well? If yes, this info should probably be added to the README.md.

fasta indexing is broken

Faidx.build_index gives different results than samtools

Why does Bio.SeqIO and Pyfaidx give different results?

I'm sure it is me mucking up somehow, but when I use Pyfaidx and Bio.SeqIO I get different results...

reference_genome_fasta = "/home/endrebak/genomes/hg38/hg38.fa"
position = slice(68344571,68348572)
hg38 = Fasta(reference_genome_fasta)
gene_pyfaidx = hg38["chr15"][position].seq
gene_pyfaidx[:20] #"aaaaaaaaaaaattccctgt"
#but with Bio.SeqIO...
with open(reference_genome_fasta, "rU") as genome_file_handle:
    chromosome_dictionary = SeqIO.to_dict(SeqIO.parse(genome_file_handle, "fasta"))
gene_bio_seqio = chromosome_dictionary["chr15"][position]
str(gene_bio_seqio[:20].seq) #'AGGGGGACTAGACTCGAAGA'

And when I try to look up the same sequence in UCSC, I get yet another sequence: https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr15:68346496-68346499&hgsid=198858039_mEdzdwARqPcSygOiGTzyEzcF87Ed

(only four first nucleotides shown ("ATGA") to get nuclotide resolution and added one to index since UCSC one-indexed).

Sequence contains non-DNA characters

This error is coming up in 0.3.4 and does not come up in 0.3.1

Traceback (most recent call last):
File "Step1.py", line 151, in
main()
File "Step1.py", line 123, in main
desired_sequence = -chromosome_infile[chromosome][start:stop] #if negative strand, give reverse complement in fasta file
File "/data/zappadata_p2/pschaugh/RepTags/venv/lib/python2.7/site-packages/pyfaidx/init.py", line 105, in neg
return self[::-1].complement
File "/data/zappadata_p2/pschaugh/RepTags/venv/lib/python2.7/site-packages/pyfaidx/init.py", line 140, in complement
comp = self.class(self.name, complement(self.seq),
File "/data/zappadata_p2/pschaugh/RepTags/venv/lib/python2.7/site-packages/pyfaidx/init.py", line 591, in complement
raise ValueError("Sequence contains non-DNA characters: \n {0}".format(''.join(wrap_sequence(70, seq))))
ValueError: Sequence contains non-DNA characters:
ttgtatggttttgaatggtctacgtagttttagacacgatcctcccatcaaatatcacatttacgaatac
tatttgttcttctttctagaatttaattgttagattgtaatatggaatttcttcatctttttccctcttg
tgtgatttgggtttaggtcatcttcctttctttattatttctgatctcatgtctgttgactttatttctt
atccttgtaccatctcattcaatcgttttggttcaagtcaagaaacttttttagtggttttaattacttg
ttaatcaagctgactcattctattttctcccttcagagttattgattttagtctttacattcacgtctgt
aatggtgactaaattgtctttattttactaatataattattctgttatcatacttataaacatacggttt
tttaacctgttggatctgttttacctatttaaggatctgtatatggtggatggttctgattnggtgtttc
tttatcttttatgcttgtctgtatattgatcgtttctctaacttagttactaggttttagacggttgttt
cttttagggtcctcgactaccgaagtaagcacttaagatggtttgtaaatttcttctcaattgtgattag
gaagagttttcaaaggttttttcaaattccattcccctgtgaaagtttgaataagaagttccggttctaa
tgagaaagtggtttcgttctgtttccatgatgttattttaattcgatgtctagttatagagactatttat
gactacatttttaggagttggtttataatcacttgtcttaattcactgtgtagtttttttaatatgtgta
ctggagcaccctaaataaagactttaagttcctatcaagttgtatacttttacttaattacattattgat
gtaattgtcttacttctttgtcaagggtacactagtagagttaactacctctttttataaactgttttag
gttgtggaaaggtactattttttgtangtggtttgatccttgtcttcctttgatggagttgtattaattc
cggtatatacctttcgggtaacgattgtagtatgaattactactttctaacttttgaaaagaagattcga
ttctgtgttcccttcttacggatgaaaatttaagtcgtatcattacctttagaactggtctatttaatcc
tttatttttcgtaattttaaccctttatttttcattttgctagggaaagtatctactacattaaagtata
catcttttggggtttctaaggtattctttgtactatcttgattattgttttaatcgtttgaatgtcttat
gagtcgtatgtctttaatcgacacaaagatatgttattgttacttgttggactttttctttaattctttc
gttaaagtaaatttcaccctggtttgttttattttttatctttgtttgacttggttcctccactttctac
acatgtgacttttgatactttataacgagtttctttaatttctgtgatgtttttatctgtaggaaacaag
tatctactcttctggattttaactttctacaccagagatctaatttacgttaaatagaattttaaggttg
tagttaaaaaaaaatgttttgatctttttagataggattgtaagtatatcttagagttccctgggattta
tcgtttttgttagaactttttcttgtgtcaacctccagagtgtgaagaatcaaagttttaaagattttta
gatgtcataatttttgtctcaccataactgcatatctgtctgtgtatctagtcactctatcttatctctc
gggtctttatttgagaacatatataccggaacactaaaaaatttttcaacgatcctgataggttatccct
tttcctgtcagaaaagttgtttacaatgacccttttgacctatcggtgtacgttttcttacttcaacatg
ggaatggaatgtggtatacgtctttaatttttatcagtttctgaatctatactctcaatcttatgagtgt
tcttttacatccccgtgtcgaagtaccgtaacgtaaaccgttcctaaagaaccgatgctgtggctttcgt
atccgttgtttttttagtctattaaccttgtacatttataatttttaaaaacacttagtttactgtgaca
gttgtctcattttccgttgtgtgacttaccgtcctttataaacatttaatatatagactattgcctaatg
ataggtatatatttcttgagaacgttaagttgttggttttttgttgtgttaattttttatctgtttcctg
aacttatatataaagaggttactatatgtttactggttattcatgtacttttctacgagttgttgtactt
attaatccttttacgtttagttttggtgtcactctatgatgtgttgatatctttgttataccaccaaaga
gtctaaagtgtactaggtcgttaatgtggagacccatatatggagtttcttaattttcgtctcggagttt
ctctacaaagatgtgggtacaggtgttgtcgtaataagtgttatcagttttctaccttcgttgggttcac
agacaactgtctacttacctatttgttttacaccgtataggtgttttaccttataataaattggaatttt
tcctttttatta

Add read ahead cache option for sequential accession

Take a look at the htslib faidx.c for closer implementation

https://github.com/samtools/htslib/blob/develop/faidx.c

Single short lines in the middle of a file are not detected

>AB821309.1
ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAACCTTGTCCCTGGCCCGGCCCT
CCTTCAGTTTAGTTGAGGATACCACATTAGAGCCAGAAGATGCCATCTCATCCGGAGATGATGAGA
CACCGATGGTGCGGAAGATTTTGTCAGTGAGAACAGTAACAACAAGAGAGCACCATACTGGACCAACACA

does not raise an error

Fasta(as_raw=True) behavior

I am finding that to get as_raw=True to work correctly, I have to call it twice:

ipdb> fasta = Fasta(fasta_filename, as_raw = True)
ipdb> chrom = fasta.keys()[0]
ipdb> fasta[chrom][1:2]
>chrII:236683-237683(+):2-2
T
ipdb> fasta = Fasta(fasta_filename, as_raw = True)
ipdb> fasta[chrom][1:2]
u'T'

Note that the index for fasta_filename doesn't exist before the first call.

Is this expected behaviour? Does as_raw only work on an existing index? i.e. the index is created at the first call, and then as_raw is used?

merge "bedmask" into cli.py

Add Bio.SeqIO test case

Add option `--stats` to print basic stats about the fasta file

Catch KeyError in faidx and warn user on stderr

It might be too harsh to exit faidx when sequence names don't exist. Instead, warning the user that the sequence name doesn't exist and then moving on would be more useful. See: https://www.biostars.org/p/141241/#141306

Generalized `Sequence` object to work with protein and RNA seqs

The original module was developed for the use case of accessing DNA strings from a reference genome file. It should be easy to generalize this to protein and RNA strings as well.

add checks for .complement() method

Subclass Fasta to ingest VCF and return consensus sequence

This should be straightforward using PyVCF and pysam. The subclass could be called FastaVariant and return a consensus sequence with either homozygous variants or heterozygous variants or both included as substitutions. Skip over indels and complex variants. MNPs can be handled but the length of the return sequence should be checked.

Rewrite tests to function more independently, download test data instead of storing in version control

[bug] GATK-bundle - duplicate key?

I would like to replace pygr with pyfaidx.
Previously I have been using the human_g1k_v37_decoy.fasta file, which is part of the GATK bundle (available through FTP) for getting sequences. I use the same file for alignments.

>>> from pyfaidx import Fasta
>>> genome = Fasta('human_g1k_v37_decoy.fasta')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 552, in __init__
    filt_function=filt_function)
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 209, in __init__
    self.read_fai(split_char)
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 244, in read_fai
    raise ValueError('Duplicate key "%s"' % rname)
ValueError: Duplicate key "['2', 'dna:chromosome', 'chromosome:GRCh37:2:1:243199373:1']"

The human_g1k_v37_decoy.fasta.fai file does not seem to contain that key twice.

Assess whether Fasta class is actually pygr API compatible

I wrote the Fasta class to provide some pygr interface elements, and then mention that pyfaidx has a pygr-compatible interface. I'm not sure this is true, and should evaluate whether implementing more of the pygr interface would be worthwhile.

Floating point values are allowed as file offsets

Using python 3.4.0 and whatever version of pyfaidx is currently grabbed by pip (you should add __version__ to the module)...

Simple access (and slicing) both end up trying to use floats where ints are required.

Traceback (most recent call last):
  File "./Ngaps.py", line 207, in <module>
    sys.exit(Main(argv=None))
  File "./Ngaps.py", line 100, in Main
    x = ref[0][100]
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 460, in __getitem__
    return self._fa.get_seq(self.name, n + 1, n + 1)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 561, in get_seq
    return self.faidx.fetch(name, start, end)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 336, in fetch
    self.fill_buffer(name, start, end + self.read_ahead)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 326, in fill_buffer
    seq = self.from_file(name, start, end)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 380, in from_file
    seq = self.file.read(seq_blen).decode()
TypeError: 'float' object cannot be interpreted as an integer

A quick fix (perhaps a hack) is to cast start and end to ints in from_file:

        start0 = int(start) - 1  # make coordinates [0,1)

and

        seq_len = int(end) - start0

Sorry, no time at the moment to thoroughly test, much less fork your most current code and proved you with a proper diff.

PS: The title of this issue isn't quite correct... It is a call to read where the error gets tripped, not an index.

PPS: Useful module! Thanks. Not super complicated, but thanks for doing it right and saving the rest of us from having to reinvent the wheel (or use SeqIO where it really isn't appropriate IMO).

Benchmark against other comparable tools

htslib
pyfasta
fastahack

Skip biopython tests on pypy3 and cpython3.2

When we can't install BioPython we should just skip those tests.

read-ahead caching is not effective at mitigating reads from disk

It looks like the read_ahead property is implemented with a bug that prevents proper caching.

long_name missing leading chr number

I think I may have found a bug. I can work around it, but thought you would want to fix it. It is really weird.

fasta file: mouse NCBI 38: GRCm38_68.fa

Symptom: on one record (chr 4), the long_name is missing the leading chr number. All the others are OK.

Hope the listings below help. (and thanks VERY much for pyfaidx…)
-Al

Confirm that the record is correct in the file:

$ awk '/^>4/{print $0; exit;}' GRCm38_68.fa

4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF

Open the file and see the top level:

from pyfaidx import Fasta
fa = Fasta('GRCm38_68.fa')
fa.keys()
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y', 'JH584295.1', 'JH584292.1', 'GL456368.1', 'GL456396.1', 'GL456359.1', 'GL456382.1', 'GL456392.1', 'GL456394.1', 'GL456390.1', 'GL456387.1', 'GL456381.1', 'GL456370.1', 'GL456372.1', 'GL456389.1', 'GL456378.1', 'GL456360.1', 'GL456385.1', 'GL456383.1', 'GL456213.1', 'GL456239.1', 'GL456367.1', 'GL456366.1', 'GL456393.1', 'GL456216.1', 'GL456379.1', 'JH584304.1', 'GL456212.1', 'JH584302.1', 'JH584303.1', 'GL456210.1', 'GL456219.1', 'JH584300.1', 'JH584298.1', 'JH584294.1', 'GL456354.1', 'JH584296.1', 'JH584297.1', 'GL456221.1', 'JH584293.1', 'GL456350.1', 'GL456211.1', 'JH584301.1', 'GL456233.1', 'JH584299.1']

See that the long name is OK for chr 3:

print fa['3'].long_name
3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF

See that chr 4 is not OK:

print fa['4'].long_name
dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF

The next two listings are longish, so I describe them here: first a listing of all the names, then all the long names. The names are OK for all chrs. The long names are all OK except for 4.

for record in fa:
... print record.name
...
1
10
11
12
13
14
15
16
17
18
19
2
3
4
5
6
7
8
9
MT
X
Y
JH584295.1
JH584292.1
GL456368.1
GL456396.1
GL456359.1
GL456382.1
GL456392.1
GL456394.1
GL456390.1
GL456387.1
GL456381.1
GL456370.1
GL456372.1
GL456389.1
GL456378.1
GL456360.1
GL456385.1
GL456383.1
GL456213.1
GL456239.1
GL456367.1
GL456366.1
GL456393.1
GL456216.1
GL456379.1
JH584304.1
GL456212.1
JH584302.1
JH584303.1
GL456210.1
GL456219.1
JH584300.1
JH584298.1
JH584294.1
GL456354.1
JH584296.1
JH584297.1
GL456221.1
JH584293.1
GL456350.1
GL456211.1
JH584301.1
GL456233.1
JH584299.1

Now the long names

for record in fa:
... print record.long_name
...
1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF
13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF
14 dna:chromosome chromosome:GRCm38:14:1:124902244:1 REF
15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF
16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF
17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF
18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF
19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF
2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF
3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF
dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF
5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF
6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF
7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF
8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF
9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF
MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF
X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF
Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF
JH584295.1 dna:scaffold scaffold:GRCm38:JH584295.1:1:1976:1 REF
JH584292.1 dna:scaffold scaffold:GRCm38:JH584292.1:1:14945:1 REF
GL456368.1 dna:scaffold scaffold:GRCm38:GL456368.1:1:20208:1 REF
GL456396.1 dna:scaffold scaffold:GRCm38:GL456396.1:1:21240:1 REF
L456359.1 dna:scaffold scaffold:GRCm38:GL456359.1:1:22974:1 REF
GL456382.1 dna:scaffold scaffold:GRCm38:GL456382.1:1:23158:1 REF
GL456392.1 dna:scaffold scaffold:GRCm38:GL456392.1:1:23629:1 REF
GL456394.1 dna:scaffold scaffold:GRCm38:GL456394.1:1:24323:1 REF
GL456390.1 dna:scaffold scaffold:GRCm38:GL456390.1:1:24668:1 REF
GL456387.1 dna:scaffold scaffold:GRCm38:GL456387.1:1:24685:1 REF
GL456381.1 dna:scaffold scaffold:GRCm38:GL456381.1:1:25871:1 REF
GL456370.1 dna:scaffold scaffold:GRCm38:GL456370.1:1:26764:1 REF
GL456372.1 dna:scaffold scaffold:GRCm38:GL456372.1:1:28664:1 REF
GL456389.1 dna:scaffold scaffold:GRCm38:GL456389.1:1:28772:1 REF
GL456378.1 dna:scaffold scaffold:GRCm38:GL456378.1:1:31602:1 REF
GL456360.1 dna:scaffold scaffold:GRCm38:GL456360.1:1:31704:1 REF
GL456385.1 dna:scaffold scaffold:GRCm38:GL456385.1:1:35240:1 REF
GL456383.1 dna:scaffold scaffold:GRCm38:GL456383.1:1:38659:1 REF
GL456213.1 dna:scaffold scaffold:GRCm38:GL456213.1:1:39340:1 REF
GL456239.1 dna:scaffold scaffold:GRCm38:GL456239.1:1:40056:1 REF
GL456367.1 dna:scaffold scaffold:GRCm38:GL456367.1:1:42057:1 REF
GL456366.1 dna:scaffold scaffold:GRCm38:GL456366.1:1:47073:1 REF
GL456393.1 dna:scaffold scaffold:GRCm38:GL456393.1:1:55711:1 REF
GL456216.1 dna:scaffold scaffold:GRCm38:GL456216.1:1:66673:1 REF
GL456379.1 dna:scaffold scaffold:GRCm38:GL456379.1:1:72385:1 REF
JH584304.1 dna:scaffold scaffold:GRCm38:JH584304.1:1:114452:1 REF
GL456212.1 dna:scaffold scaffold:GRCm38:GL456212.1:1:153618:1 REF
JH584302.1 dna:scaffold scaffold:GRCm38:JH584302.1:1:155838:1 REF
JH584303.1 dna:scaffold scaffold:GRCm38:JH584303.1:1:158099:1 REF
GL456210.1 dna:scaffold scaffold:GRCm38:GL456210.1:1:169725:1 REF
GL456219.1 dna:scaffold scaffold:GRCm38:GL456219.1:1:175968:1 REF
JH584300.1 dna:scaffold scaffold:GRCm38:JH584300.1:1:182347:1 REF
JH584298.1 dna:scaffold scaffold:GRCm38:JH584298.1:1:184189:1 REF
JH584294.1 dna:scaffold scaffold:GRCm38:JH584294.1:1:191905:1 REF
GL456354.1 dna:scaffold scaffold:GRCm38:GL456354.1:1:195993:1 REF
JH584296.1 dna:scaffold scaffold:GRCm38:JH584296.1:1:199368:1 REF
JH584297.1 dna:scaffold scaffold:GRCm38:JH584297.1:1:205776:1 REF
GL456221.1 dna:scaffold scaffold:GRCm38:GL456221.1:1:206961:1 REF
JH584293.1 dna:scaffold scaffold:GRCm38:JH584293.1:1:207968:1 REF
GL456350.1 dna:scaffold scaffold:GRCm38:GL456350.1:1:227966:1 REF
GL456211.1 dna:scaffold scaffold:GRCm38:GL456211.1:1:241735:1 REF
JH584301.1 dna:scaffold scaffold:GRCm38:JH584301.1:1:259875:1 REF
GL456233.1 dna:scaffold scaffold:GRCm38:GL456233.1:1:336933:1 REF
JH584299.1 dna:scaffold scaffold:GRCm38:JH584299.1:1:953012:1 REF

Indexing [0:0] is different than standard Python string

>>> gfasta = pyfaidx.Fasta('....fa', as_raw=True)
>>> gfasta['chr1'][0:0]

Returns the whole chromosome

>>> s='ABCD'
>>> s[0:0]

Returns an empty string

IMHO, pyfaidx should return an empty string.

Check sequence is composed of allowed characters

http://en.wikipedia.org/wiki/FASTA_format

Add nose test for Genome.neg

Pass default_seq parameter when calling Fasta init method

Take a look at the manuscript?

The manuscript is something... maybe @brentp, @azalea, and @swheelan can give me some comments? Feel free to edit as you see fit - the document is version controlled.

Split header lines on a value

Faidx and Fasta option to re-build the index file

Decrease memory footprint with large indices

@jsilter Has a use case for opening the entire NCBI non redundant nucleotide database, which is currently problematic. The index for this database is 24,312,914 entries and 1.2G on disk. The in-memory index is far larger than this, and so it's possible that we might not be able to store an index this large in memory with the current data structures. However, there are some simple ways to reduce memory footprint already:

remove Faidx raw_index attribute. I'm not sure anyone was using this, so it should not be an issue.
nest named tuples instead of another dict in Faidx.index's OrderedDict
- 31% space savings just by using named tuples (40% total savings w/ dropping raw_index list)
- additional 4% savings by using OrderedBytesDict (not sure it's worth it.)

Expose FastaRecord `iter` method for hybrid line/object-based iteration

Sometimes it would be useful to do something like:

for line in FastaRecord:
  do something with the line

Currently FastaRecord's __getitem__ returns the entire contents of a FASTA entry as a string. Maybe a __iter__ method could return smaller chunks to limit memory consumption. Might be able to use the read-ahead buffer from #26.

>>> from pyfaidx import Fasta
>>> ref = Fasta('my.fasta')
>>> ref.keys()
['myseq']
>>> ref['myseq'][:5].upper()
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'Sequence' object has no attribute 'upper'
>>> ref['myseq'][:5].split('T')
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'Sequence' object has no attribute 'split'

To fix this:

The easy way is to change this line in Faidx.fetch():

return Sequence(name=rname, start=int(start + 1),
                     end=int(end), seq=seq.replace('\n', ''))

return seq.replace('\n', '')

i.e. return a pure string, rather than a Sequence object.

The hard way is to reimplement string methods in Sequence object, like Bio.Seq.Seq.

mdshw5 / pyfaidx Goto Github PK

pyfaidx's People

Stargazers

Watchers

Forkers

pyfaidx's Issues

Recommend Projects

Recommend Topics

Recommend Org