Coder Social home page Coder Social logo

mdshw5 / pyfaidx Goto Github PK

View Code? Open in Web Editor NEW
447.0 9.0 74.0 11.99 MB

Efficient pythonic random access to fasta subsequences

Home Page: https://pypi.python.org/pypi/pyfaidx

License: Other

Python 100.00%
python bioinformatics fasta indexing dna protein genomics bgzf samtools

pyfaidx's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pyfaidx's Issues

pyfaidx correctly installed but not working. error: unrecognized arguments

I installed pyfaidx as indicated.
Typing

faidx 
usage: faidx [-h] [-b BED] [--stats] [-c] [-r] [-n] [--split-files] [--lazy]
             [--default-seq DEFAULT_SEQ] [-d DELIMITER]
             [--mask-with-default-seq | --mask-by-case] [--version]
             fasta [regions [regions ...]]
.....

hower, when I try to run something it always tells me that there is an unrecognized arguments in my comand. The same is true if I use the example files.

faidx --stats tests/data/genes.fasta
usage: faidx [-h] [-b BED] [--stats] [-c] [-r] [-n] [--split-files] [--lazy]
             [--default-seq DEFAULT_SEQ] [-d DELIMITER]
             [--mask-with-default-seq | --mask-by-case] [--version]
             fasta [regions [regions ...]]
faidx: error: unrecognized arguments: tests/data/genes.fasta

I do not understand what I am missing here.

Reindex file if time stamp on fasta is newer than on index?

It would be great if the index were automatically recreated if the fasta file has a new time stamp than the .fai file. I just did this test as we had to rebuild a fasta we are using and reindexing did not occur.

thanks for considering!

Read actual defline from FASTA file index "gap"

One current design limitation of pyfaidx is that it mirrors the samtools indexing behavior of truncating headers after whitespace. There is a good reason for this - any whitespace in the identifier would break the index file. A side effect of this is that frequently the "description" in a header will be lost when reading into the file using the index. It seems like an option to recover the full header line would be useful, and pretty cheap to implement.

To determine the byte offset and length of the header from the index file, we can determine the byte end of the preceding sequence by adding unprintable characters, and this should be the byte start of the real header line. We can then read from header byte start to sequence offset and save this as something like Sequence.long_name.

Is pyfaidx suited to do full genome lookups?

I am using fasta files representing the whole genome, with the only header lines being >chr1, >chr2 etc. and my indexes are huge, i.e. 37093012.

Will pyfaidx still perform well? If yes, this info should probably be added to the README.md.

Why does Bio.SeqIO and Pyfaidx give different results?

I'm sure it is me mucking up somehow, but when I use Pyfaidx and Bio.SeqIO I get different results...

reference_genome_fasta = "/home/endrebak/genomes/hg38/hg38.fa"
position = slice(68344571,68348572)
hg38 = Fasta(reference_genome_fasta)
gene_pyfaidx = hg38["chr15"][position].seq
gene_pyfaidx[:20] #"aaaaaaaaaaaattccctgt"
#but with Bio.SeqIO...
with open(reference_genome_fasta, "rU") as genome_file_handle:
    chromosome_dictionary = SeqIO.to_dict(SeqIO.parse(genome_file_handle, "fasta"))
gene_bio_seqio = chromosome_dictionary["chr15"][position]
str(gene_bio_seqio[:20].seq) #'AGGGGGACTAGACTCGAAGA'

And when I try to look up the same sequence in UCSC, I get yet another sequence: https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr15:68346496-68346499&hgsid=198858039_mEdzdwARqPcSygOiGTzyEzcF87Ed

(only four first nucleotides shown ("ATGA") to get nuclotide resolution and added one to index since UCSC one-indexed).

Sequence contains non-DNA characters

This error is coming up in 0.3.4 and does not come up in 0.3.1

Traceback (most recent call last):
File "Step1.py", line 151, in
main()
File "Step1.py", line 123, in main
desired_sequence = -chromosome_infile[chromosome][start:stop] #if negative strand, give reverse complement in fasta file
File "/data/zappadata_p2/pschaugh/RepTags/venv/lib/python2.7/site-packages/pyfaidx/init.py", line 105, in neg
return self[::-1].complement
File "/data/zappadata_p2/pschaugh/RepTags/venv/lib/python2.7/site-packages/pyfaidx/init.py", line 140, in complement
comp = self.class(self.name, complement(self.seq),
File "/data/zappadata_p2/pschaugh/RepTags/venv/lib/python2.7/site-packages/pyfaidx/init.py", line 591, in complement
raise ValueError("Sequence contains non-DNA characters: \n {0}".format(''.join(wrap_sequence(70, seq))))
ValueError: Sequence contains non-DNA characters:
ttgtatggttttgaatggtctacgtagttttagacacgatcctcccatcaaatatcacatttacgaatac
tatttgttcttctttctagaatttaattgttagattgtaatatggaatttcttcatctttttccctcttg
tgtgatttgggtttaggtcatcttcctttctttattatttctgatctcatgtctgttgactttatttctt
atccttgtaccatctcattcaatcgttttggttcaagtcaagaaacttttttagtggttttaattacttg
ttaatcaagctgactcattctattttctcccttcagagttattgattttagtctttacattcacgtctgt
aatggtgactaaattgtctttattttactaatataattattctgttatcatacttataaacatacggttt
tttaacctgttggatctgttttacctatttaaggatctgtatatggtggatggttctgattnggtgtttc
tttatcttttatgcttgtctgtatattgatcgtttctctaacttagttactaggttttagacggttgttt
cttttagggtcctcgactaccgaagtaagcacttaagatggtttgtaaatttcttctcaattgtgattag
gaagagttttcaaaggttttttcaaattccattcccctgtgaaagtttgaataagaagttccggttctaa
tgagaaagtggtttcgttctgtttccatgatgttattttaattcgatgtctagttatagagactatttat
gactacatttttaggagttggtttataatcacttgtcttaattcactgtgtagtttttttaatatgtgta
ctggagcaccctaaataaagactttaagttcctatcaagttgtatacttttacttaattacattattgat
gtaattgtcttacttctttgtcaagggtacactagtagagttaactacctctttttataaactgttttag
gttgtggaaaggtactattttttgtangtggtttgatccttgtcttcctttgatggagttgtattaattc
cggtatatacctttcgggtaacgattgtagtatgaattactactttctaacttttgaaaagaagattcga
ttctgtgttcccttcttacggatgaaaatttaagtcgtatcattacctttagaactggtctatttaatcc
tttatttttcgtaattttaaccctttatttttcattttgctagggaaagtatctactacattaaagtata
catcttttggggtttctaaggtattctttgtactatcttgattattgttttaatcgtttgaatgtcttat
gagtcgtatgtctttaatcgacacaaagatatgttattgttacttgttggactttttctttaattctttc
gttaaagtaaatttcaccctggtttgttttattttttatctttgtttgacttggttcctccactttctac
acatgtgacttttgatactttataacgagtttctttaatttctgtgatgtttttatctgtaggaaacaag
tatctactcttctggattttaactttctacaccagagatctaatttacgttaaatagaattttaaggttg
tagttaaaaaaaaatgttttgatctttttagataggattgtaagtatatcttagagttccctgggattta
tcgtttttgttagaactttttcttgtgtcaacctccagagtgtgaagaatcaaagttttaaagattttta
gatgtcataatttttgtctcaccataactgcatatctgtctgtgtatctagtcactctatcttatctctc
gggtctttatttgagaacatatataccggaacactaaaaaatttttcaacgatcctgataggttatccct
tttcctgtcagaaaagttgtttacaatgacccttttgacctatcggtgtacgttttcttacttcaacatg
ggaatggaatgtggtatacgtctttaatttttatcagtttctgaatctatactctcaatcttatgagtgt
tcttttacatccccgtgtcgaagtaccgtaacgtaaaccgttcctaaagaaccgatgctgtggctttcgt
atccgttgtttttttagtctattaaccttgtacatttataatttttaaaaacacttagtttactgtgaca
gttgtctcattttccgttgtgtgacttaccgtcctttataaacatttaatatatagactattgcctaatg
ataggtatatatttcttgagaacgttaagttgttggttttttgttgtgttaattttttatctgtttcctg
aacttatatataaagaggttactatatgtttactggttattcatgtacttttctacgagttgttgtactt
attaatccttttacgtttagttttggtgtcactctatgatgtgttgatatctttgttataccaccaaaga
gtctaaagtgtactaggtcgttaatgtggagacccatatatggagtttcttaattttcgtctcggagttt
ctctacaaagatgtgggtacaggtgttgtcgtaataagtgttatcagttttctaccttcgttgggttcac
agacaactgtctacttacctatttgttttacaccgtataggtgttttaccttataataaattggaatttt
tcctttttatta

Single short lines in the middle of a file are not detected

>AB821309.1
ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAACCTTGTCCCTGGCCCGGCCCT
CCTTCAGTTTAGTTGAGGATACCACATTAGAGCCAGAAGATGCCATCTCATCCGGAGATGATGAGA
CACCGATGGTGCGGAAGATTTTGTCAGTGAGAACAGTAACAACAAGAGAGCACCATACTGGACCAACACA

does not raise an error

Fasta(as_raw=True) behavior

I am finding that to get as_raw=True to work correctly, I have to call it twice:

ipdb> fasta = Fasta(fasta_filename, as_raw = True)
ipdb> chrom = fasta.keys()[0]
ipdb> fasta[chrom][1:2]
>chrII:236683-237683(+):2-2
T
ipdb> fasta = Fasta(fasta_filename, as_raw = True)
ipdb> fasta[chrom][1:2]
u'T'

Note that the index for fasta_filename doesn't exist before the first call.

Is this expected behaviour? Does as_raw only work on an existing index? i.e. the index is created at the first call, and then as_raw is used?

Subclass Fasta to ingest VCF and return consensus sequence

This should be straightforward using PyVCF and pysam. The subclass could be called FastaVariant and return a consensus sequence with either homozygous variants or heterozygous variants or both included as substitutions. Skip over indels and complex variants. MNPs can be handled but the length of the return sequence should be checked.

[bug] GATK-bundle - duplicate key?

I would like to replace pygr with pyfaidx.
Previously I have been using the human_g1k_v37_decoy.fasta file, which is part of the GATK bundle (available through FTP) for getting sequences. I use the same file for alignments.

>>> from pyfaidx import Fasta
>>> genome = Fasta('human_g1k_v37_decoy.fasta')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 552, in __init__
    filt_function=filt_function)
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 209, in __init__
    self.read_fai(split_char)
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 244, in read_fai
    raise ValueError('Duplicate key "%s"' % rname)
ValueError: Duplicate key "['2', 'dna:chromosome', 'chromosome:GRCh37:2:1:243199373:1']"

The human_g1k_v37_decoy.fasta.fai file does not seem to contain that key twice.

Assess whether Fasta class is actually pygr API compatible

I wrote the Fasta class to provide some pygr interface elements, and then mention that pyfaidx has a pygr-compatible interface. I'm not sure this is true, and should evaluate whether implementing more of the pygr interface would be worthwhile.

Floating point values are allowed as file offsets

Using python 3.4.0 and whatever version of pyfaidx is currently grabbed by pip (you should add __version__ to the module)...

Simple access (and slicing) both end up trying to use floats where ints are required.

Traceback (most recent call last):
  File "./Ngaps.py", line 207, in <module>
    sys.exit(Main(argv=None))
  File "./Ngaps.py", line 100, in Main
    x = ref[0][100]
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 460, in __getitem__
    return self._fa.get_seq(self.name, n + 1, n + 1)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 561, in get_seq
    return self.faidx.fetch(name, start, end)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 336, in fetch
    self.fill_buffer(name, start, end + self.read_ahead)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 326, in fill_buffer
    seq = self.from_file(name, start, end)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 380, in from_file
    seq = self.file.read(seq_blen).decode()
TypeError: 'float' object cannot be interpreted as an integer

A quick fix (perhaps a hack) is to cast start and end to ints in from_file:

        start0 = int(start) - 1  # make coordinates [0,1)

and

        seq_len = int(end) - start0

Sorry, no time at the moment to thoroughly test, much less fork your most current code and proved you with a proper diff.

PS: The title of this issue isn't quite correct... It is a call to read where the error gets tripped, not an index.

PPS: Useful module! Thanks. Not super complicated, but thanks for doing it right and saving the rest of us from having to reinvent the wheel (or use SeqIO where it really isn't appropriate IMO).

long_name missing leading chr number

I think I may have found a bug. I can work around it, but thought you would want to fix it. It is really weird.

fasta file: mouse NCBI 38: GRCm38_68.fa

Symptom: on one record (chr 4), the long_name is missing the leading chr number. All the others are OK.

Hope the listings below help. (and thanks VERY much for pyfaidx…)
-Al

Confirm that the record is correct in the file:

$ awk '/^>4/{print $0; exit;}' GRCm38_68.fa

4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF

Open the file and see the top level:

from pyfaidx import Fasta
fa = Fasta('GRCm38_68.fa')
fa.keys()
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y', 'JH584295.1', 'JH584292.1', 'GL456368.1', 'GL456396.1', 'GL456359.1', 'GL456382.1', 'GL456392.1', 'GL456394.1', 'GL456390.1', 'GL456387.1', 'GL456381.1', 'GL456370.1', 'GL456372.1', 'GL456389.1', 'GL456378.1', 'GL456360.1', 'GL456385.1', 'GL456383.1', 'GL456213.1', 'GL456239.1', 'GL456367.1', 'GL456366.1', 'GL456393.1', 'GL456216.1', 'GL456379.1', 'JH584304.1', 'GL456212.1', 'JH584302.1', 'JH584303.1', 'GL456210.1', 'GL456219.1', 'JH584300.1', 'JH584298.1', 'JH584294.1', 'GL456354.1', 'JH584296.1', 'JH584297.1', 'GL456221.1', 'JH584293.1', 'GL456350.1', 'GL456211.1', 'JH584301.1', 'GL456233.1', 'JH584299.1']

See that the long name is OK for chr 3:

print fa['3'].long_name
3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF

See that chr 4 is not OK:

print fa['4'].long_name
dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF

The next two listings are longish, so I describe them here: first a listing of all the names, then all the long names. The names are OK for all chrs. The long names are all OK except for 4.

for record in fa:
... print record.name
...
1
10
11
12
13
14
15
16
17
18
19
2
3
4
5
6
7
8
9
MT
X
Y
JH584295.1
JH584292.1
GL456368.1
GL456396.1
GL456359.1
GL456382.1
GL456392.1
GL456394.1
GL456390.1
GL456387.1
GL456381.1
GL456370.1
GL456372.1
GL456389.1
GL456378.1
GL456360.1
GL456385.1
GL456383.1
GL456213.1
GL456239.1
GL456367.1
GL456366.1
GL456393.1
GL456216.1
GL456379.1
JH584304.1
GL456212.1
JH584302.1
JH584303.1
GL456210.1
GL456219.1
JH584300.1
JH584298.1
JH584294.1
GL456354.1
JH584296.1
JH584297.1
GL456221.1
JH584293.1
GL456350.1
GL456211.1
JH584301.1
GL456233.1
JH584299.1

Now the long names

for record in fa:
... print record.long_name
...
1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF
13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF
14 dna:chromosome chromosome:GRCm38:14:1:124902244:1 REF
15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF
16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF
17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF
18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF
19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF
2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF
3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF
dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF
5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF
6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF
7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF
8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF
9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF
MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF
X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF
Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF
JH584295.1 dna:scaffold scaffold:GRCm38:JH584295.1:1:1976:1 REF
JH584292.1 dna:scaffold scaffold:GRCm38:JH584292.1:1:14945:1 REF
GL456368.1 dna:scaffold scaffold:GRCm38:GL456368.1:1:20208:1 REF
GL456396.1 dna:scaffold scaffold:GRCm38:GL456396.1:1:21240:1 REF
L456359.1 dna:scaffold scaffold:GRCm38:GL456359.1:1:22974:1 REF
GL456382.1 dna:scaffold scaffold:GRCm38:GL456382.1:1:23158:1 REF
GL456392.1 dna:scaffold scaffold:GRCm38:GL456392.1:1:23629:1 REF
GL456394.1 dna:scaffold scaffold:GRCm38:GL456394.1:1:24323:1 REF
GL456390.1 dna:scaffold scaffold:GRCm38:GL456390.1:1:24668:1 REF
GL456387.1 dna:scaffold scaffold:GRCm38:GL456387.1:1:24685:1 REF
GL456381.1 dna:scaffold scaffold:GRCm38:GL456381.1:1:25871:1 REF
GL456370.1 dna:scaffold scaffold:GRCm38:GL456370.1:1:26764:1 REF
GL456372.1 dna:scaffold scaffold:GRCm38:GL456372.1:1:28664:1 REF
GL456389.1 dna:scaffold scaffold:GRCm38:GL456389.1:1:28772:1 REF
GL456378.1 dna:scaffold scaffold:GRCm38:GL456378.1:1:31602:1 REF
GL456360.1 dna:scaffold scaffold:GRCm38:GL456360.1:1:31704:1 REF
GL456385.1 dna:scaffold scaffold:GRCm38:GL456385.1:1:35240:1 REF
GL456383.1 dna:scaffold scaffold:GRCm38:GL456383.1:1:38659:1 REF
GL456213.1 dna:scaffold scaffold:GRCm38:GL456213.1:1:39340:1 REF
GL456239.1 dna:scaffold scaffold:GRCm38:GL456239.1:1:40056:1 REF
GL456367.1 dna:scaffold scaffold:GRCm38:GL456367.1:1:42057:1 REF
GL456366.1 dna:scaffold scaffold:GRCm38:GL456366.1:1:47073:1 REF
GL456393.1 dna:scaffold scaffold:GRCm38:GL456393.1:1:55711:1 REF
GL456216.1 dna:scaffold scaffold:GRCm38:GL456216.1:1:66673:1 REF
GL456379.1 dna:scaffold scaffold:GRCm38:GL456379.1:1:72385:1 REF
JH584304.1 dna:scaffold scaffold:GRCm38:JH584304.1:1:114452:1 REF
GL456212.1 dna:scaffold scaffold:GRCm38:GL456212.1:1:153618:1 REF
JH584302.1 dna:scaffold scaffold:GRCm38:JH584302.1:1:155838:1 REF
JH584303.1 dna:scaffold scaffold:GRCm38:JH584303.1:1:158099:1 REF
GL456210.1 dna:scaffold scaffold:GRCm38:GL456210.1:1:169725:1 REF
GL456219.1 dna:scaffold scaffold:GRCm38:GL456219.1:1:175968:1 REF
JH584300.1 dna:scaffold scaffold:GRCm38:JH584300.1:1:182347:1 REF
JH584298.1 dna:scaffold scaffold:GRCm38:JH584298.1:1:184189:1 REF
JH584294.1 dna:scaffold scaffold:GRCm38:JH584294.1:1:191905:1 REF
GL456354.1 dna:scaffold scaffold:GRCm38:GL456354.1:1:195993:1 REF
JH584296.1 dna:scaffold scaffold:GRCm38:JH584296.1:1:199368:1 REF
JH584297.1 dna:scaffold scaffold:GRCm38:JH584297.1:1:205776:1 REF
GL456221.1 dna:scaffold scaffold:GRCm38:GL456221.1:1:206961:1 REF
JH584293.1 dna:scaffold scaffold:GRCm38:JH584293.1:1:207968:1 REF
GL456350.1 dna:scaffold scaffold:GRCm38:GL456350.1:1:227966:1 REF
GL456211.1 dna:scaffold scaffold:GRCm38:GL456211.1:1:241735:1 REF
JH584301.1 dna:scaffold scaffold:GRCm38:JH584301.1:1:259875:1 REF
GL456233.1 dna:scaffold scaffold:GRCm38:GL456233.1:1:336933:1 REF
JH584299.1 dna:scaffold scaffold:GRCm38:JH584299.1:1:953012:1 REF

Decrease memory footprint with large indices

@jsilter Has a use case for opening the entire NCBI non redundant nucleotide database, which is currently problematic. The index for this database is 24,312,914 entries and 1.2G on disk. The in-memory index is far larger than this, and so it's possible that we might not be able to store an index this large in memory with the current data structures. However, there are some simple ways to reduce memory footprint already:

  • remove Faidx raw_index attribute. I'm not sure anyone was using this, so it should not be an issue.
  • nest named tuples instead of another dict in Faidx.index's OrderedDict
    • 31% space savings just by using named tuples (40% total savings w/ dropping raw_index list)
    • additional 4% savings by using OrderedBytesDict (not sure it's worth it.)

Expose FastaRecord `__iter__` method for hybrid line/object-based iteration

Sometimes it would be useful to do something like:

for line in FastaRecord:
  do something with the line

Currently FastaRecord's __getitem__ returns the entire contents of a FASTA entry as a string. Maybe a __iter__ method could return smaller chunks to limit memory consumption. Might be able to use the read-ahead buffer from #26.

Option for mutable Fasta file

Sometimes (masking regions) it is convenient to modify a fasta file in place if possible. We can read the number of line breaks to be inserted, and for entries of the same size as the original fetched region we can modify the fasta file in place.

'Sequence' object breaks string methods upper(), split(), etc

See example below:

>>> from pyfaidx import Fasta
>>> ref = Fasta('my.fasta')
>>> ref.keys()
['myseq']
>>> ref['myseq'][:5].upper()
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'Sequence' object has no attribute 'upper'
>>> ref['myseq'][:5].split('T')
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'Sequence' object has no attribute 'split'

To fix this:

The easy way is to change this line in Faidx.fetch():

return Sequence(name=rname, start=int(start + 1),
                     end=int(end), seq=seq.replace('\n', ''))

to

return seq.replace('\n', '')

i.e. return a pure string, rather than a Sequence object.

The hard way is to reimplement string methods in Sequence object, like Bio.Seq.Seq.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.