lattice-automation / seqfold Goto Github PK

View Code? Open in Web Editor NEW

79.0 79.0 12.0 135 KB

nucleic acid folding

License: MIT License

Makefile 0.56% Python 99.44%

dna dna-folding dna-structure-prediction folding rna rna-folding rna-structure-prediction structure-prediction

seqfold's People

Contributors

Stargazers

Watchers

Forkers

laserson timothystiles conan19842002 simone-codeluppi goosang-yu xuesoso juankaamez duao42 ftry wjidea nhhaidee guohao150

seqfold's Issues

Maximum Depth?

When trying to run DG on a sequence that is 1448 base pairs long, I encounter the following error:
RecursionError: maximum recursion depth exceeded in comparison

with the traceback:

Traceback (most recent call last):
File "C:\Users\Yash Lal\Desktop\Coevolution-Informatics\playground2.py", line 5, in
v = dg(s)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 89, in dg
structs = fold(seq, temp)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 69, in fold
v_cache, w_cache = _cache(seq, temp)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 161, in _cache
_w(seq, 0, n - 1, temp, v_cache, w_cache, emap)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 198, in _w
w1 = _w(seq, i + 1, j, temp, v_cache, w_cache, emap)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 198, in _w
w1 = _w(seq, i + 1, j, temp, v_cache, w_cache, emap)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 198, in _w
w1 = _w(seq, i + 1, j, temp, v_cache, w_cache, emap)
[Previous line repeated 989 more times]
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 191, in _w
if w_cache[i][j] != STRUCT_DEFAULT:
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 24, in eq
return self.e == other.e and self.ij == other.ij
RecursionError: maximum recursion depth exceeded in comparison

Is there anything I can do to make dg work with such a long sequence?

Thank you!

Add function that exposes a fold Cache

Useful for things like primer design where the user is going to query many possible start + end positions, checking each for a high delta g, ie low foldability

Ionic conditions correction

Hello,

Any plans to include oligomer corrections for ionic conditions (Na+ and Mg++), similar to mFold?

Thanks!

from typing import List (one line I had to add to python demo example in README.md) Also ct format doable?

my own TODO

Great work! I may try to tweak to get output in .vienna format which has the mfe appened on same line as dot-brackets. I want to count bulges, and one way to do so involves converting .vienna dot-bracket to .ct (connect) format. Seems doable ;-)

Single-char Ttpo in README.md where I think seqfold -v -l output should have ddg not dg that is: i j ddg description not i j dg description
BTW what does ddg stand for? May also want to mention version(s) of python 3.x required. I think typing did not start showing up until 3.5 or 3.6, right? And it works differently in 3.9 per https://stackoverflow.com/questions/57505071/nameerror-name-list-is-not-defined/

Also, one line I needed in python demo

Note I had to add one extra line to the python demo in README.md
from typing import List
Adding above line got rid of NameError... as shown below. I just pasted python demo from README.md into fileseqfold-demo.py file.

$ head -4  seqfold-demo.py. # after adding `from typing import List` line
from seqfold import dg, dg_cache, fold, Struct
from typing import List
# just returns minimum free energy

$ python. seqfold-demo.py

-15.200000000000001
   0   48   -2.1  STACK:GG/CC    
   1   47   -2.1  STACK:GG/CC    
   2   46   -1.3  STACK:GA/CT    
   3   45   -1.3  STACK:AG/TC    
   4   44   -2.1  STACK:GG/CC    
   5   43   -1.5  STACK:GT/CA    
   6   42   -1.3  STACK:TC/AG    
   7   41   -0.2  BIFURCATION:4n/3h
   9   22   -1.0  STACK:TT/AA    
  10   21   -0.9  STACK:TA/AT    
  11   20   -1.5  STACK:AC/TG    
  12   19    3.1  HAIRPIN:CA/GG  
  25   39   -2.1  STACK:CC/GG    
  26   38   -2.2  STACK:CG/GC    
  27   37   -2.1  STACK:GG/CC    
  28   36    3.4  HAIRPIN:GT/CT

Without that one extra line I get `NameError...`

$ head -3  seqfold-demo.py # without `from typing import List` line
from seqfold import dg, dg_cache, fold, Struct

# just returns minimum free energy

$ python seqfold-demo.py
Traceback (most recent call last):
  File "seqfold-demo.py", line 7, in <module>
    structs: List[Struct] = fold("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC")
NameError: name 'List' is not defined
$ python --version
Python 3.8.8

Numba compatibility?

I tried to Numba jit the dg function as a longshot and it appears that it failed. Have y'all tried Numba compiling any parts of this to increase the performance?

Expected result dose not match calculated result

Hi, I was just tried to work with the following example from the README file:

just returns minimum free energy
dg("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC", temp = 37.0) # -12.94

And the result I got is -13.4 (and not -12.94 as stated in the example).
Could this be due to dependencies or something else?

Thanks

Add conda to readme

Hello there, seqfold is now on bioconda. You could add this to your readme:

conda install -c bioconda seqfold

Great packages btw!

Add print method for folding output

I like this minimalist format:

		SEQ	--------///---\\\---
		STR	GTGGCCGCGGCGAAGCCTCT

Possible bug

Hi, I really like the package. A good alternative to Vienna (which isn't on pypi).

But when verifing your example sequence output against Vienna's RNAfold, it looks like seqfold misses two pair bonds? the lowercased g-c and a-u

         GGGAGGUCgUUACAUCUGGGUAAcaCCGGUACUGAUCCGGuGACCUCCC
RNAfold  (((((((((((((......)))))(((((.......))))))))))))) (-25.10)
seqfold  ((((((((.((((......))))..((((.......)))).)))))))) (-25.5)

inf values returned from dg() on RNA sequences

Hello,

I'm not sure if this is necessarily a bug, but I was hoping to gain some clarity on why inf values would be returned for some RNA sequences. Note this is not the case if the sequence is converted to DNA.

from seqfold import dg

# RNA sequences
>>> dg("CAGCGCGGCGGGCGGGAGUCCGGCGCGCCCUCCAUCCCCGGCGGCGUCGGCAAGGAGUAG", temp=37)
inf
>>> dg("ACAAGCAGGAGCCUAUAAAUCCCUGAGGGGGCUGCUGGGACAUCACAGAAGGUGAAAGUC", temp=37)
inf

# DNA sequences
>>> dg("CAGCGCGGCGGGCGGGAGTCCGGCGCGCCCTCCATCCCCGGCGGCGTCGGCAAGGAGTAG", temp=37)
-12.6
>>> dg("ACAAGCAGGAGCCTATAAATCCCTGAGGGGGCTGCTGGGACATCACAGAAGGTGAAAGTC", temp=37)
-5.3

dot_bracket output length does not match sequence length

Hello,

For the given sequence ATATCTAGGTAGATAGCTAAGTAAATAACTACGTACATACCTCTCGGTCGATCGCTCAGTCAATCACTCCGTCCATCCCGGGGAGGGCGGAAGGACGGCAGGCCGAGAGCGAAAGAACGACAGACCGCGCAAGCACGCCAGCCCAAAACAACCACACCCCTTTTGTTTATTTCTTGGTTGATTGCTTAGTTAATTACTTC, the output of dot_bracket function with temp=55, the resulting dot_bracket result: ........................................((.(((.((..((((((...........((((((.(((..........))).))))))........))))))..)).))).)) does not have the same length

linear structure probability

hello,
Thank you very much for the tool you developed to predict the secondary structure of a sequence. However, if we need to predict the probability of a sequence maintaining a linear structure, how should we calculate it?

ilead-cong

Incorrect DNA_NN values in dictionary

Hi, I think your values are entered wrong.

Below are corrected values based on The Thermodynamics of DNA Structural Motifs
Annual Review of Biophysics and Biomolecular Structure

Vol. 33:415-440 (Volume publication date 9 June 2004)
https://doi.org/10.1146/annurev.biophys.32.110601.141800
'''
DNA_NN: BpEnergy = {
"init": (0.2, -5.7),
"init_G/C": (0.0, 0.0),
"init_A/T": (2.2, 6.9),
"sym": (0, -1.4),
"AA/TT": (-7.6, -21.3),
"AT/TA": (-7.2, -20.4),
"TA/AT": (-7.2, -21.3),
"CA/GT": (-8.5, -22.7),
"GT/CA": (-8.4, -22.4),
"CT/GA": (-7.8, -21.0),
"GA/CT": (-8.2, -22.2),
"CG/GC": (-10.6, -27.2),
"GC/CG": (-9.8, -24.4),
"GG/CC": (-8.0, -19.9),
}
'''

If this is not correct could you explain your values?

Thanks for putting this together! It is a great resource!

Add degenerate base functionality

I have been using seqfold to design oligo constructs, but have run into issues when I use degenerate bases (like in unique molecular identifiers, UMIs). Would there be a straightforward way to allow for degenerate bases? A minimal version of this might be to ignore their interactions, but include them in the backbone? Thanks!

DOI for citations

Hi! Thank you for implementing this! There is quite a lack of open source tools for energy calculations, so this fills a much-needed niche.

Would you ever consider adding a DOI to make it easier to cite?

I think this repository/package is increasingly being sought out due to a lack of python bindings for other tools, and I imagine some people would like to cite it in papers (perhaps even citing a particular release for reproducibility). If you're interested/willing, Github makes DOI registration fairly straightforward when paired with Zenodo.

Converts lists to Numpy arrays

Attempt at perf improvements.

Convert lists to numpy arrays

seqfold not finding folding at certain sequence lengths

Hello,
Thank you for making and maintaining this tool.
I'm trying to figure out the best folding for a sequence of unknown length, so I'm trying all reasonable sequence lengths 2-100 nt
the issue is that for certain lengths, seqfold is unable to find more than a single paired base, even though there are plenty of hairpins in the sequence.
in the attached image, each line is the dotbrackt view of one seqfold run, each with +1nt of the same sequence

do you know what this software behaviour could be caused by?

Best

Tue

Folding visualizations and UNAfold differences

Hello and thanks so much for writing accessible code! It's well needed.

I am coming from using UNAfold online, which is luckily still around on IDT's website right now, but I worry that it won't necessarily be there forever. I'm also interested in coding with these programs so UNAfold is out, since the owner of it hasn't responded to anything and I can't find their code anywhere. But when I was using seqfold, I noticed a bunch of differences. Is there a good explanation for this? I trust you, but I'm just not familiar with exactly where all the ambiguity may be coming from. Do you believe that your numbers are more accurate?

Back onto using UNAfold, is there a feature, where I can see the folding that occurs? When I run the fold command in seqfold I just see the base pairing that will occur. Is this for the most likely structure? When I run UNAfold, I get the dG and base pairing for multiple different folded structures.

I also think a brief manual explaining what i and j are would definitely help a bit. Again thanks for all your code and have a nice day!

RNA Linear folding feature

A week later I tried to use seqfold to calculate MFE and secondary structure but it took too much time for an HTTP request through a serverless environment (waiting time is normally limited to 30-60 seconds for performance reasons). My use case was specifically for vaccine RNA sequences, so most of them are greater than 3kbp.

So I used Linear Fold by running a bash script to build the project and putting them inside a folder of my project and using some python wrapping to run the command line LinearFold binary and parsing the output. It was much faster and

I think I can do a better job by really taking the BeamCKYParser function and wrapping using some lib like CPython, however, it will take some time.

What do you think?

folded structure is not the same size of the input sequence and differs significantly from mfold prediction

For example:

from seqfold import fold, dot_bracket, dg 
seq = 'AAAAAAAAAAAGGCCCCCTCTCTTGC'   
structure = dot_bracket(fold(seq))                    
energy = dg(seq) 
len(seq) != len(structure)            # 26 != 19
structure == '..........(((...)))'
energy == 0.4

RNAFold structure:

with dG: -0.95

and mfold structure:

with dG: -0.27

Am I using it wrong or is something off here?

Thanks!

lattice-automation / seqfold Goto Github PK

seqfold's People

Contributors

Stargazers

Watchers

Forkers

seqfold's Issues

my own TODO

Also, one line I needed in python demo

Without that one extra line I get NameError...

Recommend Projects

Recommend Topics

Recommend Org

Without that one extra line I get `NameError...`