lattice-automation / seqfold Goto Github PK
View Code? Open in Web Editor NEWnucleic acid folding
License: MIT License
nucleic acid folding
License: MIT License
When trying to run DG on a sequence that is 1448 base pairs long, I encounter the following error:
RecursionError: maximum recursion depth exceeded in comparison
with the traceback:
Traceback (most recent call last):
File "C:\Users\Yash Lal\Desktop\Coevolution-Informatics\playground2.py", line 5, in
v = dg(s)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 89, in dg
structs = fold(seq, temp)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 69, in fold
v_cache, w_cache = _cache(seq, temp)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 161, in _cache
_w(seq, 0, n - 1, temp, v_cache, w_cache, emap)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 198, in _w
w1 = _w(seq, i + 1, j, temp, v_cache, w_cache, emap)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 198, in _w
w1 = _w(seq, i + 1, j, temp, v_cache, w_cache, emap)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 198, in _w
w1 = _w(seq, i + 1, j, temp, v_cache, w_cache, emap)
[Previous line repeated 989 more times]
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 191, in _w
if w_cache[i][j] != STRUCT_DEFAULT:
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 24, in eq
return self.e == other.e and self.ij == other.ij
RecursionError: maximum recursion depth exceeded in comparison
Is there anything I can do to make dg work with such a long sequence?
Thank you!
Useful for things like primer design where the user is going to query many possible start + end positions, checking each for a high delta g, ie low foldability
Hello,
Any plans to include oligomer corrections for ionic conditions (Na+ and Mg++), similar to mFold?
Thanks!
Great work! I may try to tweak to get output in .vienna format which has the mfe appened on same line as dot-brackets. I want to count bulges, and one way to do so involves converting .vienna dot-bracket to .ct (connect) format. Seems doable ;-)
Single-char Ttpo in README.md where I think seqfold -v -l output should have ddg not dg that is: i j ddg description
not i j dg description
BTW what does ddg stand for? May also want to mention version(s) of python 3.x required. I think typing
did not start showing up until 3.5 or 3.6, right? And it works differently in 3.9 per https://stackoverflow.com/questions/57505071/nameerror-name-list-is-not-defined/
Note I had to add one extra line to the python demo in README.md
from typing import List
Adding above line got rid of NameError...
as shown below. I just pasted python demo from README.md
into fileseqfold-demo.py
file.
$ head -4 seqfold-demo.py. # after adding `from typing import List` line
from seqfold import dg, dg_cache, fold, Struct
from typing import List
# just returns minimum free energy
$ python. seqfold-demo.py
-15.200000000000001
0 48 -2.1 STACK:GG/CC
1 47 -2.1 STACK:GG/CC
2 46 -1.3 STACK:GA/CT
3 45 -1.3 STACK:AG/TC
4 44 -2.1 STACK:GG/CC
5 43 -1.5 STACK:GT/CA
6 42 -1.3 STACK:TC/AG
7 41 -0.2 BIFURCATION:4n/3h
9 22 -1.0 STACK:TT/AA
10 21 -0.9 STACK:TA/AT
11 20 -1.5 STACK:AC/TG
12 19 3.1 HAIRPIN:CA/GG
25 39 -2.1 STACK:CC/GG
26 38 -2.2 STACK:CG/GC
27 37 -2.1 STACK:GG/CC
28 36 3.4 HAIRPIN:GT/CT
NameError...
$ head -3 seqfold-demo.py # without `from typing import List` line
from seqfold import dg, dg_cache, fold, Struct
# just returns minimum free energy
$ python seqfold-demo.py
Traceback (most recent call last):
File "seqfold-demo.py", line 7, in <module>
structs: List[Struct] = fold("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC")
NameError: name 'List' is not defined
$ python --version
Python 3.8.8
I tried to Numba jit the dg
function as a longshot and it appears that it failed. Have y'all tried Numba compiling any parts of this to increase the performance?
Hi, I was just tried to work with the following example from the README file:
just returns minimum free energy
dg("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC", temp = 37.0) # -12.94
And the result I got is -13.4 (and not -12.94 as stated in the example).
Could this be due to dependencies or something else?
Thanks
Hello there, seqfold is now on bioconda. You could add this to your readme:
conda install -c bioconda seqfold
Great packages btw!
I like this minimalist format:
SEQ --------///---\\\---
STR GTGGCCGCGGCGAAGCCTCT
Hi, I really like the package. A good alternative to Vienna (which isn't on pypi).
But when verifing your example sequence output against Vienna's RNAfold, it looks like seqfold misses two pair bonds? the lowercased g-c and a-u
GGGAGGUCgUUACAUCUGGGUAAcaCCGGUACUGAUCCGGuGACCUCCC
RNAfold (((((((((((((......)))))(((((.......))))))))))))) (-25.10)
seqfold ((((((((.((((......))))..((((.......)))).)))))))) (-25.5)
Hello,
I'm not sure if this is necessarily a bug, but I was hoping to gain some clarity on why inf values would be returned for some RNA sequences. Note this is not the case if the sequence is converted to DNA.
from seqfold import dg
# RNA sequences
>>> dg("CAGCGCGGCGGGCGGGAGUCCGGCGCGCCCUCCAUCCCCGGCGGCGUCGGCAAGGAGUAG", temp=37)
inf
>>> dg("ACAAGCAGGAGCCUAUAAAUCCCUGAGGGGGCUGCUGGGACAUCACAGAAGGUGAAAGUC", temp=37)
inf
# DNA sequences
>>> dg("CAGCGCGGCGGGCGGGAGTCCGGCGCGCCCTCCATCCCCGGCGGCGTCGGCAAGGAGTAG", temp=37)
-12.6
>>> dg("ACAAGCAGGAGCCTATAAATCCCTGAGGGGGCTGCTGGGACATCACAGAAGGTGAAAGTC", temp=37)
-5.3
Hello,
For the given sequence ATATCTAGGTAGATAGCTAAGTAAATAACTACGTACATACCTCTCGGTCGATCGCTCAGTCAATCACTCCGTCCATCCCGGGGAGGGCGGAAGGACGGCAGGCCGAGAGCGAAAGAACGACAGACCGCGCAAGCACGCCAGCCCAAAACAACCACACCCCTTTTGTTTATTTCTTGGTTGATTGCTTAGTTAATTACTTC
, the output of dot_bracket
function with temp=55, the resulting dot_bracket result: ........................................((.(((.((..((((((...........((((((.(((..........))).))))))........))))))..)).))).))
does not have the same length
hello,
Thank you very much for the tool you developed to predict the secondary structure of a sequence. However, if we need to predict the probability of a sequence maintaining a linear structure, how should we calculate it?
ilead-cong
Hi, I think your values are entered wrong.
Below are corrected values based on The Thermodynamics of DNA Structural Motifs
Annual Review of Biophysics and Biomolecular Structure
Vol. 33:415-440 (Volume publication date 9 June 2004)
https://doi.org/10.1146/annurev.biophys.32.110601.141800
'''
DNA_NN: BpEnergy = {
"init": (0.2, -5.7),
"init_G/C": (0.0, 0.0),
"init_A/T": (2.2, 6.9),
"sym": (0, -1.4),
"AA/TT": (-7.6, -21.3),
"AT/TA": (-7.2, -20.4),
"TA/AT": (-7.2, -21.3),
"CA/GT": (-8.5, -22.7),
"GT/CA": (-8.4, -22.4),
"CT/GA": (-7.8, -21.0),
"GA/CT": (-8.2, -22.2),
"CG/GC": (-10.6, -27.2),
"GC/CG": (-9.8, -24.4),
"GG/CC": (-8.0, -19.9),
}
'''
If this is not correct could you explain your values?
Thanks for putting this together! It is a great resource!
I have been using seqfold to design oligo constructs, but have run into issues when I use degenerate bases (like in unique molecular identifiers, UMIs). Would there be a straightforward way to allow for degenerate bases? A minimal version of this might be to ignore their interactions, but include them in the backbone? Thanks!
Hi! Thank you for implementing this! There is quite a lack of open source tools for energy calculations, so this fills a much-needed niche.
Would you ever consider adding a DOI to make it easier to cite?
I think this repository/package is increasingly being sought out due to a lack of python bindings for other tools, and I imagine some people would like to cite it in papers (perhaps even citing a particular release for reproducibility). If you're interested/willing, Github makes DOI registration fairly straightforward when paired with Zenodo.
Attempt at perf improvements.
Hello,
Thank you for making and maintaining this tool.
I'm trying to figure out the best folding for a sequence of unknown length, so I'm trying all reasonable sequence lengths 2-100 nt
the issue is that for certain lengths, seqfold is unable to find more than a single paired base, even though there are plenty of hairpins in the sequence.
in the attached image, each line is the dotbrackt view of one seqfold run, each with +1nt of the same sequence
do you know what this software behaviour could be caused by?
Best
Tue
Hello and thanks so much for writing accessible code! It's well needed.
I am coming from using UNAfold online, which is luckily still around on IDT's website right now, but I worry that it won't necessarily be there forever. I'm also interested in coding with these programs so UNAfold is out, since the owner of it hasn't responded to anything and I can't find their code anywhere. But when I was using seqfold, I noticed a bunch of differences. Is there a good explanation for this? I trust you, but I'm just not familiar with exactly where all the ambiguity may be coming from. Do you believe that your numbers are more accurate?
Back onto using UNAfold, is there a feature, where I can see the folding that occurs? When I run the fold command in seqfold I just see the base pairing that will occur. Is this for the most likely structure? When I run UNAfold, I get the dG and base pairing for multiple different folded structures.
I also think a brief manual explaining what i and j are would definitely help a bit. Again thanks for all your code and have a nice day!
A week later I tried to use seqfold to calculate MFE and secondary structure but it took too much time for an HTTP request through a serverless environment (waiting time is normally limited to 30-60 seconds for performance reasons). My use case was specifically for vaccine RNA sequences, so most of them are greater than 3kbp.
So I used Linear Fold by running a bash script to build the project and putting them inside a folder of my project and using some python wrapping to run the command line LinearFold binary and parsing the output. It was much faster and
I think I can do a better job by really taking the BeamCKYParser
function and wrapping using some lib like CPython, however, it will take some time.
What do you think?
For example:
from seqfold import fold, dot_bracket, dg
seq = 'AAAAAAAAAAAGGCCCCCTCTCTTGC'
structure = dot_bracket(fold(seq))
energy = dg(seq)
len(seq) != len(structure) # 26 != 19
structure == '..........(((...)))'
energy == 0.4
RNAFold structure:
with dG: -0.95
and mfold structure:
with dG: -0.27
Am I using it wrong or is something off here?
Thanks!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.