jtfouquier / ghost-tree
creating hybrid-gene phylogenetic trees for diversity analyses
License: BSD 3-Clause "New" or "Revised" License
ghost-tree output files should be in a folder (this might be a better option than the user selecting a name for all of their files...).
Might be nice to add this as a feature in the future.
The branch lengths are coming directly from FastTree output or being appended using skbio's append method in TreeNode. When FastTree determines there is no length it gives it a value of "0.0" so I am not sure why there are lengths of "none". Maybe this is the skbio TreeNode? I'm not sure.
I checked a backbone tree with skbio (which is just 18S aligned sequences run on FastTree without any other modification or skbio methods) and it looks like it's outputting a length for some node.names when I traverse through the tree. So I am not sure what's going on.
I also ran the same script as below on a full ghost-tree .nwk and it looks like what it's doing is adding the "None" value to every insertion it makes into the backbone. So possibly everywhere there is an insertion of a "mini" tree, there is a "None" value too.
I'll email some test files.
I used this to check the backbone tree:
from skbio import TreeNode

backbone_tree = TreeNode.read("nr_backbone_tree_gt.nwk", format="newick")
# Write each node's name and branch length as tab-separated columns.
with open("backbone_tree_analysis_file_032915.txt", "w") as output:
    for node in backbone_tree.traverse():
        output.write(str(node.name) + "\t" + str(node.length) + "\n")
Small output from analysis file (first column is node.name, 2nd is node.length):
None None
0.788 0.00128
0.806 0.00053
0.981 0.00055
AY665783.1.1694 0.00137
0.044 0.00055
AY657010.1.1771 0.00131
AF426952.1.1784 0.00618
0.737 0.00064
0.973 0.00054
0.968 0.00055
AY771602.1.1776 0.00066
0.492 0.00055
0.882 0.00132
Currently ghost-tree scaffold hybrid-tree runs FastTree, which in turn generates a warning for any non-nucleotide character, for example: Ignored unknown character K (seen 2 times). This means that the terminal is spammed with warnings. Running this command in an IPython notebook will cause the notebook to crash.
This could be dealt with either by a script that filters invalid characters prior to running ghost-tree scaffold hybrid-tree, so that ghost-tree scaffold hybrid-tree would fail if any invalid characters were present, or by a single warning that says "non-nucleotide characters were detected and ignored".
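A minimal sketch of the "single warning" option, assuming the alignment is available as FASTA lines. The function names and the set of accepted characters are hypothetical and only for illustration; the exact characters FastTree accepts should be checked against its documentation.

```python
from collections import Counter

# Assumption: this set of accepted characters is illustrative only.
VALID_CHARS = set("ACGTUN-.")


def count_invalid_characters(fasta_lines, valid_chars=VALID_CHARS):
    """Count characters outside valid_chars across all sequence lines."""
    invalid = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        invalid.update(ch for ch in line.upper() if ch not in valid_chars)
    return invalid


def summarize_invalid(invalid):
    """Build one summary message instead of one message per character."""
    if not invalid:
        return ""
    details = ", ".join(
        "{} (seen {} times)".format(ch, n) for ch, n in sorted(invalid.items())
    )
    return "non-nucleotide characters were detected and ignored: " + details


example = [">seq1", "ACGTKK", ">seq2", "AC-GTR"]
found = count_invalid_characters(example)
print(summarize_invalid(found))
```

The same counter could instead be used to fail early (for example, raising an error when it is non-empty) if filtering before the run is preferred.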
@JTFouquier scikit-bio 0.2.3 went live today, so I recommend updating to this latest version by running pip install scikit-bio --upgrade. setup.py also needs to be updated to depend on scikit-bio >= 0.2.3. This release included bug fixes related to #9. Please see the release notes for more details and let me know if you have any issues with updating. Thanks!
Suggestion from D.M.:
line 126, which determines most_common_genus is doing a list lookup for each name, so the runtime of it is O(N * M) where N is the number of entries in otu_genus_list and M is the number of unique names in that list. This can be reduced to around O(N + M) by casting the list to a Counter, and then max'ing the items.
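A rough sketch of the Counter approach described above. The code at line 126 is not shown here, so the function name and the example list are assumptions; only otu_genus_list comes from the suggestion.

```python
from collections import Counter


def most_common_genus(otu_genus_list):
    """Return the most frequent genus name without rescanning the list."""
    counts = Counter(otu_genus_list)  # one pass over the list: O(N)
    # One pass over the unique names: O(M)
    genus, _ = max(counts.items(), key=lambda item: item[1])
    return genus


genera = ["Mucor", "Candida", "Mucor", "Phoma", "Mucor", "Candida"]
print(most_common_genus(genera))
```

If two genera are tied, max returns whichever it meets first, so the choice depends on the order in which names first appear in the list.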
Extension trees are currently not rooted at midpoint before they get grafted onto the foundation tree.
From Greg:
Need license file at root of repo and in setup.py (in setup function call and PyPI classifiers).
Suggestions from D.M.
License must be at top of all files in code.
Hmm, adding subprocess.Popen has increased the amt of lines that cannot be "covered" by unit testing (muscle and fasttree). Coverage is ~70% now.
Test different databases (not UNITE and SILVA) for broad use of ghost-tree
This shouldn't be an issue in practice as the emphasis is tree structure and branch length. However, the following situation is at least possible, and could lead to sequences in the ghost tree having a different taxon label than what they have in the reference. As such, it might be a good idea to include a warning about using a taxonomy derived off of the ghost-tree.
If OTU A has 10 sequences, 6 of which are labeled FOO, 4 as BAR, and OTU B has 1000 sequences, 501 are labeled FOO, 499 as BAZ, the resulting OTU will represent 1010 sequences, of which only 501 are FOO.
Find SUMACLUST broken link and check installation directions.
Extensions/tips have already been rooted, but foundation tree has not. Early results show that this doesn't impact analysis much (ANOSIMS + PCoAs), but rooting is necessary.
Update "get_otus... " to maintain underscores (not default in skbio)
@johnchase and @karenschwarzberg both got tripped up by this.
Accession IDs (tips) are needed for filtering the .biom table for filter_otus_from_otu_table.py in the QIIME workflow prior to using beta_diversity_through_plots.py. This needs to be automatic output.
One thing we could try here would be to add a multiplicative factor to the cli that gives users the ability to multiply branches in the extension tree by that value (though I realize that might be overly simplistic, and I'm open to other suggestions). This value would need to always be greater than zero, and a value of 1 would correspond to no scaling. For the paper, we can then test varying this parameter to determine if it affects downstream diversity metrics in a meaningful way (if the resulting UniFrac distance matrices are highly correlated by Mantel r, then it doesn't), and if so, experiment with it to optimize our detection of the small and large effect sizes. Thoughts?
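To make the scaling idea concrete, here is a small sketch under stated assumptions. The Node class is a hypothetical stand-in for a tree node (not skbio's TreeNode), and scale_branch_lengths is an illustrative name, not an existing ghost-tree function.

```python
class Node:
    """Minimal stand-in for a tree node (hypothetical, not skbio's TreeNode)."""

    def __init__(self, name=None, length=None, children=None):
        self.name = name
        self.length = length
        self.children = children or []

    def traverse(self):
        """Yield this node, then every descendant, in preorder."""
        yield self
        for child in self.children:
            yield from child.traverse()


def scale_branch_lengths(root, factor):
    """Multiply every defined branch length by factor (must be > 0)."""
    if factor <= 0:
        raise ValueError("scaling factor must be greater than zero")
    for node in root.traverse():
        if node.length is not None:  # nodes without a length are left alone
            node.length *= factor
    return root


extension = Node(children=[Node("a", 0.5), Node("b", 0.25)])
scale_branch_lengths(extension, 2.0)
print([(n.name, n.length) for n in extension.traverse()])
```

A factor of 1 leaves the extension tree unchanged, which matches the "no scaling" default proposed above.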
The readme is an rst file, but there is html in it. I suspect this was formerly a markdown document.
Log file needs to clarify 1) which FastTree errors are referring to the foundation tree, 2) why there are genera without errors 3) improve spacing/return lines.
These should also note what external tools are required, their versions, website URL, and where/when they are required.
From @rob-knight:
it might be worth testing how bad it is if you graft onto the wrong name: with unifrac it’s probably not that bad so you might consider introducing some error to simulate what happens in the case where you make the wrong decision on a polyphyletic taxon?
Noticed that there are flake8 errors. I'll work on fixing this.
When ghost-tree is completed, make it more clear in the terminal.
When running ghost-tree's unit tests, a tmp/ directory is created in the current working directory but is not cleaned up. The directory contains a single file (mini_seq_gt.fasta).
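One possible way to avoid the leftover directory, sketched with the standard library only. The test class and its contents are hypothetical; only the file name mini_seq_gt.fasta comes from the report.

```python
import os
import shutil
import tempfile
import unittest


class TestWithTempDir(unittest.TestCase):
    def setUp(self):
        # Create a fresh scratch directory outside the working directory.
        self.tmp_dir = tempfile.mkdtemp()

    def tearDown(self):
        # Remove the directory and everything written into it.
        shutil.rmtree(self.tmp_dir)

    def test_writes_fasta(self):
        path = os.path.join(self.tmp_dir, "mini_seq_gt.fasta")
        with open(path, "w") as handle:
            handle.write(">seq1\nACGT\n")
        self.assertTrue(os.path.exists(path))
```

Because tearDown runs after every test, including failing ones, nothing is left behind in the directory the tests were started from.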
Fix subprocess.communicate for FastTree installation (var issue)... issue reported by user.
The current script documentation is pretty sparse. It'd be useful to have (at a minimum) a description for each (sub)command and any arguments or options that it accepts.
Perform filter high entropy positions and gap positions to tree prior to extending... @gregcaporaso I was under the impression that this step was important specifically for trees with excessively long branches and many sequences (i.e. necessary for foundation and not the extension trees). I can add this step later. Wondering how much work it would be to do the automated analyses that we discussed at some point. I'm happy maintaining and improving ghost-tree but the tediousness of these redo analyses has been extremely intense. We should discuss this further at some point. Thanks!
via Click http://click.pocoo.org/4/utils/
Click works great, but it still wraps the command options by single characters. Sometimes there are 4 to 5 options in ghost-tree, making reading options very difficult. I’ve asked Click how the bug fix is going, and they have a workaround but haven't accepted it as a PR. Can ghost-tree be packaged with a modified version of Click (not released by Click)?
Click Issue:
pallets/click#231 (they gave me directions on how to implement the PR, but I'm not sure it's worth trying due to future packaging of ghost-tree)
Click PR:
pallets/click#240
According to @gregcaporaso via previous email (3/26/15):
"it's better to just leave as-is then try to build a modified version of click"
Example:
The readme applies to the code under the legacy directory. This content should be moved to a readme under legacy.
@ebolyen, @jairideout, @johnchase: we worked on filter.py (filter-alignment-positions) in AZ, but I just moved it into 3 functions and got rid of the workaround (aln = aln.omit_gap_positions(1.0 - np.finfo(float).eps)) after skbio was updated.
https://github.com/JTFouquier/ghost-tree/blob/master/ghosttree/filter.py
@ebolyen said it would be difficult to test. o. m. g.
Is it safe to say that enough of this script uses skbio that I don't really need to unit test it? :) If I need to test it, any tips? I have tried unit testing this so many times.
Thanks!
And I broke the build tonight...but fixed it thanks to Travis. My question is why did Flake8 not show an error locally for setup.py, but then when I pushed a commit Travis was not happy after checking the build (I had no reason whatsoever to mess with the setup.py file until it gave the error on Travis). It doesn't look like there were any Flake8 updates... any thoughts? I'm just curious...if you don't know, just ignore this..... 😱 @jairideout @ebolyen
See previous email/documentation for test cases and simulated/mock communities.
D.M. found issue:
Travis is not running flake8 over the scripts directory. The ghost-tree script is not pep8 compliant.
user report
Allow phylogenetic tree as a foundation input instead of only foundation alignment .fasta
Due to SILVA providing alignments, it was assumed that an alignment would be provided, but in some cases users might want to provide a .nwk tree as the foundation.
Is there a more professional website for hosting the finalized .nwk ghost-tree files than Google Drive or Dropbox? It would be easy for me to make a variety of the .nwk trees (combinations from different versions of UNITE/SILVA) and have them available. For example, the UNITE DB for the .nwk I used was from a slightly older version of UNITE because that was what I used for analyzing my original ITS data a few years ago now.
Thanks!
@jairideout and @gregcaporaso:
This will allow us to automate the analyses done in the paper, and will support some of the additional analyses that @rob-knight and @wasade suggested in our discussions with them. @JTFouquier, I can do the analyses if you can provide those files and the commands that you used for simulation and distance matrix creation.
Position filtering in ghost-tree is affected by floating point precision bugs present in scikit-bio 0.2.2 and the following issue. scikit-bio/scikit-bio#815
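For anyone wondering how floating point precision can affect a gap threshold, here is a generic illustration in plain Python. It is not taken from scikit-bio and does not claim to reproduce how scikit-bio computes gap frequencies; it only shows why an exact comparison against 1.0 can miss a column that is entirely gaps.

```python
import sys

# Ten sequences, all gaps at one position: add up each sequence's
# share of the column as a fraction.
gap_fraction = 0.0
for _ in range(10):
    gap_fraction += 0.1

# The accumulated value falls just short of 1.0, so an exact test fails.
print(gap_fraction == 1.0)

# Lowering the threshold by machine epsilon tolerates the rounding error.
threshold = 1.0 - sys.float_info.epsilon
print(gap_fraction >= threshold)
```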
use subprocess.Popen instead of os.system
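A minimal sketch of the switch, assuming the external tool is called with a list of arguments. run_command is a hypothetical helper name, and the Python interpreter stands in for FastTree or muscle so the example runs anywhere.

```python
import subprocess
import sys


def run_command(args):
    """Run a command without a shell and return (returncode, stdout, stderr)."""
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # return str instead of bytes
    )
    stdout, stderr = process.communicate()  # waits for the process to finish
    return process.returncode, stdout, stderr


# The Python interpreter stands in for an external tool here.
code, out, err = run_command([sys.executable, "-c", "print('ok')"])
print(code, out.strip())
```

Unlike os.system, this captures stdout and stderr separately, so warnings from the tool can be logged or summarized instead of being printed straight to the terminal.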
Suggestion from D.M.:
line 102, which determines which sequences to write out, is performing a list look up. This can be reduced to O(1). It is also not using the skbio fasta formatter.
Suggestion from D.M.:
unnecessarily O(N*M), where N is the number of sequences in the foundation alignment, and M is the number of genus level names. It probably can be reduced to O(N) by structuring all_genus_list as a set.
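A sketch of the set-based lookup suggested above. The surrounding code is not shown here, so the record layout and the function name are assumptions; only all_genus_list comes from the suggestion.

```python
def filter_by_genus(records, all_genus_list):
    """Keep (genus, sequence) records whose genus is in all_genus_list."""
    genus_set = set(all_genus_list)  # built once: O(M)
    # Each membership test against a set is O(1) on average,
    # so the whole filter is O(N) rather than O(N * M).
    return [(genus, seq) for genus, seq in records if genus in genus_set]


records = [("Mucor", "ACGT"), ("Phoma", "GGCC"), ("Candida", "TTAA")]
kept = filter_by_genus(records, ["Mucor", "Candida"])
print(kept)
```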
add an image displaying the grafting of the extension trees to foundation. This helped me visualize it a lot...