The ghost-tree from jtfouquier

ghost-tree output files are in a folder now; provide alternative option to name files

ghost-tree output files should be in a folder (this might be a better option than the user selecting a name for all of their files...).

Allow option for Generalized Time Reversible (GTR) model to be selected instead of Jukes Cantor

Might be nice to add this as a feature in the future.

Missing node lengths on phylogenetic trees & node.name is returning values

The branch lengths are coming directly from FastTree output or being appended using skbio's append method in TreeNode. When FastTree determines there is no length it gives it a value of "0.0", so I am not sure why there are lengths of "None". Maybe this is the skbio TreeNode? I'm not sure.

I checked a backbone tree with skbio (which is just 18S aligned sequences run on FastTree without any other modification or skbio methods) and it looks like it's outputting a length for some node.names when I traverse through the tree. So I am not sure what's going on.

I also ran the same script as below on a full ghost-tree .nwk and it looks like what it's doing is adding the "None" value to every insertion it makes into the backbone. So possibly everywhere there is an insertion of a "mini" tree, there is a "None" value too.

I'll email some test files.

I used this to check the backbone tree:

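from skbio import TreeNode

# Load the backbone tree (FastTree output, no other modification).
backbone_tree = TreeNode.read("nr_backbone_tree_gt.nwk", format="newick")

# Write one "name<TAB>length" row per node; the file is closed automatically.
with open("backbone_tree_analysis_file_032915.txt", "w") as output:
    for node in backbone_tree.traverse():
        output.write(str(node.name) + "\t" + str(node.length) + "\n")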

Small output from analysis file (first column is node.name, 2nd is node.length):

None None
0.788 0.00128
0.806 0.00053
0.981 0.00055
AY665783.1.1694 0.00137
0.044 0.00055
AY657010.1.1771 0.00131
AF426952.1.1784 0.00618
0.737 0.00064
0.973 0.00054
0.968 0.00055
AY771602.1.1776 0.00066
0.492 0.00055
0.882 0.00132

FastTree warnings need to be handled differently

Currently ghost-tree scaffold hybrid-tree runs FastTree, which in turn emits a warning for every non-nucleotide character, for example: Ignored unknown character K (seen 2 times). This means that the terminal is spammed with warnings. Running this command in an IPython notebook will cause the notebook to crash.

This could be dealt with either by a script that filters invalid characters prior to running ghost-tree scaffold hybrid-tree, so that ghost-tree scaffold hybrid-tree would fail if any invalid characters were present, or by a single warning that says "non-nucleotide characters were detected and ignored".
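The single-warning option could be sketched roughly as below. This is only an illustration: the function name summarize_unknown_characters is hypothetical, and the regular expression assumes that every FastTree warning follows the one example quoted above, which may not hold for all FastTree versions.

```python
import re
from collections import Counter

# Pattern based on the single warning quoted in this issue, e.g.
# "Ignored unknown character K (seen 2 times)". Assumed, not verified
# against every FastTree release.
_WARNING = re.compile(r"Ignored unknown character (\S) \(seen (\d+) times?\)")


def summarize_unknown_characters(stderr_text):
    """Collapse per-character warnings into one message.

    Returns None when the text contains no such warning.
    """
    counts = Counter()
    for char, seen in _WARNING.findall(stderr_text):
        counts[char] += int(seen)
    if not counts:
        return None
    details = ", ".join(f"{c} x{n}" for c, n in sorted(counts.items()))
    return f"non-nucleotide characters were detected and ignored ({details})"
```

The captured stderr of the FastTree process would be passed in as a string, and the returned message printed once instead of echoing every line.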

update to scikit-bio 0.2.3

@JTFouquier scikit-bio 0.2.3 went live today, so I recommend updating to this latest version by running pip install scikit-bio --upgrade. setup.py also needs to be updated to depend on scikit-bio >= 0.2.3. This release included bug fixes related to #9. Please see the release notes for more details and let me know if you have any issues with updating. Thanks!

Build new trees for new UNITE DBs & updated ghost-tree code

improve most_common_genus complexity (ghosttree.scaffold)

Suggestion from D.M.:

line 126, which determines most_common_genus is doing a list lookup for each name, so the runtime of it is O(N * M) where N is the number of entries in otu_genus_list and M is the number of unique names in that list. This can be reduced to around O(N + M) by casting the list to a Counter, and then max'ing the items.
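A minimal sketch of the Counter approach suggested above. The function signature is an assumption made for illustration; only the names most_common_genus and otu_genus_list come from the suggestion.

```python
from collections import Counter


def most_common_genus(otu_genus_list):
    """Return the genus that occurs most often in otu_genus_list."""
    if not otu_genus_list:
        raise ValueError("otu_genus_list is empty")
    # Counter tallies every name in one pass over the N entries;
    # most_common(1) then picks the largest tally among the M unique names.
    genus, _count = Counter(otu_genus_list).most_common(1)[0]
    return genus
```

When two genera are tied, Counter.most_common returns the one encountered first, so tie handling may need an explicit rule if that matters for grafting.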

graft unrooted extension trees as "rooted at midpoint"

Extension trees are currently not rooted at midpoint before they get grafted onto the foundation tree.

From Greg:

Jennifer, do you root the extension trees? If not, that is probably something that we should add. Since we're using the tree for diversity calculations, I think midpoint rooting (TreeNode.root_at_midpoint) should suffice for this - you'd just need to add that before grafting. We could probably also get @johnchase to do this instead if you're too tied up with the new job - just let me know.

add license

Need license file at root of repo and in setup.py (in setup function call and PyPI classifiers).

ghost-tree is "genus" specific, there are cases (e.g., LSU needs order) needing higher levels

ghosttree.scaffold minor revision

Suggestions from D.M.

should use os.mkdir
use of globals is probably not necessary and adds complexity
there are some blocks of code here where it isn't immediately clear what is going on.

Add license to top of all files

License must be at top of all files in code.

fix coverage

Hmm, adding subprocess.Popen has increased the amt of lines that cannot be "covered" by unit testing (muscle and fasttree). Coverage is ~70% now.
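One way to bring lines that call external tools back under coverage is to replace subprocess.Popen with a mock during the test. The sketch below uses only the standard library; run_tool is a hypothetical wrapper, not an existing ghost-tree function, and the command list is a placeholder.

```python
import subprocess
import unittest
from unittest import mock


def run_tool(command):
    """Run an external command and return its stdout as text."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    if process.returncode != 0:
        raise RuntimeError(stderr.decode())
    return stdout.decode()


class RunToolTests(unittest.TestCase):
    @mock.patch("subprocess.Popen")
    def test_returns_stdout(self, popen):
        # Fake process object: no real binary is launched.
        popen.return_value.communicate.return_value = (b"(a,b);\n", b"")
        popen.return_value.returncode = 0
        # Placeholder command; the arguments are never executed.
        self.assertEqual(run_tool(["FastTree", "aln.fasta"]), "(a,b);\n")
        popen.assert_called_once()
```

With the process mocked, the wrapper's own logic is exercised even on machines where muscle and FastTree are not installed.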

Test different marker gene DBs (not UNITE and SILVA) for broad use of ghost-tree

Test different databases (not UNITE and SILVA) for broad use of ghost-tree

Add warning into the docs about potential for clustered sequences to be mislabeled

This shouldn't be an issue in practice as the emphasis is tree structure and branch length. However, the following situation is at least possible, and could lead to sequences in the ghost tree having a different taxon label than what they have in the reference. As such, it might be a good idea to include a warning about using a taxonomy derived off of the ghost-tree.

If OTU A has 10 sequences, 6 of which are labeled FOO and 4 BAR, and OTU B has 1000 sequences, 501 of which are labeled FOO and 499 BAZ, the resulting OTU will represent 1010 sequences, of which only 507 are FOO.
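To make the arithmetic concrete, the label counts from the example can be pooled with a Counter. This is a sketch only; the variable names are illustrative and nothing here is part of ghost-tree.

```python
from collections import Counter

# Label counts taken from the example above.
otu_a = Counter({"FOO": 6, "BAR": 4})
otu_b = Counter({"FOO": 501, "BAZ": 499})

# Clustering the two OTUs together pools their sequences.
merged = otu_a + otu_b
label, count = merged.most_common(1)[0]
total = sum(merged.values())

# Report how much of the merged OTU actually carries the winning label.
print(f"{label}: {count} of {total} sequences ({count / total:.1%})")
```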

fix SUMACLUST link

Find SUMACLUST broken link and check installation directions.

root foundation tree by midpoint

Extensions/tips have already been rooted, but foundation tree has not. Early results show that this doesn't impact analysis much (ANOSIMS + PCoAs), but rooting is necessary.

Update "get_otus... " to maintain underscores (not default in skbio)

link between ghost trees and specific UNITE files should be more apparent

@johnchase and @karenschwarzberg both got tripped up by this.

Output tips only for filter_otus_from_otu_table.py in QIIME

Accession IDs (tips) are needed for filtering the .biom table with filter_otus_from_otu_table.py in the QIIME workflow prior to using beta_diversity_through_plots.py. This needs to be automatic output.

explore scaling of extension tree branch lengths relative to foundation tree

One thing we could try here would be to add a multiplicative factor to the cli that gives users the ability to multiply branches in the extension tree by that value (though I realize that might be overly simplistic, and I'm open to other suggestions). This value would need to always be greater than zero, and a value of 1 would correspond to no scaling. For the paper, we can then test varying this parameter to determine if it affects downstream diversity metrics in a meaningful way (if the resulting UniFrac distance matrices are highly correlated by Mantel r, then it doesn't), and if so, experiment with it to optimize our detection of the small and large effect sizes. Thoughts?
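The scaling idea could look roughly like this. The Node class is a hypothetical stand-in so the sketch runs without scikit-bio; it is not the skbio TreeNode class, and scale_branch_lengths is an illustrative name rather than an existing ghost-tree function.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not skbio's TreeNode)."""
    name: Optional[str] = None
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def scale_branch_lengths(root, factor):
    """Multiply every branch length in the tree rooted at root by factor."""
    if factor <= 0:
        raise ValueError("scaling factor must be greater than zero")
    stack = [root]
    while stack:
        node = stack.pop()
        # Nodes without a length (None) are left untouched.
        if node.length is not None:
            node.length *= factor
        stack.extend(node.children)
    return root
```

A factor of 1 leaves the extension tree unchanged, which matches the "no scaling" default described above.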

Fix Readme

The readme is an rst file, but there is html in it. I suspect this was formerly a markdown document.

Improve log file for ghost-tree

The log file needs to 1) clarify which FastTree errors refer to the foundation tree, 2) clarify why there are genera without errors, and 3) use better spacing/line breaks.

add install instructions

These should also note what external tools are required, their versions, website URL, and where/when they are required.

analyze the effect of grafting trees in the wrong place

From @rob-knight:

it might be worth testing how bad it is if you graft onto the wrong name: with unifrac it’s probably not that bad so you might consider introducing some error to simulate what happens in the case where you make the wrong decision on a polyphyletic taxon?

fix flake8 errors

Noticed that there are flake8 errors. I'll work on fixing this.

Script completion "notification" clarification

When ghost-tree is completed, make it more clear in the terminal.

running ghost-tree tests leaves tmp directory behind

When running ghost-tree's unit tests, a tmp/ directory is created in the current working directory but is not cleaned up. The directory contains a single file (mini_seq_gt.fasta).
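One possible fix is to let each test create its scratch directory with tempfile and register its removal as a cleanup. The test class and test body below are illustrative assumptions, not the existing ghost-tree tests; only the file name mini_seq_gt.fasta comes from the report above.

```python
import os
import tempfile
import unittest


class ScaffoldTests(unittest.TestCase):
    def setUp(self):
        # Fresh scratch directory outside the current working directory.
        self._tmp = tempfile.TemporaryDirectory()
        # Registered cleanups run even when the test fails.
        self.addCleanup(self._tmp.cleanup)
        self.tmp_dir = self._tmp.name

    def test_writes_fasta(self):
        path = os.path.join(self.tmp_dir, "mini_seq_gt.fasta")
        with open(path, "w") as handle:
            handle.write(">seq1\nACGT\n")
        self.assertTrue(os.path.exists(path))
```

Code under test would need to accept the directory as an argument rather than hard-coding tmp/ for this to work.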

Fix subprocess.communicate for FastTree installation (var issue)

Fix subprocess.communicate for FastTree installation (var issue)... issue reported by user.

add to existing script documentation

The current script documentation is pretty sparse. It'd be useful to have (at a minimum) a description for each (sub)command and any arguments or options that it accepts.

Add same filter step for extensions as foundation tree

Filter high-entropy positions and gap positions prior to extending, as is already done for the foundation tree. @gregcaporaso I was under the impression that this step was important specifically for trees with excessively long branches and many sequences (i.e., necessary for the foundation tree and not the extension trees). I can add this step later. I'm wondering how much work it would be to do the automated analyses that we discussed at some point. I'm happy maintaining and improving ghost-tree, but the tediousness of redoing these analyses has been extremely intense. We should discuss this further at some point. Thanks!

Try adding a Click progress bar (for large files ghost-tree takes some time)

via Click http://click.pocoo.org/4/utils/

Click makes options wrap by single characters; options are difficult to read

Click works great, but it still wraps the command options by single characters. Sometimes there are 4 to 5 options in ghost-tree, making reading options very difficult. I’ve asked Click how the bug fix is going, and they have a workaround but haven't accepted it as a PR. Can ghost-tree be packaged with a modified version of Click (not released by Click)?

Click Issue:
pallets/click#231 (they gave me directions on how to implement the PR, but I'm not sure it's worth trying due to future packaging of ghost-tree)
Click PR:
pallets/click#240

According to @gregcaporaso via previous email (3/26/15):
"it's better to just leave as-is than try to build a modified version of click"

Example:

fix readme formatting

The readme applies to the code under the legacy directory. This content should be moved to a readme under legacy.

ghost-tree filter-alignment-positions works, but unit testing is challenging...

@ebolyen, @jairideout, @johnchase: we worked on filter.py (filter-alignment-positions) in AZ, but I just moved it into 3 functions and got rid of the workaround (aln = aln.omit_gap_positions(1.0 - np.finfo(float).eps)) after skbio was updated.

https://github.com/JTFouquier/ghost-tree/blob/master/ghosttree/filter.py

@ebolyen said it would be difficult to test. o. m. g.

Is it safe to say that enough of this script uses skbio that I don't really need to unit test it? :) If I need to test it, any tips? I have tried unit testing this so many times.

Thanks!

Question about broken build on ghost-tree due to Flake8 error

And I broke the build tonight...but fixed it thanks to Travis. My question is why did Flake8 not show an error locally for setup.py, but then when I pushed a commit Travis was not happy after checking the build (I had no reason whatsoever to mess with the setup.py file until it gave the error on Travis). It doesn't look like there were any Flake8 updates... any thoughts? I'm just curious...if you don't know, just ignore this..... 😱 @jairideout @ebolyen

Use ghost-tree and perform simsam.py testing using QIIME

See previous email/documentation for test cases and simulated/mock communities.

Travis is not running flake8 over scripts directory

D.M. found issue:
Travis is not running flake8 over the scripts directory. The ghost-tree script is not pep8 compliant.

keep current check for "g" but if some lines don't contain it, then assume gunidentified

user report

Add option for user to scale branch lengths

Allow phylogenetic tree as a foundation input instead of only foundation alignment .fasta

Due to SILVA providing alignments, it was assumed that an alignment would be provided, but in some cases users might want to provide a .nwk tree as the foundation.

Location to host finished ghost-tree .nwk files for users to download

Is there a more professional website for hosting the finalized .nwk ghost-tree files than Google Drive or Dropbox? It would be easy for me to make a variety of the .nwk trees (combinations from different versions of UNITE/SILVA) and have them available. For example, the UNITE DB for the .nwk I used was from a slightly older version of UNITE because that was what I used for analyzing my original ITS data a few years ago.

Thanks!

@jairideout and @gregcaporaso:

Add/organize files for revision of ghost-tree paper

add real and simulated OTU tables, along with the commands used for simulation

This will allow us to automate the analyses done in the paper, and will support some of the additional analyses that @rob-knight and @wasade suggested in our discussions with them. @JTFouquier, I can do the analyses if you can provide those files and the commands that you used for simulation and distance matrix creation.
