jtfouquier / ghost-tree

creating hybrid-gene phylogenetic trees for diversity analyses

License: BSD 3-Clause "New" or "Revised" License

Python 84.03% Jupyter Notebook 15.97%
python python3 phylogenetics phylogenetic-trees diversity fungi fungal microbiome microbiology microbial-ecology

ghost-tree's Issues

Missing node lengths on phylogenetic trees & node.name is returning values

The branch lengths come directly from FastTree output or are appended using skbio's append method in TreeNode. When FastTree determines there is no length it assigns a value of "0.0", so I am not sure why there are lengths of "None". Maybe this comes from the skbio TreeNode?

I checked a backbone tree with skbio (which is just 18S aligned sequences run on FastTree without any other modification or skbio methods) and it looks like it's outputting a length for some node.names when I traverse through the tree. So I am not sure what's going on.

I also ran the same script as below on a full ghost-tree .nwk and it looks like what it's doing is adding the "None" value to every insertion it makes into the backbone. So possibly everywhere there is an insertion of a "mini" tree, there is a "None" value too.

I'll email some test files.

I used this to check the backbone tree:

from skbio import TreeNode

backbone_tree = TreeNode.read("nr_backbone_tree_gt.nwk", format="newick")

# Write one "name<TAB>length" row per node, visiting every node in the tree.
with open("backbone_tree_analysis_file_032915.txt", "w") as output:
    for node in backbone_tree.traverse():
        output.write(str(node.name) + "\t" + str(node.length) + "\n")

Small output from analysis file (first column is node.name, 2nd is node.length):

None None
0.788 0.00128
0.806 0.00053
0.981 0.00055
AY665783.1.1694 0.00137
0.044 0.00055
AY657010.1.1771 0.00131
AF426952.1.1784 0.00618
0.737 0.00064
0.973 0.00054
0.968 0.00055
AY771602.1.1776 0.00066
0.492 0.00055
0.882 0.00132

FastTree warnings need to be handled differently

Currently ghost-tree scaffold hybrid-tree runs FastTree which in turn generates an error for any non-nucleotide character, for example: Ignored unknown character K (seen 2 times). This means that the terminal is spammed with warnings. Running this command in an IPython notebook will cause the notebook to crash.

This could be dealt with either by a script that checks for invalid characters prior to running ghost-tree scaffold hybrid-tree, so that ghost-tree scaffold hybrid-tree would fail if any invalid characters were present, or by a single warning that says "non-nucleotide characters were detected and ignored".
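
A minimal sketch of the single-warning idea, assuming the sequences are available as FASTA lines. The allowed alphabet below is an assumption made for illustration (it is not taken from FastTree's documentation), and the function names are hypothetical rather than part of ghost-tree.

```python
import warnings
from collections import Counter

# Assumption: characters treated as acceptable. Adjust to match the tool in use.
ALLOWED = frozenset("ACGTN-")


def count_unexpected_characters(fasta_lines, allowed=ALLOWED):
    """Count sequence characters that are not in the allowed alphabet."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        counts.update(ch for ch in line.upper() if ch not in allowed)
    return counts


def summarize(counts):
    """Build one summary message, or an empty string if nothing was found."""
    if not counts:
        return ""
    details = ", ".join(f"{ch} (seen {n} times)" for ch, n in sorted(counts.items()))
    return f"non-nucleotide characters were detected: {details}"


def warn_once(fasta_lines):
    """Emit a single warning instead of one message per character."""
    message = summarize(count_unexpected_characters(fasta_lines))
    if message:
        warnings.warn(message)
    return message
```

The same counts could instead be used to abort before FastTree runs, which corresponds to the first option described above.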

update to scikit-bio 0.2.3

@JTFouquier scikit-bio 0.2.3 went live today, so I recommend updating to this latest version by running pip install scikit-bio --upgrade. setup.py also needs to be updated to depend on scikit-bio >= 0.2.3. This release included bug fixes related to #9. Please see the release notes for more details and let me know if you have any issues with updating. Thanks!

improve most_common_genus complexity (ghosttree.scaffold)

Suggestion from D.M.:

line 126, which determines most_common_genus is doing a list lookup for each name, so the runtime of it is O(N * M) where N is the number of entries in otu_genus_list and M is the number of unique names in that list. This can be reduced to around O(N + M) by casting the list to a Counter, and then max'ing the items.
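
A minimal sketch of the Counter approach suggested above. The function name mirrors the variable named in the issue, but the body is an illustration under that assumption, not the code at line 126.

```python
from collections import Counter


def most_common_genus(otu_genus_list):
    """Return the genus that occurs most often in otu_genus_list."""
    counts = Counter(otu_genus_list)  # one pass over the N entries
    # One pass over the M unique names to find the largest count.
    genus, _ = max(counts.items(), key=lambda item: item[1])
    return genus
```

With ties, max returns the first name that reached the top count. An empty list raises ValueError, so callers may want to guard against that case.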

graft unrooted extension trees as "rooted at midpoint"

Extension trees are currently not rooted at midpoint before they get grafted onto the foundation tree.

From Greg:

  • Jennifer, do you root the extension trees? If not, that is probably something that we should add. Since we're using the tree for diversity calculations, I think midpoint rooting (TreeNode.root_at_midpoint) should suffice for this - you'd just need to add that before grafting. We could probably also get @johnchase to do this instead if you're too tied up with the new job - just let me know.

add license

Need license file at root of repo and in setup.py (in setup function call and PyPI classifiers).

ghosttree.scaffold minor revision

Suggestions from D.M.

  • should use os.mkdir
  • use of globals is probably not necessary and adds complexity
  • some blocks of code here are not immediately clear about what they are doing.

fix coverage

Hmm, adding subprocess.Popen has increased the number of lines that cannot be "covered" by unit testing (muscle and fasttree). Coverage is ~70% now.
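
One common way to cover lines that call external tools is to patch subprocess.Popen in the test, so the code path runs without the binary being installed. The sketch below assumes a hypothetical wrapper named run_fasttree and an assumed command line; neither is taken from ghost-tree itself.

```python
import subprocess
from unittest import mock


def run_fasttree(alignment_path):
    """Hypothetical wrapper: run FastTree on a nucleotide alignment."""
    process = subprocess.Popen(
        ["FastTree", "-nt", alignment_path],  # assumed invocation
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    stdout, _stderr = process.communicate()
    return stdout


def test_run_fasttree_without_binary():
    # Stand-in process object whose communicate() returns canned output.
    fake_process = mock.Mock()
    fake_process.communicate.return_value = ("(a:0.1,b:0.2);\n", "")
    with mock.patch("subprocess.Popen", return_value=fake_process) as popen:
        tree = run_fasttree("aln.fasta")
    assert tree == "(a:0.1,b:0.2);\n"
    popen.assert_called_once()
    assert popen.call_args[0][0][0] == "FastTree"
```

This only checks that the wrapper builds the command and handles the output. It does not verify FastTree or muscle themselves.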

Add warning into the docs about potential for clustered sequences to be mislabeled

This shouldn't be an issue in practice as the emphasis is tree structure and branch length. However, the following situation is at least possible, and could lead to sequences in the ghost tree having a different taxon label than what they have in the reference. As such, it might be a good idea to include a warning about using a taxonomy derived off of the ghost-tree.

If OTU A has 10 sequences, 6 of which are labeled FOO and 4 BAR, and OTU B has 1000 sequences, 501 of which are labeled FOO and 499 BAZ, the resulting OTU will represent 1010 sequences, of which only 507 are FOO.
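
To make the arithmetic concrete, the per-OTU label counts stated in the example can be tallied with a Counter. This is only an illustrative sketch; the variable names are hypothetical and not part of ghost-tree.

```python
from collections import Counter

# Label counts exactly as stated in the example.
otu_a = Counter({"FOO": 6, "BAR": 4})
otu_b = Counter({"FOO": 501, "BAZ": 499})

merged = otu_a + otu_b                   # label counts of the combined OTU
total = sum(merged.values())             # number of sequences represented
label, count = merged.most_common(1)[0]  # label the combined OTU would carry
other = total - count                    # sequences whose own label differs

print(f"{label}: {count} of {total} sequences; {other} carry a different label")
```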

root foundation tree by midpoint

Extensions/tips have already been rooted, but foundation tree has not. Early results show that this doesn't impact analysis much (ANOSIMS + PCoAs), but rooting is necessary.

explore scaling of extension tree branch lengths relative to foundation tree

One thing we could try here would be to add a multiplicative factor to the CLI that gives users the ability to multiply branches in the extension tree by that value (though I realize that might be overly simplistic, and I'm open to other suggestions). This value would need to always be greater than zero, and a value of 1 would correspond to no scaling. For the paper, we can then test varying this parameter to determine if it affects downstream diversity metrics in a meaningful way (if the resulting UniFrac distance matrices are highly correlated by Mantel r, then it doesn't), and if so experiment with it to optimize our detection of the small and large effect sizes. Thoughts?
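
A rough sketch of the scaling step described above. The small Node class is a stand-in written for this example (it only mimics the length attribute and traverse() method that the issues above use on skbio's TreeNode), and scale_branch_lengths is a hypothetical name, not an existing ghost-tree function.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Stand-in tree node with a name, a branch length, and children."""
    name: Optional[str] = None
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)

    def traverse(self):
        """Yield this node and then every descendant."""
        yield self
        for child in self.children:
            yield from child.traverse()


def scale_branch_lengths(tree, factor):
    """Multiply every defined branch length in the tree by factor."""
    if factor <= 0:
        raise ValueError("scaling factor must be greater than zero")
    for node in tree.traverse():
        if node.length is not None:  # e.g. the root has no length
            node.length *= factor
    return tree
```

A factor of 1 leaves the tree unchanged, which matches the "no scaling" default proposed above.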

Fix Readme

The readme is an rst file, but there is html in it. I suspect this was formerly a markdown document.

Improve log file for ghost-tree

The log file needs to 1) clarify which FastTree errors refer to the foundation tree, 2) clarify why there are genera without errors, and 3) improve spacing/return lines.

add install instructions

These should also note what external tools are required, their versions, website URL, and where/when they are required.

analyze the effect of grafting trees in the wrong place

From @rob-knight:

it might be worth testing how bad it is if you graft onto the wrong name: with unifrac it’s probably not that bad so you might consider introducing some error to simulate what happens in the case where you make the wrong decision on a polyphyletic taxon?

add to existing script documentation

The current script documentation is pretty sparse. It'd be useful to have (at a minimum) a description for each (sub)command and any arguments or options that it accepts.

Add same filter step for extensions as foundation tree

Apply the high-entropy and gap-position filter to the extension trees prior to extending... @gregcaporaso I was under the impression that this step was important specifically for trees with excessively long branches and many sequences (i.e. necessary for the foundation tree and not the extension trees). I can add this step later. Wondering how much work it would be to do the automated analyses that we discussed at some point. I'm happy maintaining and improving ghost-tree, but the tediousness of redoing these analyses has been extremely intense. We should discuss this further at some point. Thanks!

Click makes options wrap by single characters; options are difficult to read

Click works great, but it still wraps the command options by single characters. Sometimes there are 4 to 5 options in ghost-tree, making reading options very difficult. I’ve asked Click how the bug fix is going, and they have a workaround but haven't accepted it as a PR. Can ghost-tree be packaged with a modified version of Click (not released by Click)?

Click Issue:
pallets/click#231 (they gave me directions on how to implement the PR, but I'm not sure it's worth trying due to future packaging of ghost-tree)
Click PR:
pallets/click#240

According to @gregcaporaso via previous email (3/26/15):
"it's better to just leave as-is than try to build a modified version of click"

fix readme formatting

The readme applies to the code under the legacy directory. This content should be moved to a readme under legacy.

ghost-tree filter-alignment-positions works, but unit testing is challenging...

@ebolyen, @jairideout, @johnchase: we worked on filter.py (filter-alignment-positions) in AZ, but I just moved it into 3 functions and got rid of the workaround (aln = aln.omit_gap_positions(1.0 - np.finfo(float).eps)) after skbio was updated.

https://github.com/JTFouquier/ghost-tree/blob/master/ghosttree/filter.py

@ebolyen said it would be difficult to test. o. m. g.

Is it safe to say that enough of this script uses skbio that I don't really need to unit test it? :) If I need to test it, any tips? I have tried unit testing this so many times.

Thanks!

Question about broken build on ghost-tree due to Flake8 error

And I broke the build tonight...but fixed it thanks to Travis. My question is why did Flake8 not show an error locally for setup.py, but then when I pushed a commit Travis was not happy after checking the build (I had no reason whatsoever to mess with the setup.py file until it gave the error on Travis). It doesn't look like there were any Flake8 updates... any thoughts? I'm just curious...if you don't know, just ignore this..... 😱 @jairideout @ebolyen

Location to host finished ghost-tree .nwk files for users to download

Is there a more professional website for hosting the finalized .nwk ghost-tree files than Google Drive or Dropbox? It would be easy for me to make a variety of the .nwk trees (combinations from different versions of UNITE/SILVA) and have them available. For example, the UNITE DB for the .nwk I used was from a slightly older version of UNITE because that was what I used for analyzing my original ITS data a few years ago now.

Thanks!

@jairideout and @gregcaporaso:

Improve code complexity in ghosttree.scaffold

Suggestion from D.M.:
line 102, which determines which sequences to write out, is performing a list lookup. Each lookup can be reduced to O(1). It is also not using the skbio fasta formatter.
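
A minimal sketch of the constant-time lookup, assuming the sequences are (id, sequence) pairs and the wanted identifiers start out as a list. The function and argument names are hypothetical and do not reflect the code at line 102.

```python
def select_sequences(records, wanted_ids):
    """Keep only the records whose identifier is in wanted_ids."""
    wanted = set(wanted_ids)  # built once; each membership test is then O(1) on average
    return [(seq_id, seq) for seq_id, seq in records if seq_id in wanted]
```

Converting the list to a set once keeps the overall cost proportional to the number of records plus the number of wanted identifiers, rather than their product.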
