topofind's Introduction

TopoFind

Finding tree topologies from sequence alignments.

Dependencies

iqtree 2.2.3.hmmster
R 4.1.2
devtools 2.4.3
MixtureModelHMM
AMAS

Installation

git clone https://github.com/fredjaya/topofind.git
cd topofind
conda env create -f env.yml
conda activate topofind
Rscript install_pacakges.R
  • Install IQ-TREE manually and add symlink to /bin

Usage

  • Infer the (best number of) topologies from a sequence alignment (test1.fa):
TopoFind.py -a data/test1.fa
  • Set the number of threads (e.g. 4) to use for IQ-TREE:
TopoFind.py -a data/test1.fa -nt 4
  • Show all available options:
TopoFind.py -h  
  • Run unit tests:
python3 -m unittest tests/*

topofind's People

Contributors

fredjaya

topofind's Issues

sh argument list too long at t=8

.command.sh: line 3: /home/frederickjaya/GitHub/rec-mast/bin/update_run_names.py: Argument list too long

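There are heaps of ways to partition by t=8, so the list of names passed to update_run_names.py exceeds the shell's argument limit. The easy fix is to set ulimit in the environment before running nextflow. The better fix is to output each new name in a new channel.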
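
Another option, separate from the two fixes above, is to stop passing names on the command line at all. The following is a minimal sketch, assuming the names can be supplied one per line through a file or stdin. `read_names` and `chunked` are hypothetical helpers, not functions from update_run_names.py.

```python
import io
from typing import Iterable, Iterator, List, TextIO


def read_names(stream: TextIO) -> List[str]:
    """Read one run name per line, skipping blank lines."""
    return [line.strip() for line in stream if line.strip()]


def chunked(names: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield names in batches of at most `size`, e.g. to keep each
    command line well under the OS argument-length limit."""
    batch: List[str] = []
    for name in names:
        batch.append(name)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Example: names arrive on a stream (a file or sys.stdin) instead of argv.
example = io.StringIO("run_a\nrun_b\n\nrun_c\n")
names = read_names(example)
batches = list(chunked(names, 2))
```

Reading from a stream keeps the command line short however many partitions exist at a given t.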

`t1_iqtree_per_split()` only makes one tree when passing `val prefix`

An error is thrown in concatenate_trees_for_mast() because only one tree exists.

This only occurs when passing val prefix:

process t1_iqtree_per_split {
    publishDir "${params.out}/${prefix}", mode: "copy"

    input:
        val prefix
        path splitted_aln
        val nthreads

    output:
        path '*.model.gz'
        path '*.treefile'
        path '*.log'
        path '*.iqtree'

    script:
    """
    iqtree2 -s ${splitted_aln} -pre ${splitted_aln.simpleName} -nt ${nthreads}
    """
}

workflow {

    t1_iqtree_per_split(params.prefix, split_aln.out[0].flatten(), params.nthreads)

}

Remake trees using MAST-assigned partitions instead of +R2 assignments

When splitting the alignment to b+1 blocks, each occurrence of block splitting is handled separately so that only t+1 trees are input to MAST per step.

When running MAST with t>=4 trees as input, there will be block regions generated from upstream splits that can be reused to avoid rebuilding trees.

However, the sites used to construct a single tree from a given block will be reassigned by the HMM after the MAST process, so the sites assigned to the block are likely to differ from the ones used to construct the tree.

I think trees should be reconstructed with the post-HMM site assignment, even though the same block is shared. I hope this makes sense... (for others and future me)
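
To make the idea concrete, here is a minimal sketch of rebuilding per-class sub-alignments from a post-HMM site assignment. It assumes the assignment is available as one integer class label per alignment column. `split_by_site_class` and the toy data are hypothetical, not part of the pipeline.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def split_by_site_class(
    alignment: Dict[str, str], site_classes: Sequence[int]
) -> Dict[int, Dict[str, str]]:
    """Build one sub-alignment per class from per-site class labels."""
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(site_classes)}:
        raise ValueError("every sequence must have one class label per site")

    # Column indices belonging to each class, in alignment order.
    columns: Dict[int, List[int]] = defaultdict(list)
    for index, site_class in enumerate(site_classes):
        columns[site_class].append(index)

    return {
        site_class: {
            name: "".join(seq[i] for i in indices)
            for name, seq in alignment.items()
        }
        for site_class, indices in columns.items()
    }


# Toy alignment and a made-up post-HMM assignment (class 1 or 2 per site).
aln = {"taxonA": "ACGTAC", "taxonB": "ACGAAT", "taxonC": "TCGAAT"}
post_hmm = [1, 1, 2, 2, 1, 2]
blocks = split_by_site_class(aln, post_hmm)
```

Each sub-alignment in `blocks` could then be written out and used to rebuild the tree for that class, so the tree and the sites assigned to it stay in agreement.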

BIC format error after t=5 mast runs

Occurs on the concat.fa and set2_paragone data. Might be a Slurm/scheduling issue.

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/u1070770/.conda/envs/rec-mast/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/u1070770/.conda/envs/rec-mast/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/u1070770/rec-mast/run.py", line 133, in mast
    bic=get_bic(iqtree_out_path)
  File "/home/u1070770/rec-mast/run.py", line 48, in get_bic
    return(float(bic))
ValueError: could not convert string to float: ''
"""

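The ValueError shows that get_bic() passed an empty string to float(), which suggests the BIC line was missing or incomplete in the report it read. Below is a minimal sketch of a more defensive parser. It assumes the report contains a line of the form "Bayesian information criterion (BIC) score: <number>". `parse_bic` is a hypothetical stand-in, not the get_bic() in run.py.

```python
import re
from typing import Optional

# Assumed report line, e.g.
# "Bayesian information criterion (BIC) score: 1234.5678"
BIC_PATTERN = re.compile(
    r"Bayesian information criterion \(BIC\) score:\s*"
    r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_bic(report_text: str) -> Optional[float]:
    """Return the BIC from report text, or None if no score is present."""
    match = BIC_PATTERN.search(report_text)
    return float(match.group(1)) if match else None


good = parse_bic("Bayesian information criterion (BIC) score: 5301.0367\n")
missing = parse_bic("")
truncated = parse_bic("Bayesian information criterion (BIC) score: \n")
```

Returning None instead of raising lets the caller decide whether to retry, skip the run, or fail with a message that names the offending file.
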
Add catch for lack of informative sites during tree construction

Tree construction fails when partitioned alignment has no informative sites i.e. partition is tiny. This isn't a bad thing, but a proper catch needs to be added to account for this.

Error:

Caused by:
  Process `iqtree_default (1)` terminated with an error exit status (2)

Command executed:

  iqtree2 -s test2_class_1-out.fas -pre test2_class_1-out -nt 1

Command exit status:
  2

Command error:
  ERROR: Only one state is observed in alignment

To reproduce:

nextflow run ~/GitHub/rec-mast/main.nf --aln_path test2.phy --out `pwd`
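
One possible catch is to inspect each partitioned alignment before passing it to IQ-TREE. The sketch below assumes a DNA alignment already loaded as a name-to-sequence mapping, and assumes the characters in `MISSING` should be treated as gaps or unknowns. The helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set

# Assumption: DNA data, with these characters treated as gap/unknown.
MISSING = set("-?.NnXx")


def observed_states(alignment: Dict[str, str]) -> Set[str]:
    """Distinct non-missing character states across the whole alignment."""
    return {
        char.upper()
        for seq in alignment.values()
        for char in seq
        if char not in MISSING
    }


def count_informative_sites(alignment: Dict[str, str]) -> int:
    """Count parsimony-informative columns: at least two states,
    each present in at least two sequences."""
    informative = 0
    for column in zip(*alignment.values()):
        counts = Counter(c.upper() for c in column if c not in MISSING)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


def can_build_tree(alignment: Dict[str, str]) -> bool:
    """False when only one state is observed, the case reported above."""
    return len(observed_states(alignment)) >= 2


tiny = {"a": "AAAA", "b": "AAAA", "c": "AA-A"}
usable = {"a": "ACGT", "b": "ACGA", "c": "TCGA", "d": "TCGT"}
```

A partition that fails `can_build_tree` could be skipped or merged with a neighbour rather than allowed to terminate the whole run.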

Account for duplicate topologies

When inputting identical topologies into MAST and using the fully unlinked model, sites will have different weights despite the topologies being the same. There is also no need to store the same topology again.
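
A minimal sketch of detecting duplicates follows, assuming trees are unrooted Newick strings with unquoted taxon names. It compares trees by their sets of non-trivial bipartitions, ignoring branch lengths and node labels. `topology_key` and `unique_topologies` are hypothetical helpers, and a tree library could replace the small parser.

```python
from typing import FrozenSet, List, Set, Tuple

Key = Tuple[FrozenSet[str], FrozenSet[FrozenSet[str]]]


def _parse_clades(newick: str) -> Tuple[List[FrozenSet[str]], FrozenSet[str]]:
    """Return the leaf set under every internal node, plus all leaf names."""
    clades: List[FrozenSet[str]] = []
    stack: List[Set[str]] = []
    leaves: Set[str] = set()
    i, n = 0, len(newick)
    while i < n:
        ch = newick[i]
        if ch == "(":
            stack.append(set())
            i += 1
        elif ch == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
            i += 1
            # Skip an optional internal node label or support value.
            while i < n and newick[i] not in ",():;":
                i += 1
        elif ch == ":":
            # Skip the branch length.
            i += 1
            while i < n and newick[i] not in ",();":
                i += 1
        elif ch in ",;" or ch.isspace():
            i += 1
        else:
            j = i
            while j < n and newick[j] not in ",():;":
                j += 1
            name = newick[i:j].strip()
            leaves.add(name)
            if stack:
                stack[-1].add(name)
            i = j
    return clades, frozenset(leaves)


def topology_key(newick: str) -> Key:
    """Leaf set plus non-trivial bipartitions, each stored as the side
    that does not contain a fixed reference taxon."""
    clades, leaves = _parse_clades(newick)
    if not leaves:
        raise ValueError("no taxa found in Newick string")
    reference = min(leaves)
    splits = set()
    for clade in clades:
        side = leaves - clade if reference in clade else clade
        if 1 < len(side) < len(leaves) - 1:
            splits.add(side)
    return leaves, frozenset(splits)


def unique_topologies(newicks: List[str]) -> List[str]:
    """Keep the first tree seen for each distinct topology."""
    seen: Set[Key] = set()
    kept: List[str] = []
    for nwk in newicks:
        key = topology_key(nwk)
        if key not in seen:
            seen.add(key)
            kept.append(nwk)
    return kept


trees = [
    "((A,B),(C,D));",
    "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);",  # same topology, different rooting
    "((A,C),(B,D));",
]
kept = unique_topologies(trees)
```

Here the second tree is dropped because it shares the AB|CD split with the first, while the third has a different split and is kept.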

Speed up program

Ideas to make the program finish faster.

Currently, only a single instance of either the split_aln() or mast() process is run at once. This is slow, but makes it easy to control resource usage and to debug. IQ-TREE threads can be defined through the Python CLI with -T.

  • Use multiprocessing.pool
  • Continue iterations with the best performing model (will only work when discarding duplicate topologies #14)

  • Benchmark the use of -T AUTO vs. an arbitrary number, e.g. -T 8

  • Depending on infrastructure, change --num_threads to be --total_threads across all processes; the same applies to memory usage

  • Just use mp.pool but perhaps add flag, so it only runs on a cluster
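
As a rough illustration of the pool and total-threads ideas above, the sketch below divides a total thread budget across concurrent jobs. It uses ThreadPool from multiprocessing.pool rather than a process pool, on the assumption that each job mostly waits on an external IQ-TREE process. `plan_workers`, `run_jobs` and `fake_mast` are hypothetical names, and `fake_mast` only builds a string instead of launching anything.

```python
from multiprocessing.pool import ThreadPool
from typing import Callable, List, Sequence, Tuple


def plan_workers(total_threads: int, n_jobs: int) -> Tuple[int, int]:
    """Return (n_workers, threads_each) so that
    n_workers * threads_each never exceeds total_threads."""
    if total_threads < 1 or n_jobs < 1:
        raise ValueError("total_threads and n_jobs must be at least 1")
    n_workers = min(n_jobs, total_threads)
    return n_workers, total_threads // n_workers


def run_jobs(
    job: Callable[[str, int], str], inputs: Sequence[str], total_threads: int
) -> List[str]:
    """Run `job(item, threads_each)` for every input, in input order."""
    n_workers, threads_each = plan_workers(total_threads, len(inputs))
    with ThreadPool(n_workers) as pool:
        return pool.starmap(job, [(item, threads_each) for item in inputs])


def fake_mast(alignment: str, threads: int) -> str:
    # Stand-in for launching IQ-TREE; returns the arguments it would use.
    return f"{alignment} -T {threads}"


results = run_jobs(fake_mast, ["a.fa", "b.fa", "c.fa"], total_threads=8)
```

With a budget of 8 threads and 3 jobs, each job gets 2 threads and 2 are left idle, which is one reason benchmarking against -T AUTO may be worthwhile.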
