
xander-hmmgs's Introduction

Using HMMgs:
    See the detailed step-by-step instructions in the Xander_assembler repository (https://github.com/rdpstaff/Xander_assembler)

Build - Build a De Bruijn graph from a set of reads
	java -jar hmmgs.jar build <read_file> <bloom_out> <kmerSize> <bloomSizeLog2> [cutoff = 2] [# hashCount = 4] [bitsetSizeLog2 = 30]
        read_file
             fasta or fastq files containing the reads to build the graph from 
        bloom_out
             file to write the bloom filter to 
        kmerSize
            should be a multiple of 3 (recommend 45, minimum 30, maximum 63)
        bloomSizeLog2
            the size of the bloom filter (or memory needed) is 2^bloomSizeLog2 bits; increase it if the predicted false positive rate is greater than 1%
        cutoff
            minimum number of times a kmer has to be observed in read_file to be included in the final bloom filter
        hashCount
            number of hash functions, recommend 4
        bitsetSizeLog2
            the size of one bitSet is 2^bitsetSizeLog2, recommend 30

    Bloom filter stats, such as the predicted false positive rate, are written to stdout.
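
    To get a feel for bloomSizeLog2 before running build, the memory figure above (2^bloomSizeLog2 bits) can be worked out by hand or with a short script. The sketch below also includes the textbook Bloom filter false positive estimate; that formula and the kmer count used in it are assumptions for illustration, not taken from the HMMgs source, so treat the rate printed by hmmgs build as the authoritative value.

```python
import math


def bloom_filter_bytes(bloom_size_log2: int) -> int:
    """Bytes needed to hold a filter of 2^bloom_size_log2 bits."""
    return (1 << bloom_size_log2) // 8


def estimated_false_positive_rate(bloom_size_log2: int,
                                  hash_count: int,
                                  distinct_kmers: int) -> float:
    """Textbook Bloom filter estimate: (1 - e^(-k*n/m))^k.

    k = hash_count, n = distinct_kmers (a guess you supply), m = bits.
    This is NOT taken from the HMMgs source; the rate that hmmgs build
    writes to stdout is the figure to trust.
    """
    bits = 1 << bloom_size_log2
    return (1.0 - math.exp(-hash_count * distinct_kmers / bits)) ** hash_count


if __name__ == "__main__":
    guessed_kmers = 500_000_000  # hypothetical number of distinct kmers
    for size_log2 in (30, 32, 34, 36):
        gib = bloom_filter_bytes(size_log2) / 2**30
        fpr = estimated_false_positive_rate(size_log2, 4, guessed_kmers)
        print(f"bloomSizeLog2={size_log2}: {gib:.3f} GiB, estimated FPR {fpr:.4%}")
```

    Each increment of bloomSizeLog2 doubles the memory, so it is usually worth stepping up one value at a time until the reported rate drops below 1%.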

Search - Perform local assembly starting at the given start points in a given de Bruijn Graph 
	output files <kmers>_nucl.fasta, _prot.fasta, search stats written to stdout
    java -jar hmmgs.jar search [-h] [-u] [-p <n_nodes>] <k> <limit_in_seconds> <bloom_filter> <for_hmm> <rev_hmm> <kmers>
        -u
            don't normalize the hmm input
        -p  n_nodes 
            prune the search if the score does not improve after n_nodes (default 20, set to 0 to disable pruning)
        k
            number of best local assemblies to return for each kmer
        limit_in_seconds
            time limit for individual searches (conservative suggestion = 100)
        bloom_filter
            bloom filter built using hmmgs build
        for_hmm, rev_hmm
            hidden markov models, HMMER3 format
        kmers
            starting points (can use KmerFilter's fast_kmer_filter to identify starting points)
        [#threads] experimental, suggested 1 (not thoroughly tested)

Merge - Merge the left and right contigs generated by hmmgs search
	java -jar hmmgs.jar merge [options] <hmm> <hmmgs_file> <nucl_contig>
        -a,--all                Generate all combinations for multiple paths for each starting kmer, instead of just the best
        -b,--min-bits <arg>     Minimum bits score
        -l,--min-length <arg>   Minimum length
        -o,--out <arg>          Write output to file instead of stdout

KmerFilter:
	fast_kmer_filter - search a set of reads against a set of reference sequences to identify starting points for assembly
	java -jar KmerFilter.jar fast_kmer_filter <kmerSize> <query_file> [name=]<ref_file> ...
        -a,--aligned              Build trie from aligned sequences
        -o,--out <arg>            Redirect output to file
        -T,--transl-table <arg>   Translation table to use when translating
                                  nucleotide to protein sequences
        -t,--threads <arg>        #Threads to use

         <kmerSize> kmer length, should be a multiple of 3 (recommend 45, minimum 30, maximum 63)
         <query_file> read file to search for starting points in (use the same fasta file used to build the De Bruijn Graph)
         [name=]<ref_file> one or more aligned reference files (aligned using the same HMM that will be used to search), each with an optional reference name (e.g. nifh=my_nifh_refs_aligned.fasta)
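
         Because the same kmerSize has to satisfy both hmmgs and fast_kmer_filter, it can help to check a value against the stated limits (multiple of 3, minimum 30, maximum 63) before starting a long run. A minimal sketch; the function name is hypothetical and not part of HMMgs or KmerFilter:

```python
from typing import List

MIN_KMER_SIZE = 30  # minimum stated in this README
MAX_KMER_SIZE = 63  # maximum stated in this README


def check_kmer_size(kmer_size: int) -> List[str]:
    """Return a list of problems with kmer_size; an empty list means it is usable."""
    problems = []
    if kmer_size % 3 != 0:
        problems.append(f"{kmer_size} is not a multiple of 3")
    if kmer_size < MIN_KMER_SIZE:
        problems.append(f"{kmer_size} is below the minimum of {MIN_KMER_SIZE}")
    if kmer_size > MAX_KMER_SIZE:
        problems.append(f"{kmer_size} is above the maximum of {MAX_KMER_SIZE}")
    return problems


if __name__ == "__main__":
    for k in (45, 31, 64):
        issues = check_kmer_size(k)
        print(k, "ok" if not issues else "; ".join(issues))
```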


Other uses:
     HMMgs can also be used to extract subgraphs from starting points instead of contigs to perform further analysis with (see edu.msu.cme.rdp.graph.GraphSearch)
     HMMgs can also be used to compute base coverage for contigs (generated by hmmgs or other programs) (see edu.msu.cme.rdp.graph.abundance.ReadKmerMapper and base_coverage.py)

NOTES:
     When using fast_kmer_filter to identify start points there are two things to be aware of.
       1. While the Bloom Filter Builder allows any k-size (hmmgs requires a k divisible by 3, however), fast_kmer_filter requires k <= 63
       2. fast_kmer_filter can search for the starting points of multiple genes at the same time (each gene requires a scan over the read file, so it is faster to do every gene at once). However, this means the output file is multiplexed and must be demultiplexed before it is used in hmmgs search. This can be done with the following command:
              grep 'gene_name' <multiplexed_starts_file> | cut -f2- > <demultiplexed_gene_start_points>
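
     The grep and cut step can also be done in a script, for example when several genes need to be split out of one multiplexed file. The sketch below mirrors that pipeline under stated assumptions: the file is tab-delimited, the gene name is matched as a plain substring (grep would treat it as a regular expression), and the function names and sample rows are hypothetical.

```python
from typing import Iterable, Iterator


def demultiplex_starts(lines: Iterable[str], gene_name: str) -> Iterator[str]:
    """Mimic: grep 'gene_name' | cut -f2-  on tab-delimited lines."""
    for line in lines:
        line = line.rstrip("\n")
        if gene_name not in line:
            continue  # grep step: keep only lines mentioning the gene
        fields = line.split("\t")
        # cut -f2- drops the first field; lines without a tab pass through whole
        yield "\t".join(fields[1:]) if len(fields) > 1 else line


def demultiplex_file(in_path: str, out_path: str, gene_name: str) -> int:
    """Write the start points for one gene to out_path; return the line count."""
    count = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for row in demultiplex_starts(src, gene_name):
            dst.write(row + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical rows; the real column layout comes from fast_kmer_filter.
    sample = ["nifh\tread1\tACGT\n", "rplb\tread2\tTTTT\n", "nifh\tread3\tGGGG\n"]
    for row in demultiplex_starts(sample, "nifh"):
        print(row)
```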

xander-hmmgs's People

Contributors

gunturus, wangqion
