gatb / mindthegap

MindTheGap is a SV caller for short read sequencing data dedicated to insertion variants (all sizes and types). It can also be used as a local assembly tool.

License: GNU Affero General Public License v3.0

CMake 1.85% C++ 74.62% Python 15.90% Shell 7.12% Dockerfile 0.51%
bioinformatics genomics gatb debruijn-graph structural-variants

mindthegap's Introduction

MindTheGap

What is MindTheGap?

MindTheGap performs detection and assembly of DNA insertion variants in NGS read datasets with respect to a reference genome. It is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. It takes as input a set of reads and a reference genome. It outputs two sets of FASTA sequences: one is the set of breakpoints of detected insertion sites, the other is the set of assembled insertions for each breakpoint.

New! MindTheGap can also be used as a genome assembly finishing tool: it can fill the gaps between a set of input contigs without any prior knowledge of their relative order and orientation. It outputs the results in a GFA file. It is notably integrated as an essential step in the targeted assembly tool MinYS (MineYourSymbiont in metagenomics datasets, see https://github.com/cguyomar/MinYS).

MindTheGap is a Genscale tool, built upon the GATB C++ library, and developed by:

  • Claire Lemaitre
  • Cervin Guyomar
  • Wesley Delage
  • Guillaume Rizk
  • Former developers: Rayan Chikhi, Pierre Marijon.

Installation instructions

Requirements

CMake 3.1+; see http://www.cmake.org/cmake/resources/software.html

C++11-capable compiler (e.g. gcc 4.7+, clang 3.5+, Apple/clang 6.0+)

Getting the latest source code with git

# get a local copy of MindTheGap source code
git clone --recursive https://github.com/GATB/MindTheGap.git

# compile the code
cd MindTheGap
sh INSTALL
# the binary file is located in directory build/bin/
./build/bin/MindTheGap -help

Note: when updating your local repository with git pull, if you see that thirdparty/gatb-core has changed, you also have to run: git submodule update.

Installing a stable release

Retrieve a binary archive file from one of the official MindTheGap releases (see the "Releases" tab on the GitHub web page); the file name is MindTheGap-vX.Y.Z-bin-Linux.tar.gz (for Linux) or MindTheGap-vX.Y.Z-bin-Darwin.tar.gz (for macOS).

tar -zxf MindTheGap-vX.Y.Z-bin-Darwin.tar.gz
cd MindTheGap-vX.Y.Z-bin-Darwin
chmod u+x bin/MindTheGap
./bin/MindTheGap -help

If the binary does not run properly on your system, consider installing MindTheGap from its source code. Retrieve the source archive file MindTheGap-vX.Y.Z-Source.tar.gz.

tar -zxf MindTheGap-vX.Y.Z-Source.tar.gz
cd MindTheGap-vX.Y.Z-Source
sh INSTALL
# the binary file is located in directory build/bin/
./build/bin/MindTheGap -help

Using conda or docker

MindTheGap is also distributed as a Bioconda package:

conda install -c bioconda mindthegap

Or pull the Docker image of MindTheGap (warning: the image still needs to be updated to the latest releases):

docker pull clemaitr/mindthegap

Small run example

MindTheGap find -in data/reads_r1.fastq,data/reads_r2.fastq -ref data/reference.fasta -out example
MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example

USER MANUAL

Description

MindTheGap performs integrated detection and assembly of genomic insertion variants in NGS read datasets with respect to a reference genome. It is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome.

Alternatively, since release 2.1.0, MindTheGap can also be used as a genome assembly finishing tool. It is integrated as an essential step in the targeted assembly tool MinYS (MineYourSymbiont in metagenomics datasets). It is also part of a gap-filling pipeline dedicated to linked-read data (10X Genomics): MTG-link.

Insertion variant detection

It takes as input a set of reads and a reference genome. Its main output is a VCF file, giving for each insertion variant, its insertion site location on the reference genome, a single insertion sequence or a set of candidate insertion sequences (when there are assembly ambiguities), and its genotype in the sample.

For a detailed user manual specific to insertion variants see doc/MindTheGap_insertion_caller.md.
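
As a quick sanity check on the VCF output described above, the number of variant records can be counted with standard tools. The sketch below relies only on the general VCF convention that header lines start with `#`; the function name `count_vcf_records` and the file name in the usage comment are illustrative and not part of MindTheGap.

```shell
# Count the variant records in a VCF file.
# Header lines start with '#'; every other non-empty line is one record.
count_vcf_records() {
    awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical usage, after running the fill module with "-out example":
#   count_vcf_records example.insertions.vcf
```

The count only tells how many records were written; it says nothing about whether each insertion was assembled without ambiguity.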

Genome assembly gap-filling (new feature)

When given a set of reads and a set of contigs as input, MindTheGap tries to fill the gaps between all pairs of contigs by de novo local assembly, without any prior knowledge of their relative order and orientation. It outputs the results in a GFA file.

For a detailed user manual specific to contig gap-filling see doc/MindTheGap_assembly.md.
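
To get a first impression of a gap-filling result, the GFA file can be summarised with a few lines of awk. This is a minimal sketch assuming the standard GFA record types, where lines starting with `S` are segments (here, contigs and gap-fill sequences) and lines starting with `L` are links between them; the function name `gfa_summary` is made up for this example.

```shell
# Summarise a GFA file: "S" lines are segments (nodes), "L" lines are links (arcs).
gfa_summary() {
    awk -F '\t' '
        $1 == "S" { segments++ }
        $1 == "L" { links++ }
        END { printf "%d segments, %d links\n", segments, links }
    ' "$1"
}

# Hypothetical usage, after a contig gap-filling run with "-out contig_example":
#   gfa_summary contig_example.gfa
```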

Performance

MindTheGap performs de novo assembly using the GATB C++ library and is inspired by algorithms from Minia. Hence, the computational resources required to run MindTheGap are significantly lower than those of other assemblers (for instance, it uses less than 6 GB of main memory to analyze a full human NGS dataset).

For more details on the method and some recent results, see the web page.

Usage and examples

MindTheGap is composed of two main modules: breakpoint detection (find module) and local assembly of insertions or gaps (fill module). Both steps are implemented in a single executable, MindTheGap, and can be run independently by specifying the module name as follows:

MindTheGap <module> [module options] 
  1. Basic command lines

     #Find module:
     MindTheGap find (-in <reads.fq> | -graph <graph.h5>) -ref <reference.fa> [options]
     #To get help:
     MindTheGap find -help
     
     #Fill module:
     MindTheGap fill (-in <reads.fq> | -graph <graph.h5>) (-bkpt <breakpoints.fa> | -contig <contigs.fa>) [options]
     #To get help:
     MindTheGap fill -help
    
  2. Examples

    These examples can be run with the small datasets in directory data/

    Example for insertion variant calling:

     #find
     build/bin/MindTheGap find -in data/reads_r1.fastq,data/reads_r2.fastq -ref data/reference.fasta -out example
     # 3 files are generated: 
     #   example.h5 (de bruijn graph), 
     #   example.othervariants.vcf (SNPs and deletion variants), 
     #   example.breakpoints (breakpoints of insertion variants).
     
     #fill
     build/bin/MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example
     # 3 files are generated:
     #   example.insertions.fasta (insertion sequences)
     #   example.insertions.vcf (insertion variants)
     #   example.info.txt (log file)
    

    Example for gap-filling between contigs:

    build/bin/MindTheGap fill -in data/contig-reads.fasta.gz -contig data/contigs.fasta -abundance-min 3 -out contig_example
    # 4 files are generated
    #   contig_example.h5 (de bruijn graph)
    #   contig_example.insertions.fasta (gap-filling sequences)
    #   contig_example.gfa (genome graph)
    #   contig_example.info.txt (log file)
    

    The usage of the fill module differs slightly depending on the type of gap-filling: assembling insertion variants (using the -bkpt option with a breakpoint file) or gap-filling between contigs (using the -contig option with a contig fasta file).
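
The two-step insertion calling example above (find, then fill on the files that find produced) can be wrapped in a small helper. The sketch below only prints the two command lines for a given output prefix, so it can be inspected before anything is run; the function name `mtg_insertion_commands` is hypothetical, and the options are exactly those used in the example.

```shell
# Print the two commands of the insertion-calling workflow (find, then fill)
# for a given pair of read files, a reference and an output prefix.
# Nothing is executed here; the commands are only written to standard output.
mtg_insertion_commands() {
    reads1=$1 reads2=$2 ref=$3 prefix=$4
    printf 'MindTheGap find -in %s,%s -ref %s -out %s\n' \
        "$reads1" "$reads2" "$ref" "$prefix"
    # fill reuses the graph and breakpoint files written by find under the same prefix
    printf 'MindTheGap fill -graph %s.h5 -bkpt %s.breakpoints -out %s\n' \
        "$prefix" "$prefix" "$prefix"
}

# Usage with the small dataset in data/:
#   mtg_insertion_commands data/reads_r1.fastq data/reads_r2.fastq data/reference.fasta example
```

Piping the output to `sh` would run both steps in order; this only works as written for paths without spaces or shell metacharacters.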

Details

  1. Input sequencing read data

    For both modules, read dataset(s) are first indexed in a De Bruijn graph. The input format of read dataset(s) is either the read files themselves (option -in), or the already computed de bruijn graph in hdf5 format (.h5) (option -graph).
    NOTE: options -in and -graph are mutually exclusive, and one of these is mandatory.

    If the input is composed of several read files, they can be provided as a comma-separated list of file paths or as a "file of files" (fof), that is, a text file containing the path to one read file on each line. All read files will be treated as if concatenated in a single sample. Read files can be in fasta or fastq format, optionally gzipped.

  2. de Bruijn graph creation options

    In addition to the input read set(s), the de Bruijn graph creation is controlled by the following parameters:

    • -kmer-size: the k-mer size [default '31']. By default, the largest kmer-size allowed is 128. To use k>128, you will need to re-compile MindTheGap as follows:

      cd build/
      cmake -DKSIZE_LIST="32 64 96 256" ..
      make
      

      To go back to default, replace 256 by 128. Note that increasing the range between two consecutive kmer-sizes in the list can have an impact on the size of the output h5 files (but none on the results).

    • -abundance-min: the minimal abundance threshold; k-mers with fewer than this number of occurrences are discarded from the graph [default 'auto', i.e. automatically inferred from the dataset].

    • -abundance-max: the maximal abundance threshold; k-mers with more than this number of occurrences are discarded from the graph [default '2147483647', i.e. no limit].

  3. Computational resources options

    Additional options are related to computational runtime and memory:

    • -nb-cores: number of cores to be used for computation [default '0', i.e. all available cores will be used].
    • -max-memory: maximum RAM for the graph creation (in MBytes) [default '2000']. Increasing the memory will speed up the graph creation phase.
    • -max-disk: maximum usable disk space for the graph creation (in MBytes) [default '0', i.e. automatically set]. K-mers are counted by writing temporary files to disk; increasing the usable disk space speeds up the counting.
  4. MindTheGap Output

    All the output files are prefixed either by a default name: "MindTheGap_Expe-[date:YY:MM:DD-HH:mm]" or by a user defined prefix (option -out of MindTheGap).

    The main result files are output by the fill module. These are:

    • an insertion variant file (.insertions.vcf) in vcf format, in the case of insertion variant detection (for insertions >2 bp).

    • an assembly graph file (.gfa) in GFA format, in the case of contig gap-filling. It contains the original contigs and the obtained gap-fill sequences (nodes of the graph), together with their overlapping relationships (arcs of the graph).

    Additional output files are:

    • a graph file (.h5), output by both MindTheGap modules. This is a binary file containing the de Bruijn graph data structure. To obtain information stored in it, you can use the utility program dbginfo located in your bin directory or in ext/gatb-core/bin/.

    • Files output specifically by MindTheGap find:

      • a breakpoint file (.breakpoints) in fasta format.

      • a variant file (.othervariants.vcf) in vcf format. It contains SNPs, deletions and very small insertions (1-2 bp).

    • Files output specifically by MindTheGap fill:

      • a sequence file (.insertions.fasta) in fasta format. It contains the inserted sequences (for insertions >2 bp) or contig gap-fills that were successfully assembled.

      • a log file (.info.txt), a tabular file with some information about the filling process for each breakpoint/gap-fill.

      • with option -extend, an additional sequence file (.extensions.fasta) in fasta format. It contains sequence extensions for failed insertion or gap-filling assemblies, i.e. when the target k-mer was not found, the first contig immediately after the source k-mer is output.
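
The output files listed above all share the prefix given with -out, so a script can check that a run produced what it should. The following is a rough sketch: the mode names (`find`, `fill-bkpt`, `fill-contig`) and both function names are invented for this example, and the suffixes are copied from the list above. The .h5 file that fill also writes when reads are given with -in, and the .extensions.fasta file written with -extend, are deliberately left out.

```shell
# List the output files described above for a given prefix and mode.
# Mode names are chosen for this sketch: find, fill-bkpt, fill-contig.
mtg_expected_outputs() {
    prefix=$1 mode=$2
    case $mode in
        find)        set -- h5 othervariants.vcf breakpoints ;;
        fill-bkpt)   set -- insertions.fasta insertions.vcf info.txt ;;
        fill-contig) set -- insertions.fasta gfa info.txt ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    for suffix in "$@"; do
        printf '%s.%s\n' "$prefix" "$suffix"
    done
}

# Report which of the expected files do not exist (paths relative to the current directory).
mtg_missing_outputs() {
    mtg_expected_outputs "$1" "$2" | while IFS= read -r f; do
        [ -e "$f" ] || printf 'missing: %s\n' "$f"
    done
}

# Hypothetical usage after "MindTheGap find ... -out example":
#   mtg_missing_outputs example find
```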

Other optional parameters and details on input and output file formats are given in doc/MindTheGap_insertion_caller.md and doc/MindTheGap_assembly.md, depending on the usage.
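
As described under "Input sequencing read data", several read files can be passed to -in either as a comma-separated list or as a file of files. The helpers below sketch both forms; the names `make_fof` and `join_reads` and the file name `reads.fof` are assumptions made for illustration. Whether relative paths inside a file of files are resolved against the working directory is not stated here, so absolute paths are the cautious choice.

```shell
# Write a "file of files": one read-file path per line.
# Usage: make_fof <output.fof> <read file>...
make_fof() {
    fof=$1
    shift
    : > "$fof"                      # create or truncate the list
    for reads in "$@"; do
        printf '%s\n' "$reads" >> "$fof"
    done
}

# Join read-file paths with commas, the other form accepted by -in.
join_reads() {
    ( IFS=,; printf '%s\n' "$*" )   # subshell keeps the IFS change local
}

# Hypothetical usage:
#   make_fof reads.fof /path/to/reads_r1.fastq /path/to/reads_r2.fastq
#   MindTheGap find -in reads.fof -ref data/reference.fasta -out example
```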

Utility programs

Additional utility programs can be found either in your bin/ directory or in ext/gatb-core/bin/:

  • dbginfo: to get information about a graph stored in a .h5 file
  • dbgh5: to build a graph from read set(s) and obtain a .h5 file
  • h5dump: to extract data stored in a .h5 file

Reference

If you use MindTheGap, please cite:

MindTheGap: integrated detection and assembly of short and long insertions. Guillaume Rizk, Anaïs Gouin, Rayan Chikhi and Claire Lemaitre. Bioinformatics 2014 30(24):3451-3457. http://bioinformatics.oxfordjournals.org/content/30/24/3451

Web page with some updated results.

MindTheGap was also evaluated in a recent benchmark exploring many different genomic features (size, nature, repeat context, junctional homology at breakpoints) of human insertion variants. Among other tested SV callers, MindTheGap was the only tool able to output sequence-resolved insertions for many types of insertions. Read more: Towards a better understanding of the low recall of insertion variants with short-read based variant callers. Delage W, Thevenon J, Lemaitre C. BMC Genomics 2020, 21(1):762.

Contact

To contact a developer, request help, or for any feedback on MindTheGap, please use the issue form of github: https://github.com/GATB/MindTheGap/issues

You can see all issues concerning MindTheGap here and GATB here.

If you do not have any github account, you can also send an email to claire dot lemaitre at inria dot fr

mindthegap's People

Contributors

cdeltel, cguyomar, clemaitre, genscale-admin, natir, rchikhi, rizkg, wesde

mindthegap's Issues

installation issue

When I install MindTheGap it produces " /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found", but I do not have permission to install that library on the server. Is there any way to move forward?

Extremely Large Run-Time in 'Contig-Fill' Mode

Hi, I'm running MindTheGap version 2.2.2 in the contig gap-filling mode. I gave it 28 threads to run off of and started it on May 27. I logged into my computer to check for any updates, and it estimated a remaining time of 115,085 minutes (just under 80 days). Is this what should be expected?

As a note - I am threading this process, not parallelizing it.

recommended settings for nanopore reads

I'm using MindTheGap find to find a 1600 bp insertion inside E coli using oxford nanopore reads. The program runs quickly and reports numerous small insertions and deletions, but the expected longer insertion is not reported.

I know that long and noisy nanopore reads are really different than the short and precise Illumina reads, and I was wondering if you have recommended settings that could help track down this longer insertion.

Thank you so much!

Here is my current command: MindTheGap find -in seqs.fastq.gz -ref ecoli.fasta.gz

./MindTheGap 'fill' takes too long time for WGS data

Hi,
I ran MindTheGap on whole genome sequencing data (30x, paired-end 101 bp reads). The 'find' part ended in a reasonable time. I ran the 'fill' part with a command similar to the following:

./MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example -max-memory 500000 -max-disk 1000000 -nb-cores 84

I increased the -max-memory, -max-disk, and -nb-cores just to speed up the process (The machine has 96 cores (I did not want to use all of the cores), 1TB memory, and more than 2TB disk space).

After ~4,5 - 5 hours, I get this message as time estimate:
[Filling breakpoints ] 1.03 % elapsed: 288 min 1 sec remaining: 27789 min 39 sec

which makes 19 more days! Am I doing something wrong? How can I speed up the 'fill' function?

Thank you very much for your help!

result issue

I have a problem with the result of MindTheGap.
I simulated 1000 variants in chr15.fa, including 524 insertions and 476 deletions, with SURVIVOR and ART. I got the result with MindTheGap find and fill, just as shown in the README:
MindTheGap find -in pair-end1.fq,pair-end2.fq -ref ../chr15/chr15.fa -out mindthegap
MindTheGap fill -graph mindthegap.h5 -bkpt mindthegap.breakpoints -out mind-result
Finally, I got 507 insertions in mind-result.insertion.vcf. The breakpoints shown in the vcf file are very different from the simulated data. Do the positions in the vcf file correspond to the simulated insertion breakpoints?
Did I miss something or do something wrong?
I hope you can reply soon, and I would be grateful for any clues.

ERROR: Unknown parameter '-contig'

I am running MindTheGap in 'contig gap-filling' mode and am attempting to run this command:

'''
MindTheGap fill -nb-cores ${task.cpus} -in 18-01_reads.fq -contig 18-01_assembly.fa -kmer-size 51 -abundance-min 5 -max-nodes 300 -max-length 50000 -out 18-01_gapFilled
'''

However, I keep receiving this error:
'''
ERROR: Unknown parameter '-contig'
ERROR: Unknown parameter '18-01_assembly.fa'
'''
All my reads were gzipped at first to conserve space, and I initially thought that the program could only handle unzipped files, so I gunzipped all of them and re-ran the command, but I am still receiving this same error. I am running this program on an HPC through a container that I downloaded from https://quay.io/repository/biocontainers/mindthegap?tab=tags.

Thank you for your help,
Ashley

Exception: Hash16: max size for this hash is 2^32, but ask for 33

Hello,

I ran MindTheGap on a high coverage (~200x) whole human genome data with a command like this:
./MindTheGap find -in S1_1.fastq.gz,S1_2.fastq.gz,S2_1.fastq.gz,S2_2.fastq.gz,...,S18_1.fastq.gz,S18_2.fastq.gz -ref human_g1k_v37.fasta -nb-cores 72 -max-memory 200000 -out SAMPLE

and got this exception after running for quite a while:

"EXCEPTION: Hash16: max size for this hash is 2^32, but ask for 33."

What might cause this problem? Did I misuse the computational parameters again? (The machine has 99 cores and 1TB memory.)

Thank you very much,
pinar

No license information for src/CircularBuffer.hpp

Hello,

My name is Shayan Doust, a contributor to the Debian-Med team. I have packaged MindTheGap; however, uploading efforts are unsuccessful as there is no licensing information in src/CircularBuffer.hpp.

Could you please clarify the licensing information within this file? Right now, it only contains a copyright holder but no licensing information. Ideally, include the licensing information in this file (just like the other source files) and generate a new release when you are ready. That way, I can simply integrate the new changes within the package and try for another upload to the Debian repository.

Kind regards,
Shayan Doust

Config Files

Hi, I'm running MindTheGap in a cluster but it consumes up the space in my /home directory by writing a lot of trashme_* files. I'm using MindTheGap for 3k rice genomes. Are these files necessary? Can we disable them?

EXCEPTION: Failure because of unhandled kmer size 128

Hi

Thanks a lot for this great tool.

I wanted to try kmer size 128, but it seems not to work:

EXCEPTION: Failure because of unhandled kmer size 128

-kmer-size 96 works fine (using my real data, I can find a known breakpoint - although the insertion is too large to be assembled (~9kb))

I tried both MindTheGap-v2.2.1-bin-Linux.tar.gz
and cloning from github.

Using Centos 7.

Thanks in advance for your help

Best wishes

Matt Shenton

Readme issue

-no-[type]: to disable the detection of certain types of variants.

It's not clear that e.g. " -no-snp" is an option, as [type] is never defined

Memory usage of fill module

Is there a way to limit the memory consumption of the fill module? I have samples which are allocating ~40-50GB of RAM (for a sample of only ~2 million 150 bp paired-end reads).
