Coder Social home page Coder Social logo

brentp / indelope Goto Github PK

View Code? Open in Web Editor NEW
39.0 9.0 1.0 136 KB

find large indels (in the blind spot between GATK/freebayes and SV callers)

License: MIT License

Nim 56.37% C 43.05% Shell 0.58%
genomics variant-calling k-mer-counting vcf nim-lang local-assembly genome-assembly

indelope's Introduction

indelope: find indels and SVs too small for structural variant callers and too large for GATK

indelope was started with the goal of increasing the diagnostic rate in exomes. To do this it must be:

  • fast : ~2.5 CPU-minutes per exome (25% slower than samtools view -c)
  • easy-to-use : goes from BAM to VCF in a single command.
  • novel : it does local assembly and then aligns assembled contigs to the genome to determine the event, and then does k-mer counting (not alignment) to genotype without k-mer tables.
  • accurate : because of the genotyping method, we know that called variants are not present in the reference.

These features will help ensure that it is actually used (fast, easy-to-use) and that it finds new and valid variation.

As of November 2017, indelope is working -- it finds large indels that are clearly valid by visual inspection that are missed by GATK/freebayes/lumpy.

As of November 2017, I am still tuning. Here is a look at the progress:

image

Note that while indelope is steadily improving, it still is not as good as scalpel. More improvements are coming soon.

indelope also works on whole genomes, but, for now, that is not the target use-case.

how it works

indelope sweeps over a single bam and finds regions that are likely to harbor indels--reads that have more than 1 cigar event and split-reads (work on split reads is in progress). As it finds these it increments a counter for the genomic position of the event. Upon finding a gap in coverage, it goes back, finds any previous position with sufficient evidence (this is a parameter) of an event, gathers reads that have been aligned across that genomic position (and unaligned reads from that region) and does assembly on those reads. It then aligns the assembled contigs to the genome using ksw2 and uses the CIGAR to determine the event as it's represented in the VCF. Any event will result in a novel k-mer not present in the reference genome; indelope gets the k-mer of the reference genome at the event and the novel k-mer of the alternate event. It again iterates through the reads that were originally aligned to the event and counts reference and alternate k-mers. Those counts are used for genotyping. Note that this reduces reference bias because we are aligning a contig (often >400 bases) to the genome and never re-aligning the actual reads.

As indelope sweeps across the genome, it keeps the reads for each chunk in memory. A chunk bound is defined by a gap in coverage; this occurs frequently enough that the memory use is negligible. Once a new chunk is reached, all events from the previous chunk are called and then those reads are discarded. This method, along with the assembly method make indelope extremely fast--given 2 BAM decompression threads, it can call variants in an exome in ~ 1 minute (2.5 CPU-minutes).

assembly

A read (contig) slides along another read (contig) to find the offset with the most matches. At each offset, if more than $n mismatches are found, the next offset is attempted. This is efficient enough that a random read to a random (non-matching) contig of length $N will incur ~ 1.25 * $N equality (char vs. char) tests.

Within each contig indelope tracks the number of reads supporting each base. Given a sufficient number of reads supporting a contig, it can account for sequencing errors with a simple voting scheme. That is: if contig a, position x has more than 7 supporting reads and contig b has fewer than 3 supporting reads (and we know that otherwise, a and b have no mismatches), we can vote to set the mismatch in b to the apparent match in a. This allows us to first create contigs allowing no mismatches within the reads and then to combine and extend contigs using this voting method.

installation and usage

get a binary from here and make sure that libhts.so is in your LD_LIBRARY_PATH

then run ./indelope -h for usage. recommended is:

indelope --min-event-len 5 --min-reads 5 $fasta $bam > $vcf

to do

  • somatic mode / filter mode. allow filtering on a set of k-mers from a parental genome (parent for mendelian context or normal for somatic variation).

  • use SA tag. (and possibly discordant reads)

see also

  • svaba does local assembly, but then genotypes by alignment to those assemblies. It is slower than indelope but it is an extremely useful tool and has a series of careful and insightful analyses in its paper. (highly recommend!!)

  • rufus does k-mer based variant detection; Andrew described to me the RUFUS assembly method that inspired the one now used in indelope.

  • lancet, scalpel, mate-clever, and prosic2 are all great tools that are similar in spirit that are worth checking out (of those, AFAICT, only scalpel has a focus on germ-line variation).

notes and TODO

need a better way to combine contigs

sometimes, can have 2 contigs, each of length ~ 80 and they overlap for 60 bases but cutoff is e.g. 65. Need a way to recover this as it happens a lot in low-coverage scenarios. maybe it can first combine, then trim (currently, it's trim, combine). This should also allow more permissive overlaps if the correction list is empty.

track a read/contig matches multiple contigs with the same match, mismatch count

CHM1/13 truth-set

https://www.ncbi.nlm.nih.gov/biosample?Db=biosample&DbFrom=bioproject&Cmd=Link&LinkName=bioproject_biosample&LinkReadableName=BioSample&ordinalpos=1&IdsFromResult=316945

~/.aspera/connect/bin/ascp -P33001 -QT -L- -l 1000M -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]:/vol1/ERA596/ERA596286/bam/CHM1_1.bam . ~/.aspera/connect/bin/ascp -P33001 -QT -L- -l 1000M -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]:/vol1/ERA596/ERA596286/bam/CHM13_1.bam .

ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/nstd137_Huddleston_et_al_2016/genotype/CHM1_final_genotypes.annotated.vcf.gz ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/nstd137_Huddleston_et_al_2016/genotype/CHM13_final_genotypes.annotated.vcf.gz ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/nstd137_Huddleston_et_al_2016/genotype/

contigs

min_overlap in contig::best_match should be a float between 0 and 1 that will make sure that at least that portion of the shortest contig overlaps the other.

indelope's People

Contributors

brentp avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

indelope's Issues

could not load: libhts.so, but in LD_LIBRARY_PATH

Hi,

My usual CentOS6 server has an outdated libc.so.6 so I have to use a more recent OS. Sysadmin told me this cannot be updated, unfortunately. Anyway, on the other server (CentOS7) I downloaded the binary and installed the libhts.so using conda install htslib.

Since find ~ -name libhts.so returns

/home/wdecoster/anaconda3/pkgs/htslib-1.6-0/lib/libhts.so
/home/wdecoster/anaconda3/lib/libhts.so

I changed the LD_LIBRARY_PATH to
/home/wdecoster/anaconda3/lib/:/home/wdecoster/anaconda3/pkgs/htslib-1.6-0/lib/

Yet, when executing ./indelope -h I get:

could not load: libhts.so
compile with -d:nimDebugDlOpen for more information

Did I mess up somewhere?
I can install nim and try compiling with the suggested -d:nimDebugDlOpen flag, or what do you suggest?

Installation on my Ubuntu17.10 desktop does work.

Cheers,
Wouter

More instructions to build using nim/nimble

If you get a chance to include more instructions on how to build using nim/nimble as there are no instructions and would like to know in detail how to get it to recognize the "import hts" when compiling src/indelope.nim

Binary for libc-2.12.so

Hi Brent,

I have same issue as #1

This is a bump to release a binary that will work with libc-2.12.so as would like to give it a go over the next few days on exomes at the Broad Institute.

I know you commented that you would like to get get issues relating to using the tool itself BUT you need to a provide binary that actually works first for people ๐Ÿ˜ฉ

Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.