bigdatagenomics / avocado
A Variant Caller, Distributed. Apache 2 licensed.
Home Page: http://bdgenomics.org/projects/avocado/
License: Apache License 2.0
#56 added a parametrized flanking sequence length to the KmerGraph. We need to extend our groupBy phase, which collects all the reads associated with a region, to also collect the reads contained in the flanking region. Additionally, we need a strategy for handling calls made inside the flanking region; a first approach could simply filter out calls that overlap the flank.
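A minimal sketch of that first filtering approach, assuming a region is a half-open [start, end) interval and the flank length is known; the names here are illustrative, not avocado API:

```scala
// Sketch only: keep a call if its position lands in the core of the region,
// i.e., outside the flanking sequence on either end. Assumes a half-open
// [regionStart, regionEnd) interval.
def insideCore(regionStart: Long, regionEnd: Long, flank: Long)(pos: Long): Boolean =
  pos >= regionStart + flank && pos < regionEnd - flank
```

Calls for which insideCore returns false would be dropped before genotypes are emitted.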
We should support a way to load multiple read files at once, and to define which samples are in each file. For the AlignedReadsInputStage, I propose that we allow the following input:
<SAMPLES>:<PATH>, ... , <SAMPLES>:<PATH>
We'd then load an RDD per sample, and aggregate across files as is necessary.
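A rough sketch of parsing the proposed format, assuming entries are comma-separated and (purely as an illustration, since the issue doesn't specify it) that multiple samples within one entry are semicolon-separated:

```scala
// Hypothetical parsed form of one <SAMPLES>:<PATH> entry; not an avocado type.
case class ReadFileSpec(samples: Seq[String], path: String)

def parseInputSpec(spec: String): Seq[ReadFileSpec] = {
  spec.split(",").map(_.trim).map { entry =>
    // Split only on the first ':' so paths containing ':' (e.g. hdfs://) survive.
    val idx = entry.indexOf(':')
    require(idx > 0, s"Malformed entry: $entry")
    ReadFileSpec(entry.take(idx).split(';').toSeq, entry.drop(idx + 1))
  }.toSeq
}
```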
A base set of integration tests (or expanded unit tests) would connect the ReadExplorer and BiallelicGenotyper from #96 together.
When we have very high coverage data, we run into floating point underflow in the biallelic genotyper. An easy way to fix this is by moving to log likelihoods.
See this paper: http://www.nature.com/nbt/journal/v31/n3/abs/nbt.2514.html
(and the supplementary information) for details.
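To illustrate the underflow with made-up numbers (not real likelihoods): multiplying many per-read probabilities underflows a Double, while summing their logs stays finite.

```scala
// Naive product of per-read likelihoods: underflows to 0.0 at high coverage.
def likelihoodProduct(perRead: Seq[Double]): Double =
  perRead.product

// The same quantity in log space: a sum of logs, which stays finite.
def logLikelihood(perRead: Seq[Double]): Double =
  perRead.map(math.log).sum
```

With, e.g., 500 reads each contributing a probability of 0.001, the direct product underflows to 0.0, while the log-space value is simply 500 * ln(0.001).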
HaplotypePair.perReadLikelihoods depends on the composing Haplotypes having their perReadLikelihoods set, which is not guaranteed.
Our current RDD caching strategy can lead to out-of-memory errors. We should revise it to more aggressively unpersist old data. We removed some caching in #70.
I am trying to understand how Avocado works. Is there a documentation page for how the code (the master branch) is laid out?
I am looking at a report available online (http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project8_report.pdf), and trying to map what I read there to the code in the repository. But it seems that things have somewhat diverged from the report.
For example, where do the assembly-based calls happen? There seems to be something in the algorithms/ folder, but nothing seems to call it. Does the master branch do the read-based calls? Maybe I am missing something here; could you point me to any resources that clear this up?
This is a bug with how we create SNP tables.
In #67, we introduced code to prevent us from looping too many times if we entered a repeat. This is useful for regions with short repeats, but in repetitive regions, we probably want smarter constraints, e.g., estimating the repeat count.
Is there an explanation of the isCallable value for a VariantCall? It seems fixed across the different variant calling algorithms, but I don't understand what it represents.
Reported on the ADAM mailing list. If MapQ is not set, we get the following NPE:
15/02/16 17:38:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 80, ISTB1-L2-B14-05.hadoop.priv): java.lang.NullPointerException
at scala.Predef$.Integer2int(Predef.scala:392)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:55)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:165)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:165)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This is appearing on 35a6035.
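The trace points at unboxing a null mapping quality (scala.Predef$.Integer2int). A defensive sketch, wrapping the nullable field in Option; DefaultMapq is a placeholder I made up, not an existing avocado constant:

```scala
// Sketch only: guard against a null mapq before unboxing. Whether a fallback
// value is even the right policy (vs. skipping the read) is an open question.
val DefaultMapq: Int = 0

def mapqOrDefault(mapq: java.lang.Integer): Int =
  Option(mapq).map(_.intValue).getOrElse(DefaultMapq)
```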
Now that bigdatagenomics/utils#19 and bigdatagenomics/adam#478 are merged, let's use @nfergu's superb timers to instrument avocado!
Avocado should only depend on adam-core and one of {bdg,adam}-format
bigdatagenomics/adam#567 removed all of the *Contexts except for ADAMContext; we now need to remove these references from avocado.
Reported via the ADAM mailing list.
Paging @massie
When we build avocado with Hadoop version set to 1.0.4 (necessary for default spark/ec2 deploy scripts), avocado throws an error:
Exception in thread "main" java.lang.NoSuchMethodError: org.codehaus.jackson.type.JavaType.<init>(Ljava/lang/Class;)V
at org.codehaus.jackson.map.type.SimpleType.<init>(SimpleType.java:36)
at org.codehaus.jackson.map.type.SimpleType.<clinit>(SimpleType.java:20)
at org.codehaus.jackson.map.type.TypeFactory.<init>(TypeFactory.java:42)
at org.codehaus.jackson.map.type.TypeFactory.<clinit>(TypeFactory.java:15)
at org.codehaus.jackson.map.ObjectMapper.<clinit>(ObjectMapper.java:42)
at org.apache.avro.Schema.<clinit>(Schema.java:80)
at org.apache.avro.generic.GenericData.<clinit>(GenericData.java:862)
at org.apache.avro.specific.SpecificDatumReader.<init>(SpecificDatumReader.java:31)
at org.bdgenomics.adam.serialization.AvroSerializer.<init>(ADAMKryoRegistrator.scala:38)
at org.bdgenomics.adam.serialization.ADAMKryoRegistrator.registerClasses(ADAMKryoRegistrator.scala:68)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:64)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:61)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:61)
at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:116)
at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:79)
at org.apache.spark.broadcast.HttpBroadcast$.write(HttpBroadcast.scala:144)
at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:44)
at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcast.scala:73)
at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcast.scala:69)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(Broadcast.scala:95)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:617)
at org.apache.spark.rdd.NewHadoopRDD.<init>(NewHadoopRDD.scala:59)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:471)
at org.bdgenomics.adam.rdd.ADAMContext.adamParquetLoad(ADAMContext.scala:200)
at org.bdgenomics.adam.rdd.ADAMContext.adamSequenceLoad(ADAMContext.scala:312)
@jey thinks this is some sort of dependency hell issue.
This was removed during re-packaging a few months back (I believe in #96).
With the refactoring from #96, we should be able to cleanly add a somatic variant caller in the genotyping stage.
Until we have a production ADAM release, we need to depend on the SNAPSHOT releases from ADAM, but we should look to move off of these.
Reported by @rnpandya. We can see the following error:
2014-06-23 13:48:49 ERROR Executor:95 - Exception in task ID 8
java.lang.IllegalArgumentException: Received region with contig ID chrM, but do not have a matching reference contig.
at org.bdgenomics.avocado.partitioners.DefaultPartitionSet.getPartition(DefaultPartitioner.scala:150)
at org.bdgenomics.avocado.partitioners.PartitionSet.getPartition(PartitionSet.scala:86)
at org.bdgenomics.avocado.calls.reads.ReadCallHaplotypes$$anonfun$22.apply(ReadCallHaplotypes.scala:478)
at org.bdgenomics.avocado.calls.reads.ReadCallHaplotypes$$anonfun$22.apply(ReadCallHaplotypes.scala:476)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
With the following sequence dictionaries:
Sequence dictionary from contigs:
SequenceDictionary{
chrM->16299}
Sequence dictionary from reads:
SequenceDictionary{
chr17->95052499
chr12->119596406
chr18->90827538
chr13->120421117
chrY->91744698
chr3->159377569
chr5->151694555
chr2->181993374
chr1->194532772
chr15->104036045
chrX->170682864
chr8->129558836
chr7->142600211
chr11->122260198
chr16->98095624
chr9->124827824
chr4->155544171
chr10->130854437
chr14->124162362
chr6->149307077
chrM->16299}
I believe that this is due to an issue with string encoding in the ADAM ReferenceRegion object.
Similar to bigdatagenomics/adam@ac75e76, we should move to configuration flags.
#96 temporarily removes the local assembler. The local assembler should be added back, preferably without a strong reliance on the HMM aligner, which is slow.
Currently, we emit genotype records for all observed sites (i.e., gVCF-style records). For cases where we don't want this, we should add a filter that drops "reference" calls.
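A hedged sketch of such a filter, with illustrative field shapes rather than the real avocado/ADAM genotype schema:

```scala
// A call is a "reference" call when every called allele matches the reference.
def isReferenceCall(calledAlleles: Seq[String], refAllele: String): Boolean =
  calledAlleles.nonEmpty && calledAlleles.forall(_ == refAllele)

// Drop reference calls; each call here is modeled as (calledAlleles, refAllele).
def dropReferenceCalls(calls: Seq[(Seq[String], String)]): Seq[(Seq[String], String)] =
  calls.filterNot { case (alleles, ref) => isReferenceCall(alleles, ref) }
```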
bigdatagenomics/adam#468 changed the alignment loading approach from adamLoad to loadAlignments. We need to move to the new method, as our tests are currently broken.
This happens if there's no alt/ref read coverage.
#72 adds code to align short reads with SNAP. We should generalize this code so that other aligners (e.g., BWA-MEM) can use the FASTQ distribution and BAM-->ADAM conversion.
Remove the single-sample caller, and generalize the pileup caller so that sufficient statistics can be merged in.
I haven't looked any deeper into this. Happy to provide input files to reproduce.
../avocado/bin/avocado-submit sim.bam.adam chr22.fa.adam sim.vcf.adam ../avocado/avocado-sample-configs/basic.properties
Spark assembly has been built with Hive, including Datanucleus jars on classpath
2015-04-06 18:47:48 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Loading reads in from sim.bam.adam
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2015-04-06 18:47:56 ERROR Executor:96 - Exception in task 3.0 in stage 3.0 (TID 6)
java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-04-06 18:47:56 WARN TaskSetManager:71 - Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-04-06 18:47:56 ERROR TaskSetManager:75 - Task 3 in stage 3.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 6
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: reading another 6 footers
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.SplitStrategy: Using Client Side Metadata Split Strategy
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ClientSideMetadataSplitStrategy: There were no row groups that could be dropped due to filter predicates
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 2
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: reading another 2 footers
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.SplitStrategy: Using Client Side Metadata Split Strategy
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ClientSideMetadataSplitStrategy: There were no row groups that could be dropped due to filter predicates
Apr 6, 2015 6:47:53 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Apr 6, 2015 6:47:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5131 records.
Apr 6, 2015 6:47:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Apr 6, 2015 6:47:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 75 ms. row count = 5131
Apr 6, 2015 6:47:54 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Apr 6, 2015 6:47:54 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5131 records.
Apr 6, 2015 6:47:54 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Apr 6, 2015 6:47:54 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 20 ms. row count = 5131
2015-04-06 18:47:56 ERROR Executor:96 - Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-04-06 18:47:56 ERROR Executor:96 - Exception in task 2.0 in stage 3.0 (TID 5)
java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-04-06 18:47:56 ERROR Executor:96 - Exception in task 1.0 in stage 3.0 (TID 4)
java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Right now, only the variants are saved as output, but not the genotypes.
Remove the dependency on configuration from the avocado-core module.
Users should be able to provide a BED file to mask the reference genome if they only want to process a subset of the genome (e.g., the exome, or a targeted sequencing panel).
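A minimal sketch of the masking logic, assuming standard 0-based, end-exclusive BED semantics; the types here are illustrative, not ADAM's ReferenceRegion:

```scala
// One BED interval: (contig, start, end), 0-based and end-exclusive.
case class BedInterval(contig: String, start: Long, end: Long)

def parseBedLine(line: String): BedInterval = {
  val f = line.split('\t')
  BedInterval(f(0), f(1).toLong, f(2).toLong)
}

// A site is kept only if it falls inside some masked interval.
def isMasked(intervals: Seq[BedInterval], contig: String, pos: Long): Boolean =
  intervals.exists(i => i.contig == contig && pos >= i.start && pos < i.end)
```

A linear scan like this is fine for a sketch; a real implementation would want an interval tree or a broadcast join over regions.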
From @fnothaft:
"I think we'd want to generally fix how we do the Haplotype scoring, as the current scoring system bumps the haplotype pair score for reads that could map ambiguously to either haplotype in the pair. I'm working on a proposal for this, that I hope to distribute for comment early next week."
Currently, we only have a config file for running avocado with SNAP as a frontend (https://github.com/bigdatagenomics/avocado/blob/master/avocado-sample-configs/snap-basic.properties); in reality, most people will just want to run from previously aligned reads, so we should add a config file for that case.
Do we need both of these name fields? It seems the user-defined name is only shorthand to set config, but we require the user to specify the full algorithm anyway, so having two different names (the user-defined name and the fixed name in the class) may be confusing.
PR bigdatagenomics/adam#418 in the main ADAM repo removed the ADAM prefix from a number of class names, as well as a few method names. The corresponding references need to be fixed here in avocado.
We've got a couple of merged branches with no commits on them: https://github.com/bigdatagenomics/avocado/branches?merged=1. Can we delete these branches?
The current ADAM master and release branches use ADAMNucleotideContig instead.
We run into an error when converting back to VCF because HTSJDK doesn't like our (admittedly poor) deletion notation.
An ADAM 0.13.0 release will be forthcoming; in the meanwhile, we need to update to the latest 0.12.2-SNAPSHOT changes to fix the build.
Small nit, but we aren't getting the timing details from inside of the BiallelicGenotyper.
We need a method for generating (from normal exome sequencing data) "synthetic" BAMs that can be used to test the sensitivity/specificity of parameter settings in somatic genotypers. E.g., here is the MuTect description of such a utility, "SomaticSpike":
Using the published high-confidence single-nucleotide polymorphism (SNP) genotypes for those samples from the 1000 Genomes Project, we identified a set of sites that are heterozygous in NA12891 and homozygous for the reference in NA12878. We then used a second utility, SomaticSpike, which is part of the MuTect software package, to perform a mixing experiment in silico. At each of the selected sites, this utility attempts to replace a number of reads determined by a binomial distribution using a specified allelic fraction in the NA12878 data with reads from the NA12891 data, therefore simulating a somatic mutation of known location, type and expected allele fraction. If there are not enough reads in NA12891 to replace the desired reads in NA12878, the site is skipped. The output of this process is a virtual tumor BAM with the in silico variants and a set of locations of those variants. Sensitivity is then estimated by attempting to detect mutations at these sites.
This is necessary for closing out #127.
Currently Maven looks for version 0.1.2-SNAPSHOT, but apparently this artifact cannot be found.
[ERROR] Failed to execute goal on project avocado-core: Could not resolve dependencies for project org.bdgenomics.avocado:avocado-core:jar:0.0.3-SNAPSHOT: The following artifacts could not be resolved: org.bdgenomics.bdg-utils:bdg-utils-metrics:jar:0.1.2-SNAPSHOT, org.bdgenomics.bdg-utils:bdg-utils-misc:jar:tests:0.1.2-SNAPSHOT: Could not find artifact org.bdgenomics.bdg-utils:bdg-utils-metrics:jar:0.1.2-SNAPSHOT in Sonatype (http://oss.sonatype.org/content/repositories/snapshots/) -> [Help 1]
Use the new version that is available: https://oss.sonatype.org/content/repositories/snapshots/org/bdgenomics/bdg-utils/bdg-utils-metrics/
Change line 26 of pom.xml as follows:
<utils.version>0.1.3-SNAPSHOT</utils.version>
Several things to do:
genotypeQuality --> conditional genotype probability
nonReferenceLikelihoods
strandBiasComponents