avocado's People

Contributors

andrewmchen, arahuja, brielin, davidonlaptop, fnothaft, hammer, heuermh, jey, jondeaton, jstjohn, massie, nealsid, peterhj, rnpandya, tdanford, timodonnell

avocado's Issues

Add flanking sequence collection to groupBy phase of assembly

#56 added a parameterized flanking sequence length to the KmerGraph. We need to extend the groupBy phase that collects all reads associated with a region so that it also collects the reads contained in the flanking region. Additionally, we need a strategy for handling calls made inside the flanking region; a first approach could simply filter out calls that overlap the flank.
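
A minimal sketch of that first approach, with placeholder types standing in for the actual avocado call and region classes:

    // All names here (VariantCall, position, flankLength) are illustrative
    // placeholders, not the actual avocado types.
    case class VariantCall(position: Long)

    // Keep only calls whose position lies in the core of the region,
    // outside the flanking sequence on either end.
    def dropFlankCalls(calls: Seq[VariantCall],
                       regionStart: Long,
                       regionEnd: Long,
                       flankLength: Int): Seq[VariantCall] =
      calls.filter(c => c.position >= regionStart + flankLength &&
                        c.position < regionEnd - flankLength)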

Support loading multiple read files

We should support loading multiple read files at once, along with a way to define which samples are in each file. For the AlignedReadsInputStage, I propose that we allow the following input:

<SAMPLES>:<PATH>, ... , <SAMPLES>:<PATH>

We'd then load an RDD per sample, and aggregate across files as necessary.
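
A sketch of how the proposed spec could be parsed, assuming ':' never appears in the paths; the commented loader is a stand-in for whatever the AlignedReadsInputStage actually uses:

    // Parse "<SAMPLES>:<PATH>, ... ,<SAMPLES>:<PATH>" into (samples, path) pairs.
    def parseInputSpec(spec: String): Seq[(String, String)] =
      spec.split(',').toSeq.map(_.trim).map { entry =>
        val Array(samples, path) = entry.split(':')  // assumes ':' not in paths
        (samples, path)
      }

    // one RDD per entry, aggregated afterwards (loadAlignments is hypothetical):
    // val reads = sc.union(parseInputSpec(spec).map { case (_, path) =>
    //   sc.loadAlignments(path)
    // })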

Update caching strategy

Our current RDD caching strategy can lead to out-of-memory errors. We should revise it to unpersist old data more aggressively. We removed some caching in #70.
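
One pattern for unpersisting more aggressively, sketched under the assumption that each stage only needs its immediate parent:

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Cache the derived RDD, force it, then free the parent immediately,
    // rather than leaving stale RDDs pinned in the cache.
    def deriveAndRelease[T, U: ClassTag](parent: RDD[T])(f: T => Iterable[U]): RDD[U] = {
      val child = parent.flatMap(f).cache()
      child.count()        // materialize before dropping the parent
      parent.unpersist()   // old data no longer needed downstream
      child
    }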

Understanding Avocado

I am trying to understand how Avocado works. Is there a documentation page for how the code (the master branch) is laid out?

I am looking at a report available online (http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project8_report.pdf), and trying to map what I read there to the code in the repository. But it seems that things have somewhat diverged from the report.

For example, where do the assembly-based calls happen? There seems to be something in the algorithms/ folder, but nothing seems to call it. Does the master branch do the read-based calls?

Maybe I am missing something here; could you point me to any resources that clear this up?

Explanation of isCallable

Is there an explanation of the isCallable value for a VariantCall? The value seems to be fixed across the different variant calling algorithms, but I don't understand what it represents.

BiallelicGenotyper throws NPE if MapQ is null

Reported on the ADAM mailing list. If MapQ is not set, we get the following NPE:

15/02/16 17:38:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 80, ISTB1-L2-B14-05.hadoop.priv): java.lang.NullPointerException
        at scala.Predef$.Integer2int(Predef.scala:392)
        at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:55)
        at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
        at org.apache.spark.rdd.Timer.time(Timer.scala:57)
        at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
        at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:165)
        at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:165)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

This appears on commit 35a6035.
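
The NPE comes from Predef.Integer2int unboxing a null java.lang.Integer. A minimal sketch of a guard, assuming the bdg-formats getMapq accessor and leaving the fallback policy open:

    // getMapq returns a java.lang.Integer that may be null; unboxing it
    // through Predef.Integer2int is what throws above.
    def safeMapq(read: org.bdgenomics.formats.avro.AlignmentRecord): Option[Int] =
      Option(read.getMapq).map(_.intValue)

    // callers can then skip the read or choose a conservative default:
    // safeMapq(read).getOrElse(0)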

Avocado cannot run with Hadoop 1.0.4.

Paging @massie

When we build avocado with the Hadoop version set to 1.0.4 (necessary for the default spark/ec2 deploy scripts), avocado throws an error:

Exception in thread "main" java.lang.NoSuchMethodError: org.codehaus.jackson.type.JavaType.<init>(Ljava/lang/Class;)V
    at org.codehaus.jackson.map.type.SimpleType.<init>(SimpleType.java:36)
    at org.codehaus.jackson.map.type.SimpleType.<clinit>(SimpleType.java:20)
    at org.codehaus.jackson.map.type.TypeFactory.<init>(TypeFactory.java:42)
    at org.codehaus.jackson.map.type.TypeFactory.<clinit>(TypeFactory.java:15)
    at org.codehaus.jackson.map.ObjectMapper.<clinit>(ObjectMapper.java:42)
    at org.apache.avro.Schema.<clinit>(Schema.java:80)
    at org.apache.avro.generic.GenericData.<clinit>(GenericData.java:862)
    at org.apache.avro.specific.SpecificDatumReader.<init>(SpecificDatumReader.java:31)
    at org.bdgenomics.adam.serialization.AvroSerializer.<init>(ADAMKryoRegistrator.scala:38)
    at org.bdgenomics.adam.serialization.ADAMKryoRegistrator.registerClasses(ADAMKryoRegistrator.scala:68)
    at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:64)
    at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:61)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:61)
    at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:116)
    at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:79)
    at org.apache.spark.broadcast.HttpBroadcast$.write(HttpBroadcast.scala:144)
    at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:44)
    at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcast.scala:73)
    at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcast.scala:69)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(Broadcast.scala:95)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:617)
    at org.apache.spark.rdd.NewHadoopRDD.<init>(NewHadoopRDD.scala:59)
    at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:471)
    at org.bdgenomics.adam.rdd.ADAMContext.adamParquetLoad(ADAMContext.scala:200)
    at org.bdgenomics.adam.rdd.ADAMContext.adamSequenceLoad(ADAMContext.scala:312)

@jey thinks this is some sort of dependency hell issue.

Add a somatic "genotyper"

With the refactoring from #96, we should be able to cleanly add a somatic variant caller in the genotyping stage.

Partitioning can fail

Reported by @rnpandya. We can see the following error:

2014-06-23 13:48:49 ERROR Executor:95 - Exception in task ID 8
java.lang.IllegalArgumentException: Received region with contig ID chrM, but do not have a matching reference contig.
at org.bdgenomics.avocado.partitioners.DefaultPartitionSet.getPartition(DefaultPartitioner.scala:150)
at org.bdgenomics.avocado.partitioners.PartitionSet.getPartition(PartitionSet.scala:86)
at org.bdgenomics.avocado.calls.reads.ReadCallHaplotypes$$anonfun$22.apply(ReadCallHaplotypes.scala:478)
at org.bdgenomics.avocado.calls.reads.ReadCallHaplotypes$$anonfun$22.apply(ReadCallHaplotypes.scala:476)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

With the following sequence dictionaries:

Sequence dictionary from contigs:
SequenceDictionary{
chrM->16299}
Sequence dictionary from reads:
SequenceDictionary{
chr17->95052499
chr12->119596406
chr18->90827538
chr13->120421117
chrY->91744698
chr3->159377569
chr5->151694555
chr2->181993374
chr1->194532772
chr15->104036045
chrX->170682864
chr8->129558836
chr7->142600211
chr11->122260198
chr16->98095624
chr9->124827824
chr4->155544171
chr10->130854437
chr14->124162362
chr6->149307077
chrM->16299}

I believe that this is due to an issue with string encoding in the ADAM ReferenceRegion object.
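
If so, a hedged illustration of the suspected mechanism: Avro hands string fields back as org.apache.avro.util.Utf8 (a CharSequence), and a Utf8 never equals a java.lang.String, so a contig-name lookup can miss even though chrM appears in both dictionaries above.

    import org.apache.avro.util.Utf8

    val avroName: CharSequence = new Utf8("chrM")
    "chrM".equals(avroName)        // false: a String never equals a Utf8
    "chrM" == avroName.toString    // true: normalize to String before comparing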

Add back the local assembler

#96 temporarily removes the local assembler. The local assembler should be added back, preferably without a strong reliance on the HMM aligner, which is slow.

Add "gVCF" filter

Currently, we emit genotype records for all observed sites (i.e., gVCF-style records). For cases where this is not desired, we should have a filter that removes "reference" calls.
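
A minimal sketch of such a filter against the bdg-formats Genotype schema; the GenotypeAllele constant name is an assumption (it has varied across bdg-formats versions):

    import scala.collection.JavaConverters._
    import org.bdgenomics.formats.avro.{ Genotype, GenotypeAllele }

    // A genotype is a "reference" call if every called allele is Ref.
    def isReferenceCall(gt: Genotype): Boolean =
      gt.getAlleles.asScala.forall(_ == GenotypeAllele.Ref)

    // with the gVCF behavior off, keep only variant sites:
    // val variantSites = genotypes.filter(gt => !isReferenceCall(gt))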

Generalize alignment first stage

#72 adds code to align short reads with SNAP. We should generalize this code so that other aligners (e.g., BWA-MEM) can use the FASTQ distribution and BAM-->ADAM conversion.

Refactor mpileup style call

Remove the single-sample call, and generalize the pileup caller so that sufficient statistics can be merged in.
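
A sketch of what mergeable sufficient statistics could look like; all names here are hypothetical, not the pileup caller's actual fields:

    // Per-site summary with an associative merge, so pileups from multiple
    // samples or files can be reduced together.
    case class SiteStats(refCount: Int, altCount: Int, sumBaseQuality: Long) {
      def merge(that: SiteStats): SiteStats =
        SiteStats(refCount + that.refCount,
                  altCount + that.altCount,
                  sumBaseQuality + that.sumBaseQuality)
    }

    // per-position statistics then combine with a single shuffle:
    // statsByPosition.reduceByKey(_ merge _)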

Avocado fails with an NPE when converting bam.adam to VCF

I haven't looked any deeper into this. Happy to provide input files to reproduce.
../avocado/bin/avocado-submit sim.bam.adam chr22.fa.adam sim.vcf.adam ../avocado/avocado-sample-configs/basic.properties
Spark assembly has been built with Hive, including Datanucleus jars on classpath
2015-04-06 18:47:48 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Loading reads in from sim.bam.adam
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2015-04-06 18:47:56 ERROR Executor:96 - Exception in task 3.0 in stage 3.0 (TID 6)
java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-04-06 18:47:56 WARN TaskSetManager:71 - Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

2015-04-06 18:47:56 ERROR TaskSetManager:75 - Task 3 in stage 3.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 6
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: reading another 6 footers
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.SplitStrategy: Using Client Side Metadata Split Strategy
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ClientSideMetadataSplitStrategy: There were no row groups that could be dropped due to filter predicates
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 2
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: reading another 2 footers
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.SplitStrategy: Using Client Side Metadata Split Strategy
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ClientSideMetadataSplitStrategy: There were no row groups that could be dropped due to filter predicates
Apr 6, 2015 6:47:53 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Apr 6, 2015 6:47:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5131 records.
Apr 6, 2015 6:47:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Apr 6, 2015 6:47:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 75 ms. row count = 5131
Apr 6, 2015 6:47:54 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Apr 6, 2015 6:47:54 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5131 records.
Apr 6, 2015 6:47:54 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Apr 6, 2015 6:47:54 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 20 ms. row count = 5131

Add reference masking

Users should be able to provide a BED file to mask the reference genome if they only want to process a subset of the genome (e.g., the exome, or a targeted sequencing panel).
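
A sketch of loading such a mask, assuming a minimal three-column BED and ADAM's ReferenceRegion model; the eventual interface may differ:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.bdgenomics.adam.models.ReferenceRegion

    // BED is 0-based and half-open, which lines up with ReferenceRegion.
    def loadBedMask(sc: SparkContext, path: String): RDD[ReferenceRegion] =
      sc.textFile(path)
        .filter(line => line.nonEmpty && !line.startsWith("#"))
        .map { line =>
          val fields = line.split('\t')
          ReferenceRegion(fields(0), fields(1).toLong, fields(2).toLong)
        }

    // reads overlapping no mask region would then be dropped before
    // discovery/genotyping.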

Improve HaplotypePair scoring

From @fnothaft:
"I think we'd want to generally fix how we do the Haplotype scoring, as the current scoring system bumps the haplotype pair score for reads that could map ambiguously to either haplotype in the pair. I'm working on a proposal for this, that I hope to distribute for comment early next week."

Configured input has user-defined name and fixed 'name' field

Do we need both of these name fields? It seems the user-defined name is only shorthand for setting config, but we require the user to specify the full algorithm anyway, so having two names (the user-defined name and the fixed name in the class) may be confusing.

Fix deletion caller.

We run into an error when converting back to VCF because HTSJDK doesn't like our (admittedly poor) deletion notation.

Utility for generating artificial tumor/normal sequence data

We need a method for generating (from normal exome sequencing data) "synthetic" BAMs that can be used to test the sensitivity/specificity of parameter settings in somatic genotypers. E.g., here is the MuTect description of its "SomaticSpike" utility:

Using the published high-confidence single-nucleotide polymorphism (SNP) genotypes for those samples from the 1000 Genomes Project, we identified a set of sites that are heterozygous in NA12891 and homozygous for the reference in NA12878. We then used a second utility, SomaticSpike, which is part of the MuTect software package, to perform a mixing experiment in silico. At each of the selected sites, this utility attempts to replace a number of reads determined by a binomial distribution using a specified allelic fraction in the NA12878 data with reads from the NA12891 data, therefore simulating a somatic mutation of known location, type and expected allele fraction. If there are not enough reads in NA12891 to replace the desired reads in NA12878, the site is skipped. The output of this process is a virtual tumor BAM with the in silico variants and a set of locations of those variants. Sensitivity is then estimated by attempting to detect mutations at these sites.
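
The core read-replacement step is small; here is a hedged sketch with placeholder types (this does not reflect the actual SomaticSpike implementation):

    import scala.util.Random

    case class Read(name: String)  // placeholder for a real read record

    // Draw the number of reads to swap at a site from Binomial(n, fraction).
    def binomialDraw(n: Int, p: Double, rng: Random): Int =
      (0 until n).count(_ => rng.nextDouble() < p)

    // Replace k normal reads with donor reads; skip the site (None) if the
    // donor cannot supply enough reads, as described above.
    def spikeSite(normal: Seq[Read], donor: Seq[Read],
                  alleleFraction: Double, rng: Random): Option[Seq[Read]] = {
      val k = binomialDraw(normal.size, alleleFraction, rng)
      if (donor.size < k) None
      else Some(rng.shuffle(donor).take(k) ++ normal.drop(k))
    }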

The build should use version 0.1.3-SNAPSHOT of bdg-utils-metrics:jar

Issue

Currently, Maven looks for version 0.1.2-SNAPSHOT, but this artifact can apparently no longer be found.

Steps to reproduce:

  1. Fetch master branch
  2. mvn package

Error Output

[ERROR] Failed to execute goal on project avocado-core: Could not resolve dependencies for project org.bdgenomics.avocado:avocado-core:jar:0.0.3-SNAPSHOT: The following artifacts could not be resolved: org.bdgenomics.bdg-utils:bdg-utils-metrics:jar:0.1.2-SNAPSHOT, org.bdgenomics.bdg-utils:bdg-utils-misc:jar:tests:0.1.2-SNAPSHOT: Could not find artifact org.bdgenomics.bdg-utils:bdg-utils-metrics:jar:0.1.2-SNAPSHOT in Sonatype (http://oss.sonatype.org/content/repositories/snapshots/) -> [Help 1]

Potential resolution

Use the new version that is available: https://oss.sonatype.org/content/repositories/snapshots/org/bdgenomics/bdg-utils/bdg-utils-metrics/

Change line 26 of pom.xml as follows:
<utils.version>0.1.3-SNAPSHOT</utils.version>
