bigdatagenomics / avocado
A Variant Caller, Distributed. Apache 2 licensed.
Home Page: http://bdgenomics.org/projects/avocado/
License: Apache License 2.0
#56 added a parametrized flanking sequence length to the KmerGraph. We need to extend our groupBy phase, which collects all the reads associated with a region, to also collect the reads contained in the flanking region. Additionally, we need a strategy for handling calls made inside the flanking region; a first approach could simply filter out calls that overlap the flank.
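A minimal sketch of that first filtering approach, assuming a region is a half-open [start, end) interval and the flank length is known; the names here are illustrative, not avocado API:

```scala
// Sketch only: keep a call if its position lands in the core of the region,
// i.e., outside the flanking sequence on either end. Assumes a half-open
// [regionStart, regionEnd) interval.
def insideCore(regionStart: Long, regionEnd: Long, flank: Long)(pos: Long): Boolean =
  pos >= regionStart + flank && pos < regionEnd - flank
```

Calls for which insideCore returns false would be dropped before genotypes are emitted.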
We should support a way to load multiple read files at once, and to define which samples are in each file. For the AlignedReadsInputStage, I propose that we allow the following input:
<SAMPLES>:<PATH>, ... , <SAMPLES>:<PATH>
We'd then load an RDD per sample, and aggregate across files as is necessary.
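A rough sketch of parsing the proposed format, assuming entries are comma-separated and (purely as an illustration, since the issue doesn't specify it) that multiple samples within one entry are semicolon-separated:

```scala
// Hypothetical parsed form of one <SAMPLES>:<PATH> entry; not an avocado type.
case class ReadFileSpec(samples: Seq[String], path: String)

def parseInputSpec(spec: String): Seq[ReadFileSpec] = {
  spec.split(",").map(_.trim).map { entry =>
    // Split only on the first ':' so paths containing ':' (e.g. hdfs://) survive.
    val idx = entry.indexOf(':')
    require(idx > 0, s"Malformed entry: $entry")
    ReadFileSpec(entry.take(idx).split(';').toSeq, entry.drop(idx + 1))
  }.toSeq
}
```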
A base set of integration tests (or expanded unit tests) would connect the ReadExplorer and BiallelicGenotyper from #96 together.
When we have very high coverage data, we run into floating point underflow in the biallelic genotyper. An easy way to fix this is by moving to log likelihoods.
See this paper: http://www.nature.com/nbt/journal/v31/n3/abs/nbt.2514.html
(and the supplementary information) for details.
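To illustrate the underflow with made-up numbers (not real likelihoods): multiplying many per-read probabilities underflows a Double, while summing their logs stays finite.

```scala
// Naive product of per-read likelihoods: underflows to 0.0 at high coverage.
def likelihoodProduct(perRead: Seq[Double]): Double =
  perRead.product

// The same quantity in log space: a sum of logs, which stays finite.
def logLikelihood(perRead: Seq[Double]): Double =
  perRead.map(math.log).sum
```

With, e.g., 500 reads each contributing a probability of 0.001, the direct product underflows to 0.0, while the log-space value is simply 500 * ln(0.001).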
HaplotypePair.perReadLikelihoods depends on the composing Haplotypes having their perReadLikelihoods set, which is not guaranteed.
Our current RDD caching strategy can lead to out-of-memory errors. We should revise it to more aggressively unpersist old data. We removed some caching in #70.
I am trying to understand how Avocado works. Is there a documentation page for how the code (the master branch) is laid out?
I am looking at a report available online (http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project8_report.pdf), and trying to map what I read there to the code in the repository. But it seems that things have somewhat diverged from the report.
For example, where do the assembly-based calls happen? There seems to be something in the algorithms/ folder, but nothing seems to call it. Does the master branch do the read-based calls? Maybe I am missing something here; could you point me to any resources that clear this up?
This is a bug with how we create SNP tables.
In #67, we introduced code to prevent us from looping too many times if we entered a repeat. This is useful for regions with short repeats, but in repetitive regions, we probably want smarter constraints, e.g., estimating the repeat count.
Is there an explanation of the isCallable value for a VariantCall? It seems fixed across the different variant calling algorithms, but I don't understand what it represents.
Reported on the ADAM mailing list. If MapQ is not set, we get the following NPE:
15/02/16 17:38:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 80, ISTB1-L2-B14-05.hadoop.priv): java.lang.NullPointerException
at scala.Predef$.Integer2int(Predef.scala:392)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:55)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:165)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:165)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This is appearing on 35a6035.
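The trace points at unboxing a null mapping quality (scala.Predef$.Integer2int). A defensive sketch, wrapping the nullable field in Option; DefaultMapq is a placeholder I made up, not an existing avocado constant:

```scala
// Sketch only: guard against a null mapq before unboxing. Whether a fallback
// value is even the right policy (vs. skipping the read) is an open question.
val DefaultMapq: Int = 0

def mapqOrDefault(mapq: java.lang.Integer): Int =
  Option(mapq).map(_.intValue).getOrElse(DefaultMapq)
```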
Now that bigdatagenomics/utils#19 and bigdatagenomics/adam#478 are merged, let's use @nfergu's superb timers to instrument avocado!
Avocado should only depend on adam-core and one of {bdg,adam}-format
bigdatagenomics/adam#567 removed all of the *Contexts except for ADAMContext; we now need to remove these references from avocado.
Reported via the ADAM mailing list.
Paging @massie
When we build avocado with Hadoop version set to 1.0.4 (necessary for default spark/ec2 deploy scripts), avocado throws an error:
Exception in thread "main" java.lang.NoSuchMethodError: org.codehaus.jackson.type.JavaType.<init>(Ljava/lang/Class;)V
at org.codehaus.jackson.map.type.SimpleType.<init>(SimpleType.java:36)
at org.codehaus.jackson.map.type.SimpleType.<clinit>(SimpleType.java:20)
at org.codehaus.jackson.map.type.TypeFactory.<init>(TypeFactory.java:42)
at org.codehaus.jackson.map.type.TypeFactory.<clinit>(TypeFactory.java:15)
at org.codehaus.jackson.map.ObjectMapper.<clinit>(ObjectMapper.java:42)
at org.apache.avro.Schema.<clinit>(Schema.java:80)
at org.apache.avro.generic.GenericData.<clinit>(GenericData.java:862)
at org.apache.avro.specific.SpecificDatumReader.<init>(SpecificDatumReader.java:31)
at org.bdgenomics.adam.serialization.AvroSerializer.<init>(ADAMKryoRegistrator.scala:38)
at org.bdgenomics.adam.serialization.ADAMKryoRegistrator.registerClasses(ADAMKryoRegistrator.scala:68)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:64)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:61)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:61)
at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:116)
at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:79)
at org.apache.spark.broadcast.HttpBroadcast$.write(HttpBroadcast.scala:144)
at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:44)
at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcast.scala:73)
at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcast.scala:69)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(Broadcast.scala:95)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:617)
at org.apache.spark.rdd.NewHadoopRDD.<init>(NewHadoopRDD.scala:59)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:471)
at org.bdgenomics.adam.rdd.ADAMContext.adamParquetLoad(ADAMContext.scala:200)
at org.bdgenomics.adam.rdd.ADAMContext.adamSequenceLoad(ADAMContext.scala:312)
@jey thinks this is some sort of dependency hell issue.
This was removed during re-packaging a few months back (I believe in #96).
With the refactoring from #96, we should be able to cleanly add a somatic variant caller in the genotyping stage.
Until we have a production ADAM release, we need to depend on the SNAPSHOT releases from ADAM, but we should look to move off of these.
Reported by @rnpandya. We can see the following error:
2014-06-23 13:48:49 ERROR Executor:95 - Exception in task ID 8
java.lang.IllegalArgumentException: Received region with contig ID chrM, but do not have a matching reference contig.
at org.bdgenomics.avocado.partitioners.DefaultPartitionSet.getPartition(DefaultPartitioner.scala:150)
at org.bdgenomics.avocado.partitioners.PartitionSet.getPartition(PartitionSet.scala:86)
at org.bdgenomics.avocado.calls.reads.ReadCallHaplotypes$$anonfun$22.apply(ReadCallHaplotypes.scala:478)
at org.bdgenomics.avocado.calls.reads.ReadCallHaplotypes$$anonfun$22.apply(ReadCallHaplotypes.scala:476)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
With the following sequence dictionaries:
Sequence dictionary from contigs:
SequenceDictionary{
chrM->16299}
Sequence dictionary from reads:
SequenceDictionary{
chr17->95052499
chr12->119596406
chr18->90827538
chr13->120421117
chrY->91744698
chr3->159377569
chr5->151694555
chr2->181993374
chr1->194532772
chr15->104036045
chrX->170682864
chr8->129558836
chr7->142600211
chr11->122260198
chr16->98095624
chr9->124827824
chr4->155544171
chr10->130854437
chr14->124162362
chr6->149307077
chrM->16299}
I believe that this is due to an issue with string encoding in the ADAM ReferenceRegion object.
Similar to bigdatagenomics/adam@ac75e76, we should move to configuration flags.
#96 temporarily removes the local assembler. The local assembler should be added back, preferably without a strong reliance on the HMM aligner, which is slow.
Currently, we emit genotype records for all observed sites (i.e., gVCF-style records). For cases where we don't want this, we should add a filter that drops "reference" calls.
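A hedged sketch of such a filter, with illustrative field shapes rather than the real avocado/ADAM genotype schema:

```scala
// A call is a "reference" call when every called allele matches the reference.
def isReferenceCall(calledAlleles: Seq[String], refAllele: String): Boolean =
  calledAlleles.nonEmpty && calledAlleles.forall(_ == refAllele)

// Drop reference calls; each call here is modeled as (calledAlleles, refAllele).
def dropReferenceCalls(calls: Seq[(Seq[String], String)]): Seq[(Seq[String], String)] =
  calls.filterNot { case (alleles, ref) => isReferenceCall(alleles, ref) }
```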
bigdatagenomics/adam#468 changed the alignment loading approach from adamLoad to loadAlignments. We need to move to the new method, as our tests are currently broken.
This happens if there's no alt/ref read coverage.
#72 adds code to align short reads with SNAP. We should generalize this code so that other aligners (e.g., BWA-MEM) can use the FASTQ distribution and BAM-->ADAM conversion.
Remove the single-sample caller, and generalize the pileup caller so that sufficient statistics can be merged in.
I haven't looked any deeper into this. Happy to provide input files to reproduce.
../avocado/bin/avocado-submit sim.bam.adam chr22.fa.adam sim.vcf.adam ../avocado/avocado-sample-configs/basic.properties
Spark assembly has been built with Hive, including Datanucleus jars on classpath
2015-04-06 18:47:48 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Loading reads in from sim.bam.adam
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2015-04-06 18:47:56 ERROR Executor:96 - Exception in task 3.0 in stage 3.0 (TID 6)
java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-04-06 18:47:56 WARN TaskSetManager:71 - Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-04-06 18:47:56 ERROR TaskSetManager:75 - Task 3 in stage 3.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 6
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: reading another 6 footers
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.SplitStrategy: Using Client Side Metadata Split Strategy
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ClientSideMetadataSplitStrategy: There were no row groups that could be dropped due to filter predicates
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 2
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: reading another 2 footers
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.SplitStrategy: Using Client Side Metadata Split Strategy
Apr 6, 2015 6:47:51 PM INFO: parquet.hadoop.ClientSideMetadataSplitStrategy: There were no row groups that could be dropped due to filter predicates
Apr 6, 2015 6:47:53 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Apr 6, 2015 6:47:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5131 records.
Apr 6, 2015 6:47:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Apr 6, 2015 6:47:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 75 ms. row count = 5131
Apr 6, 2015 6:47:54 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Apr 6, 2015 6:47:54 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5131 records.
Apr 6, 2015 6:47:54 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Apr 6, 2015 6:47:54 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 20 ms. row count = 5131
2015-04-06 18:47:56 ERROR Executor:96 - Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-04-06 18:47:56 ERROR Executor:96 - Exception in task 2.0 in stage 3.0 (TID 5)
java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-04-06 18:47:56 ERROR Executor:96 - Exception in task 1.0 in stage 3.0 (TID 4)
java.lang.NullPointerException
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:54)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$readToObservations$1.apply(ReadExplorer.scala:46)
at org.apache.spark.rdd.Timer.time(Timer.scala:57)
at org.bdgenomics.avocado.discovery.ReadExplorer.readToObservations(ReadExplorer.scala:46)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at org.bdgenomics.avocado.discovery.ReadExplorer$$anonfun$discover$1$$anonfun$apply$3.apply(ReadExplorer.scala:170)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Right now, only the variants are saved as output, but not the genotypes.
Remove the dependency on configuration from the avocado-core module.
Users should be able to provide a BED file to mask the reference genome if they only want to process a subset of the genome (e.g., the exome, or a targeted sequencing panel).
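A minimal sketch of the masking logic, assuming standard 0-based, end-exclusive BED semantics; the types here are illustrative, not ADAM's ReferenceRegion:

```scala
// One BED interval: (contig, start, end), 0-based and end-exclusive.
case class BedInterval(contig: String, start: Long, end: Long)

def parseBedLine(line: String): BedInterval = {
  val f = line.split('\t')
  BedInterval(f(0), f(1).toLong, f(2).toLong)
}

// A site is kept only if it falls inside some masked interval.
def isMasked(intervals: Seq[BedInterval], contig: String, pos: Long): Boolean =
  intervals.exists(i => i.contig == contig && pos >= i.start && pos < i.end)
```

A linear scan like this is fine for a sketch; a real implementation would want an interval tree or a broadcast join over regions.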
From @fnothaft:
"I think we'd want to generally fix how we do the Haplotype scoring, as the current scoring system bumps the haplotype pair score for reads that could map ambiguously to either haplotype in the pair. I'm working on a proposal for this, that I hope to distribute for comment early next week."
Currently, we only have a config file for running avocado with SNAP as a frontend (https://github.com/bigdatagenomics/avocado/blob/master/avocado-sample-configs/snap-basic.properties); in reality, most people will just want to run from previously aligned reads, so we should add a config file for that case.
Do we need both of these name fields? It seems the user-defined name is only shorthand to set config, but we require the user to specify the full algorithm anyway, so having two different names (the user-defined name and the fixed name in the class) may be confusing.
PR bigdatagenomics/adam#418 in the main ADAM repo removed the ADAM prefix from a number of class names, as well as a few method names. The corresponding references need to be fixed here in avocado.
We've got a couple of merged branches with no commits on them: https://github.com/bigdatagenomics/avocado/branches?merged=1. Can we delete these branches?
The current ADAM master and release branches use ADAMNucleotideContig instead.
We run into an error when converting back to VCF because HTSJDK doesn't like our (admittedly poor) deletion notation.
An ADAM 0.13.0 release will be forthcoming; in the meanwhile, we need to update to the latest 0.12.2-SNAPSHOT changes to fix the build.
Small nit, but we aren't getting the timing details from inside of the BiallelicGenotyper.
We need a method for generating (from normal exome sequencing data) "synthetic" BAMs that can be used to test the sensitivity/specificity of parameter settings in somatic genotypers. E.g., here is the MuTect description of such a utility, "SomaticSpike":
Using the published high-confidence single-nucleotide polymorphism (SNP) genotypes for those samples from the 1000 Genomes Project, we identified a set of sites that are heterozygous in NA12891 and homozygous for the reference in NA12878. We then used a second utility, SomaticSpike, which is part of the MuTect software package, to perform a mixing experiment in silico. At each of the selected sites, this utility attempts to replace a number of reads determined by a binomial distribution using a specified allelic fraction in the NA12878 data with reads from the NA12891 data, therefore simulating a somatic mutation of known location, type and expected allele fraction. If there are not enough reads in NA12891 to replace the desired reads in NA12878, the site is skipped. The output of this process is a virtual tumor BAM with the in silico variants and a set of locations of those variants. Sensitivity is then estimated by attempting to detect mutations at these sites.
This is necessary for closing out #127.
Currently Maven looks for version 0.1.2-SNAPSHOT, but apparently this artifact cannot be found.
[ERROR] Failed to execute goal on project avocado-core: Could not resolve dependencies for project org.bdgenomics.avocado:avocado-core:jar:0.0.3-SNAPSHOT: The following artifacts could not be resolved: org.bdgenomics.bdg-utils:bdg-utils-metrics:jar:0.1.2-SNAPSHOT, org.bdgenomics.bdg-utils:bdg-utils-misc:jar:tests:0.1.2-SNAPSHOT: Could not find artifact org.bdgenomics.bdg-utils:bdg-utils-metrics:jar:0.1.2-SNAPSHOT in Sonatype (http://oss.sonatype.org/content/repositories/snapshots/) -> [Help 1]
Use the new version that is available: https://oss.sonatype.org/content/repositories/snapshots/org/bdgenomics/bdg-utils/bdg-utils-metrics/
Change line 26 of pom.xml as follows:
<utils.version>0.1.3-SNAPSHOT</utils.version>
Several things to do:
genotypeQuality --> conditional genotype probability
nonReferenceLikelihoods
strandBiasComponents