General utility code used across BDG products. Apache 2 licensed.
bdg-utils is released under an Apache 2.0 license.
This is the accompanying issue to bigdatagenomics/adam#603.
ADAM contains useful code for capturing metrics from jobs run in Spark. A lot of this code is general purpose, and would be useful to a broader community.
CCing @nfergu, as a heads up
Cross linking to bigdatagenomics/adam#1334.
See discussion in bigdatagenomics/adam#690.
We depend on the AWS SDK for the Serializable AWS credentials. I believe that this is dead code.
I see that the poms have utils-parent-spark2_2.10 as the artifactId. Are we moving upstream projects to spark2 or should this default to utils-parent_2.10 with the option to change to Spark 2 with move_to_spark_2.sh?
See #78 (comment)
ported from / syncs with ADAM#662.
If you do:
val irdd = rdd.instrument
assert(irdd.partitioner.isDefined)
Your partitioner will not be defined, even if a partitioner was defined in the original RDD.
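A Spark-free sketch of the likely mechanism (hypothetical types, not the actual InstrumentedRDD code): a wrapper that does not explicitly override partitioner falls back to the base class default of None, even though the wrapped RDD had one defined.

```scala
// Hypothetical stand-ins for RDD and InstrumentedRDD, to illustrate the bug.
trait FakeRDD { def partitioner: Option[String] = None }

class ParentRDD extends FakeRDD {
  override def partitioner: Option[String] = Some("hash")
}

// Forgets to override partitioner: inherits the default None.
class BuggyWrapper(parent: FakeRDD) extends FakeRDD

// Delegates to the wrapped RDD, preserving the partitioner.
class FixedWrapper(parent: FakeRDD) extends FakeRDD {
  override def partitioner: Option[String] = parent.partitioner
}
```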
$ mvn clean install
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model
for org.bdgenomics.utils:utils-serialization_2.10:jar:0.2.3-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)'
must be unique: org.apache.spark:spark-core_2.10:jar -> duplicate declaration
of version (?) @ org.bdgenomics.utils:utils-serialization_2.10:[unknown-version],
./utils-serialization/pom.xml, line 143, column 17
There's a large amount of Parquet helper code in ADAM, some of which is non-ADAM specific. We should migrate that code out.
We've added Jaccard similarity estimation via MinHashing to PacMin. This code is generic, so we should migrate it into this repository.
I think we should revert this. Here are the lines in question: the thread-unsafe variable declaration, and its use in the get method. Once I took this out, I no longer got the failures.
The coverage-regions PR in adam includes a class, PairingRDD, that should live over here instead.
Any chance we can get a spark 1.5 version published on maven central? I changed the pom to 1.5.2 and all seemed well...
Thanks,
J
We need to port bigdatagenomics/adam#557 over.
IntervalRDD from bigdatagenomics/mango should be added to utils
I am trying to initialize metrics at project start using Metrics.initialize(sc). However, I cannot access it in other packages or files, and reinitializing results in the loss of past metrics. How can I initialize metrics once such that I can access them everywhere?
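One conventional workaround, sketched under assumptions (AppMetrics is a hypothetical name; the thunk would wrap the Metrics.initialize(sc) call from the question): a guard object that makes initialization idempotent, so any package can call it safely without resetting past metrics.

```scala
// Hypothetical guard object: the first call runs the initializer (e.g. a
// closure around Metrics.initialize(sc)); later calls are no-ops, so past
// metrics are not lost by accidental re-initialization.
object AppMetrics {
  @volatile private var initialized = false
  def initOnce(initialize: () => Unit): Unit = synchronized {
    if (!initialized) {
      initialize()
      initialized = true
    }
  }
}
```

Any file can then call AppMetrics.initOnce(() => Metrics.initialize(sc)) without risking a reset.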
Not sure if metrics were made for this, but they throw an exception whenever code within a new timer block starts running while code within an existing timer block is still running.
This issue is specific to mango, where the server handles different http requests, and each http request handler is wrapped in a timer. Due to requests coming in at different times, a timer block will be triggered while another timer block is already being run.
Ideally we'd like to be able to instrument these asynchronous requests.
The exception and stack trace are shown below:
2016-03-01 15:37:21 WARN ServletHandler:590 - Error for /reads/chr
java.lang.AssertionError: assertion failed: Timer name from on top of stack [/GET variants(3,false)] did not match passed-in timer name [GET alignment]
at scala.Predef$.assert(Predef.scala:179)
at org.bdgenomics.utils.instrumentation.MetricsRecorder.finishPhase(MetricsRecorder.scala:55)
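A self-contained sketch of why interleaving trips the assertion (StackRecorder is a hypothetical simplification, not the actual MetricsRecorder): finishing a phase checks that its name is on top of a stack, which cannot hold when timer blocks from concurrent requests overlap.

```scala
import scala.collection.mutable

// Hypothetical stack-based recorder: finishPhase asserts that the name on
// top of the stack matches, which fails when timers from concurrent
// requests interleave.
class StackRecorder {
  private val stack = mutable.Stack[String]()
  def startPhase(name: String): Unit = stack.push(name)
  def finishPhase(name: String): Unit = {
    val top = stack.pop()
    assert(top == name,
      s"Timer name from on top of stack [$top] did not match passed-in timer name [$name]")
  }
}

val recorder = new StackRecorder
recorder.startPhase("GET variants")   // request A's timer starts
recorder.startPhase("GET alignment")  // request B's timer starts before A ends
// Finishing A now fails: "GET alignment" is on top of the stack, not
// "GET variants".
```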
Calling concreteIntervalArray.insert() or concreteIntervalArray.filter() will return IntervalArray(). For the IntervalArray replace() function to work correctly with ConcreteIntervalArray, replace() should not return IntervalArray[K,T], but rather the derived type.
If there is a way around this, let me know. Otherwise I can submit a PR for this fix.
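A sketch of the proposed fix under simplified, hypothetical signatures: giving the trait a self-recursive type parameter (the standard F-bounded polymorphism pattern) lets insert/filter/replace return the concrete subtype rather than the base type.

```scala
// Hypothetical, simplified analogue of IntervalArray: Repr is the concrete
// subtype, so methods implemented via replace() keep the derived type.
trait IntervalArrayLike[T, Repr <: IntervalArrayLike[T, Repr]] {
  def elems: Seq[T]
  protected def replace(newElems: Seq[T]): Repr
  def filter(pred: T => Boolean): Repr = replace(elems.filter(pred))
  def insert(e: T): Repr = replace(elems :+ e)
}

case class ConcreteIntervalArray[T](elems: Seq[T])
    extends IntervalArrayLike[T, ConcreteIntervalArray[T]] {
  protected def replace(newElems: Seq[T]): ConcreteIntervalArray[T] =
    ConcreteIntervalArray(newElems)
}

// insert and filter now return ConcreteIntervalArray, not the base type.
val arr: ConcreteIntervalArray[Int] = ConcreteIntervalArray(Seq(1, 2, 3)).insert(4)
```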
The release tag name format
<tagNameFormat>utils-parent-${project.version}_2.10</tagNameFormat>
should be
<tagNameFormat>utils-parent_2.10-${project.version}</tagNameFormat>
to more closely match the artifact name, see
http://search.maven.org/#artifactdetails|org.bdgenomics.utils|utils-parent_2.10|0.2.2|pom
The SparkFunSuite in ADAM is general purpose code, and should be migrated out of ADAM.
I need a way to determine the closest non-overlapping region in case there is no overlapping region.
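A minimal sketch of the requested fallback (hypothetical Region type and a simple linear scan; a production version would binary-search a sorted array): when nothing overlaps the query, return the region at the smallest gap distance.

```scala
// Hypothetical half-open region [start, end); distance is 0 on overlap,
// otherwise the size of the gap between the two regions.
case class Region(start: Long, end: Long) {
  def overlaps(o: Region): Boolean = start < o.end && o.start < end
  def distance(o: Region): Long =
    if (overlaps(o)) 0L else math.max(o.start - end, start - o.end)
}

// Returns an overlapping region if one exists (distance 0), otherwise the
// closest non-overlapping region.
def closest(query: Region, regions: Seq[Region]): Option[Region] =
  if (regions.isEmpty) None
  else Some(regions.minBy(_.distance(query)))
```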
If the RDD is not sorted, IntervalPartition does not force sorting of ConcreteIntervalArray. RDDs should not have to be sorted for IntervalRDD to work (one may only want sorted partitions, not an entirely sorted RDD).
When training the Poisson mixture model, if you have a very significant outlier, this can cause all mixture distributions to have log likelihood of 0, which causes a NaN value to get aggregated. This in turn causes the mixture fit to fail.
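One common way to keep such a fit numerically stable (a sketch with hypothetical names, not the library's actual fitter) is to evaluate Poisson likelihoods in log space and normalize responsibilities with log-sum-exp, so an extreme outlier underflows gracefully instead of producing 0/0 = NaN:

```scala
// log P(k) = k*log(lambda) - lambda - log(k!), computed directly in log
// space so large k never underflows the likelihood to exactly 0.
def poissonLogPmf(k: Int, lambda: Double): Double = {
  val logFactorial = (1 to k).map(i => math.log(i.toDouble)).sum
  k * math.log(lambda) - lambda - logFactorial
}

// Per-component responsibilities via log-sum-exp: subtracting the max log
// term before exponentiating keeps the normalizer finite and nonzero.
def responsibilities(k: Int, lambdas: Seq[Double], logWeights: Seq[Double]): Seq[Double] = {
  val logs = lambdas.zip(logWeights).map { case (l, w) => w + poissonLogPmf(k, l) }
  val maxLog = logs.max
  val logNorm = maxLog + math.log(logs.map(l => math.exp(l - maxLog)).sum)
  logs.map(l => math.exp(l - logNorm))
}
```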
Not clear exactly how this happens, but if you call .cache on an InstrumentedRDD, the RDD doesn't get cached.
Version 0.99.2 of the Scoverage plugin was added in commit https://github.com/bigdatagenomics/utils/blob/26eb8bd14ff548f7212317ae83f9d42070ee26de on 24 Sep 2014. Since version 1.0.0 the plugin has been completely rewritten (I'm the author) and it should be used differently. Instead of creating a special profile and adding magic parameters to scala-maven-plugin, you just have to run mvn scoverage:report (or any other plugin goal; there are many of them - read the usage section). And the most important thing - remove scalac-scoverage-plugin from dependencies. BTW, the latest version of the plugin is 1.3.0.
SparkFunSuite is more broadly applicable outside this repo, I think; I'm duplicating logic from it now in pageant and Guacamole already has a dupe of it.
Would publishing a test-specific artifact for it make sense, or putting it in its own repo, a la holdenk/spark-testing-base?
You have a dependency on Apache Spark. You can register your package at
http://spark-packages.org/register.
It's "[a] community index of packages for Apache Spark."
Currently, our MinHashing scheme falls back to a LSH scheme for approximate MinHashing. This provides a reduction in data replication from n to b (where n is the number of elements and b is the number of buckets). However, more efficient approximate LSH schemes can achieve a further reduction. We should add a method like multiprobing:
Lv, Qin, et al. "Multi-probe LSH: efficient indexing for high-dimensional similarity search." Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, 2007.
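For context, the banded scheme the issue describes can be sketched as follows (hypothetical, simplified helpers; multi-probing would additionally query neighboring buckets of each band rather than only the exact one):

```scala
import scala.util.Random

// Simplified MinHash + banding sketch. Each element is hashed into one
// bucket per band, so replication grows with the number of bands b rather
// than the number of elements n.
def minHashSignature(elems: Set[Int], seeds: Seq[Int]): Seq[Int] =
  seeds.map { seed =>
    val rng = new Random(seed)
    val a = rng.nextInt(Int.MaxValue) | 1           // odd multiplier
    val c = rng.nextInt(Int.MaxValue)
    elems.map(x => (a * x + c) & Int.MaxValue).min  // assumes non-empty set
  }

// Split the signature into bands; each band hashes to one bucket key.
def bandBuckets(signature: Seq[Int], bands: Int): Seq[Int] = {
  val rowsPerBand = signature.length / bands
  signature.grouped(rowsPerBand).map(_.hashCode).toSeq
}
```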
If the code being run inside the run function of a BDGCommand throws an exception, this causes the metrics to not print out, which can be a bit frustrating!
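A possible workaround, sketched with hypothetical names rather than the real BDGCommand API: wrap the command body in try/finally so metrics are emitted even when run() throws.

```scala
// Hypothetical wrapper: the finally block invokes whatever call currently
// prints the metrics, whether the body succeeds or throws.
def runWithMetrics[T](printMetrics: () => Unit)(body: => T): T =
  try {
    body
  } finally {
    printMetrics()
  }
```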
We have a test that fails due to a temp output directory existing. We should fix this a la bigdatagenomics/adam@ed882c0#diff-af88c24e738248e4fdec0c4e4ab9d2dd.
The latest version of SparkFunSuite (PR #404 in adam) includes a method, testFile, that should be pulled into the SparkFunSuite class here.
The bdg-utils-misc tests jar includes e.g. SparkFunSuite, but there are no sources for that available in Maven Central, afaict.
Writing things like
val writer = new PrintWriter(new OutputStreamWriter(System.out))
is very cumbersome, and it often fails when you run it in SparkNotebook. Why not just return a Seq[String] that can be printed with mkString("\n") if needed?
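A small illustration of the suggestion (render is a hypothetical name): returning the lines lets the caller decide whether to print, log, or display them in a notebook.

```scala
// Hypothetical alternative: produce the lines, let the caller print them.
def render(lines: Seq[String]): String = lines.mkString("\n")

val table = Seq("name      count", "reads      1234", "variants     56")
println(render(table))  // a plain println works fine in a notebook
```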
Without pointing bdg-utils at Spark 1.2.0, I was getting odd errors using the instrumentation code (on Spark 1.2.0, that is).
Needed for bigdatagenomics/adam#1093.
This trait is very useful for testing and could be re-used in other projects as well.
Being in the tests folder means it is not exported to the Maven jar and is therefore not available to other projects.
Possible solutions:
We need to fit a mixture of Poisson distributions for bigdatagenomics/adam#401.
I ran this code (https://github.com/FusionWorks/jbrowse-adam) from SBT:
scala> AdamConverter.vcfToADAM("file:///home/leonis/jbrowse-adam-alldata/biodtfs/dbsnp_b37_20.vcf", "file:///home/leonis/dbsnp_b37_20.vcf.adam")
and received this error:
java.lang.NoClassDefFoundError: org/bdgenomics/utils/misc/HadoopUtil$
at org.bdgenomics.adam.rdd.ADAMContext.loadVcf(ADAMContext.scala:584)
at md.fusionworks.adam.jbrowse.tools.AdamConverter$.vcfToADAM(AdamConverter.scala:27)
... 43 elided
Caused by: java.lang.ClassNotFoundException: org.bdgenomics.utils.misc.HadoopUtil$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 45 more
Problem - the bdg-utils-misc library is broken (empty, containing no class files): http://mvnrepository.com/artifact/org.bdgenomics.bdg-utils/bdg-utils-misc
For now I work around it like this:
compile bigdatagenomics/utils and manually add utils-misc_2.11-0.2.5-SNAPSHOT.jar to lib/
Please fix the bdg-utils-misc libraries in the Maven repository, or publish a new version.
There are issues in IntervalRDD, specifically in the creation of the IntervalTree. These issues could be more easily resolved if we moved the IntervalTree and IntervalRDD from Interval to ReferenceRegion. This allows us to internally work with the referenceName. Are there any objections to this?
See comment #106 (comment)
Lost a couple of hours on Friday not noticing that trying to move to Spark 2.x from Scala 2.11 doesn't actually accomplish anything:
find . -name "pom.xml" -exec sed -e "/utils-/ s/_2.10/-spark2_2.10/g"
https://github.com/bigdatagenomics/utils/blob/master/scripts/move_to_spark_2.sh#L5
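One possible fix, sketched here as an assumption rather than a patch that has landed: make the substitution accept either Scala suffix, so running the script against 2.11 poms also works.

```shell
# Hypothetical fix: capture the last digit so both _2.10 and _2.11 are rewritten.
echo '<artifactId>utils-misc_2.11</artifactId>' \
  | sed -e '/utils-/ s/_2\.1\([01]\)/-spark2_2.1\1/g'
# -> <artifactId>utils-misc-spark2_2.11</artifactId>
```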
Utils mirror of bigdatagenomics/adam#1225.
As Spark now supports Scala 2.11, it would be great to see support for 2.11 in ADAM; since ADAM depends on bdg-utils, bdg-utils should cross-compile to 2.10/2.11.