Introduction

bdg-utils

General utility code used across BDG products. Apache 2 licensed.

License

bdg-utils is released under an Apache 2.0 license.

People

Contributors

akmorrow13, dependabot[bot], devin-petersohn, fnothaft, georgehe4, hannes-ucsc, heuermh, huitseeker, jpdna, massie, nfergu, ryan-williams, tdanford, tomwhite

Issues

Pull over metrics code from ADAM

ADAM contains useful code for capturing metrics from jobs run in Spark. A lot of this code is general purpose, and would be useful to a broader community.

CCing @nfergu, as a heads up

changing default artifactId: utils-parent-spark2_2.10

I see that the poms have utils-parent-spark2_2.10 as the artifactId. Are we moving upstream projects to spark2 or should this default to utils-parent_2.10 with the option to change to Spark 2 with move_to_spark_2.sh?

InstrumentedRDD hides partitioner

If you do:

val irdd = rdd.instrument
assert(irdd.partitioner.isDefined)

Your partitioner will not be defined, even though a partitioner was defined on the original RDD.
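The likely cause is that the instrumented wrapper does not forward partitioner to the RDD it wraps. A minimal sketch of the delegation pattern, using stand-in classes rather than the real Spark/bdg-utils types:

```scala
// Hypothetical minimal model of the bug (not the actual bdg-utils classes):
// a wrapper that does not forward `partitioner` hides it from callers.
case class Partitioner(numPartitions: Int)

class BaseRDD(val partitioner: Option[Partitioner])

// Broken wrapper: keeps the default (None) instead of delegating.
class InstrumentedBroken(underlying: BaseRDD) extends BaseRDD(None)

// Fixed wrapper: forwards the underlying RDD's partitioner.
class InstrumentedFixed(underlying: BaseRDD)
  extends BaseRDD(underlying.partitioner)
```

The fix in the real code would be to override partitioner on the instrumented class and delegate to the decorated RDD.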

Duplicate declaration of spark-core dependency in utils-serialization

$ mvn clean install
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model
for org.bdgenomics.utils:utils-serialization_2.10:jar:0.2.3-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)'
must be unique: org.apache.spark:spark-core_2.10:jar -> duplicate declaration
of version (?) @ org.bdgenomics.utils:utils-serialization_2.10:[unknown-version],
./utils-serialization/pom.xml, line 143, column 17

publish a spark 1.5 version

Any chance we can get a spark 1.5 version published on maven central? I changed the pom to 1.5.2 and all seemed well...

Thanks,

J

Initializing Metrics across project

I am trying to initialize metrics at project start using Metrics.initialize(sc). However, I cannot access it from other packages or files, and re-initializing results in the loss of past metrics. How can I initialize metrics once such that I can access them everywhere?
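One common pattern is to keep the initialized state in a shared singleton object so it is initialized exactly once and is visible from any package. A hedged sketch, with String standing in for the real SparkContext/metrics state (this is not the bdg-utils Metrics API):

```scala
// Hypothetical holder object: initialize once, read from anywhere.
object GlobalMetrics {
  @volatile private var state: Option[String] = None

  def initialize(ctx: String): Unit = synchronized {
    if (state.isEmpty) state = Some(ctx) // later calls are no-ops
  }

  def get: String =
    state.getOrElse(sys.error("GlobalMetrics.initialize was never called"))
}
```

Because later initialize calls are no-ops, accidental re-initialization from another package no longer discards past metrics.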

Asynchronous Requests Throw Exceptions

Not sure if the metrics were designed for this, but they throw an exception whenever code inside a new timer block starts running while code inside an existing timer block is still running.

This issue is specific to mango, where the server handles different http requests, and each http request handler is wrapped in a timer. Due to requests coming in at different times, a timer block will be triggered while another timer block is already being run.

Ideally we'd like to be able to instrument these asynchronous requests.

The exception and stack trace is shown below:

2016-03-01 15:37:21 WARN  ServletHandler:590 - Error for /reads/chr
java.lang.AssertionError: assertion failed: Timer name from on top of stack [/GET variants(3,false)] did not match passed-in timer name [GET alignment]
    at scala.Predef$.assert(Predef.scala:179)
    at org.bdgenomics.utils.instrumentation.MetricsRecorder.finishPhase(MetricsRecorder.scala:55)
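One hedged way to support interleaved requests is to scope the timer stack per request rather than sharing a single stack, for example via scala.util.DynamicVariable (which is backed by an inheritable thread-local). The names below are illustrative, not the actual MetricsRecorder API:

```scala
import scala.util.DynamicVariable

// Sketch: each request runs with its own timer stack, so interleaved
// handlers cannot see each other's frames (hypothetical names).
class TimerStack {
  private var frames = List.empty[String]
  def start(name: String): Unit = frames = name :: frames
  def finish(name: String): Unit = {
    assert(frames.headOption.contains(name),
      s"Timer on top of stack ${frames.headOption} did not match $name")
    frames = frames.tail
  }
}

val currentTimers = new DynamicVariable(new TimerStack)

// Wrap each HTTP handler so it gets a fresh, isolated stack.
def withRequestScope[A](body: => A): A =
  currentTimers.withValue(new TimerStack)(body)
```

With this shape, a "GET alignment" timer started inside one request can never collide with a "/GET variants" timer belonging to another.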

IntervalArray should extend concrete type

Calling concreteIntervalArray.insert() or concreteIntervalArray.filter() returns an IntervalArray. For the IntervalArray replace() function to work correctly with ConcreteIntervalArray, replace() should return the derived type rather than IntervalArray[K, T].
If there is a way around this, let me know. Otherwise I can submit a PR for this fix.
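One standard way to make methods return the derived type is a self-recursive ("F-bounded") type parameter. A sketch under that assumption; the names are hypothetical, not the actual bdg-utils API:

```scala
// Repr is the concrete subtype, so replace() returns it statically.
trait IntervalArrayLike[T, Repr <: IntervalArrayLike[T, Repr]] {
  def replace(elems: Seq[T]): Repr
}

final case class ConcreteArray[T](elems: Seq[T])
    extends IntervalArrayLike[T, ConcreteArray[T]] {
  // Returns ConcreteArray[T], not the base trait.
  def replace(newElems: Seq[T]): ConcreteArray[T] = copy(elems = newElems)
}
```

The same bound would apply to insert() and filter(), so chained calls keep the concrete type.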

IntervalRDD fails when underlying RDD is not sorted

If the RDD is not sorted, IntervalPartition does not force sorting of the ConcreteIntervalArray. RDDs should not have to be sorted for IntervalRDD to work (one may want only sorted partitions, not an entirely sorted RDD).
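The per-partition fix could be as simple as sorting each partition's elements before building the interval structure, so callers never need to pre-sort the whole RDD. A sketch with stand-in types (not the actual IntervalPartition code):

```scala
// Hypothetical interval type; sort by (start, end) within one partition.
case class Interval(start: Long, end: Long)

def buildSortedPartition(elems: Seq[Interval]): Seq[Interval] =
  elems.sortBy(i => (i.start, i.end))
```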

Poisson model fit fails if you have a significant outlier

When training the Poisson mixture model, if you have a very significant outlier, this can cause all mixture distributions to have log likelihood of 0, which causes a NaN value to get aggregated. This in turn causes the mixture fit to fail.
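One defensive option is to floor non-finite log-likelihood contributions at a very negative constant before aggregating, so a single extreme outlier cannot turn the aggregate into NaN. This is an illustrative sketch, not the bdg-utils fitting code:

```scala
// Replace NaN/Infinity contributions with a large negative floor so the
// sum stays finite and the mixture fit can proceed.
def safeSumLogLik(logLiks: Seq[Double], floor: Double = -1e100): Double =
  logLiks.map(ll => if (java.lang.Double.isFinite(ll)) ll else floor).sum
```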

Scoverage plugin not used properly

Version 0.99.2 of Scoverage plugin was added in commit https://github.com/bigdatagenomics/utils/blob/26eb8bd14ff548f7212317ae83f9d42070ee26de on 24 Sep 2014.

Since version 1.0.0 the plugin has been completely rewritten (I'm the author) and it should now be used differently.

Instead of creating a special profile and adding magic parameters to scala-maven-plugin, you just run mvn scoverage:report (or any other plugin goal; there are many of them, see the usage section).

And the most important thing: remove scalac-scoverage-plugin from the dependencies.

BTW, the latest version of the plugin is 1.3.0.

Add approximate LSH approach for MinHash

Currently, our MinHashing scheme falls back to an LSH scheme for approximate MinHashing. This provides a reduction in data replication from n to b (where n is the number of elements and b is the number of buckets). However, more efficient approximate LSH schemes can achieve a further reduction. We should add a method like multiprobing:

Lv, Qin, et al. "Multi-probe LSH: efficient indexing for high-dimensional similarity search." Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, 2007.

easier way to get pretty printing of Metrics

Writing things like

val writer = new PrintWriter(new OutputStreamWriter(System.out))

is very cumbersome, and it often fails when run in SparkNotebook. Why not just return a Seq[String] that can be printed with mkString("\n") if needed?
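A sketch of the requested API shape (a hypothetical helper, not the existing Metrics interface): render the table to lines and let the caller decide how to print them.

```scala
// Return rendered metric rows as lines instead of writing to a
// PrintWriter; the caller can println(lines.mkString("\n")).
def renderLines(timings: Seq[(String, Long)]): Seq[String] =
  timings.map { case (name, millis) => f"$name%-30s $millis%8d ms" }
```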

Upgrade to Spark 1.2.0

Without pointing bdg-utils at Spark 1.2.0, I was getting odd errors using the instrumentation code (on Spark 1.2.0, that is).

Make SparkFunSuite public so it can be re-used in other projects

This trait is very useful for testing and could be re-used in other projects as well.
Because it lives in the tests folder, it is not exported to the Maven jar and is therefore not available to other projects.

Possible solutions:

  • Move it to src of utils-misc
  • Move to another package called utils-testing
  • Create and export a jar for the tests code in utils-misc

java.lang.NoClassDefFoundError: org/bdgenomics/utils/misc/HadoopUtil

I ran this code (https://github.com/FusionWorks/jbrowse-adam) via SBT:

scala> AdamConverter.vcfToADAM("file:///home/leonis/jbrowse-adam-alldata/biodtfs/dbsnp_b37_20.vcf", "file:///home/leonis/dbsnp_b37_20.vcf.adam")

and received this error:

java.lang.NoClassDefFoundError: org/bdgenomics/utils/misc/HadoopUtil$
at org.bdgenomics.adam.rdd.ADAMContext.loadVcf(ADAMContext.scala:584)
at md.fusionworks.adam.jbrowse.tools.AdamConverter$.vcfToADAM(AdamConverter.scala:27)
... 43 elided
Caused by: java.lang.ClassNotFoundException: org.bdgenomics.utils.misc.HadoopUtil$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 45 more

The problem: the published bdg-utils-misc library is broken (it is empty and contains no class files): http://mvnrepository.com/artifact/org.bdgenomics.bdg-utils/bdg-utils-misc

For now I worked around this by compiling bigdatagenomics/utils and manually adding utils-misc_2.11-0.2.5-SNAPSHOT.jar to lib/.

Please fix the bdg-utils-misc libraries in the Maven repository, or publish a new version.

Moving IntervalRDD from Interval to ReferenceRegion

There are issues in IntervalRDD, specifically in the creation of the IntervalTree. These issues could be more easily resolved if we moved the IntervalTree and IntervalRDD from Interval to ReferenceRegion. This allows us to internally work with the referenceName. Are there any objections to this?

cross-compile to 2.11

As Spark now supports Scala 2.11, it would be great to see support for 2.11 in ADAM; since ADAM depends on bdg-utils, bdg-utils should cross-compile to 2.10/2.11.
