General utility code used across BDG products. Apache 2 licensed.
bdg-utils is released under an Apache 2.0 license.
This is the accompanying issue to bigdatagenomics/adam#603.
ADAM contains useful code for capturing metrics from jobs run in Spark. A lot of this code is general purpose, and would be useful to a broader community.
CCing @nfergu, as a heads up
Cross linking to bigdatagenomics/adam#1334.
See discussion in bigdatagenomics/adam#690.
We depend on the AWS SDK for the Serializable AWS credentials. I believe that this is dead code.
I see that the poms have utils-parent-spark2_2.10 as the artifactId. Are we moving upstream projects to spark2 or should this default to utils-parent_2.10 with the option to change to Spark 2 with move_to_spark_2.sh?
See #78 (comment)
ported from / syncs with ADAM#662.
If you do:
val irdd = rdd.instrument
assert(irdd.partitioner.isDefined)
Your partitioner will not be defined, even if a partitioner was defined in the original RDD.
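A Spark-free sketch of the likely mechanism (hypothetical types, not the actual InstrumentedRDD code): a wrapper that does not explicitly override partitioner falls back to the base class default of None, even though the wrapped RDD had one defined.

```scala
// Hypothetical stand-ins for RDD and InstrumentedRDD, to illustrate the bug.
trait FakeRDD { def partitioner: Option[String] = None }

class ParentRDD extends FakeRDD {
  override def partitioner: Option[String] = Some("hash")
}

// Forgets to override partitioner: inherits the default None.
class BuggyWrapper(parent: FakeRDD) extends FakeRDD

// Delegates to the wrapped RDD, preserving the partitioner.
class FixedWrapper(parent: FakeRDD) extends FakeRDD {
  override def partitioner: Option[String] = parent.partitioner
}
```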
$ mvn clean install
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model
for org.bdgenomics.utils:utils-serialization_2.10:jar:0.2.3-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)'
must be unique: org.apache.spark:spark-core_2.10:jar -> duplicate declaration
of version (?) @ org.bdgenomics.utils:utils-serialization_2.10:[unknown-version],
./utils-serialization/pom.xml, line 143, column 17
There's a large amount of Parquet helper code in ADAM, some of which is non-ADAM specific. We should migrate that code out.
We've added Jaccard similarity estimation via MinHashing to PacMin. This code is generic, so we should migrate it into this repository.
I think we should revert this. Here are the lines in question: the thread-unsafe variable declaration, and its use in the get method. Once I took this out, I no longer got the failures.
The coverage-regions PR in adam includes a class, PairingRDD, that should live over here instead.
Any chance we can get a spark 1.5 version published on maven central? I changed the pom to 1.5.2 and all seemed well...
Thanks,
J
We need to port bigdatagenomics/adam#557 over.
IntervalRDD from bigdatagenomics/mango should be added to utils
I am trying to initialize metrics at project start using Metrics.initialize(sc). However, I cannot access it in other packages or files, and reinitializing results in the loss of past metrics. How can I initialize metrics once such that I can access them everywhere?
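One conventional workaround, sketched under assumptions (AppMetrics is a hypothetical name; the thunk would wrap the Metrics.initialize(sc) call from the question): a guard object that makes initialization idempotent, so any package can call it safely without resetting past metrics.

```scala
// Hypothetical guard object: the first call runs the initializer (e.g. a
// closure around Metrics.initialize(sc)); later calls are no-ops, so past
// metrics are not lost by accidental re-initialization.
object AppMetrics {
  @volatile private var initialized = false
  def initOnce(initialize: () => Unit): Unit = synchronized {
    if (!initialized) {
      initialize()
      initialized = true
    }
  }
}
```

Any file can then call AppMetrics.initOnce(() => Metrics.initialize(sc)) without risking a reset.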
Not sure if metrics were made for this, but they throw an exception whenever code within a new timer block starts running while code within an existing timer block is still running.
This issue is specific to mango, where the server handles different http requests, and each http request handler is wrapped in a timer. Due to requests coming in at different times, a timer block will be triggered while another timer block is already being run.
Ideally we'd like to be able to instrument these asynchronous requests.
The exception and stack trace are shown below:
2016-03-01 15:37:21 WARN ServletHandler:590 - Error for /reads/chr
java.lang.AssertionError: assertion failed: Timer name from on top of stack [/GET variants(3,false)] did not match passed-in timer name [GET alignment]
at scala.Predef$.assert(Predef.scala:179)
at org.bdgenomics.utils.instrumentation.MetricsRecorder.finishPhase(MetricsRecorder.scala:55)
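A self-contained sketch of why interleaving trips the assertion (StackRecorder is a hypothetical simplification, not the actual MetricsRecorder): finishing a phase checks that its name is on top of a stack, which cannot hold when timer blocks from concurrent requests overlap.

```scala
import scala.collection.mutable

// Hypothetical stack-based recorder: finishPhase asserts that the name on
// top of the stack matches, which fails when timers from concurrent
// requests interleave.
class StackRecorder {
  private val stack = mutable.Stack[String]()
  def startPhase(name: String): Unit = stack.push(name)
  def finishPhase(name: String): Unit = {
    val top = stack.pop()
    assert(top == name,
      s"Timer name from on top of stack [$top] did not match passed-in timer name [$name]")
  }
}

val recorder = new StackRecorder
recorder.startPhase("GET variants")   // request A's timer starts
recorder.startPhase("GET alignment")  // request B's timer starts before A ends
// Finishing A now fails: "GET alignment" is on top of the stack, not
// "GET variants".
```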
Calling concreteIntervalArray.insert() or concreteIntervalArray.filter() will return IntervalArray(). For the IntervalArray replace() function to work correctly with ConcreteIntervalArray, replace() should not return IntervalArray[K,T], but rather the derived type.
If there is a way around this, let me know. Otherwise I can submit a PR for this fix.
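A sketch of the proposed fix under simplified, hypothetical signatures: giving the trait a self-recursive type parameter (the standard F-bounded polymorphism pattern) lets insert/filter/replace return the concrete subtype rather than the base type.

```scala
// Hypothetical, simplified analogue of IntervalArray: Repr is the concrete
// subtype, so methods implemented via replace() keep the derived type.
trait IntervalArrayLike[T, Repr <: IntervalArrayLike[T, Repr]] {
  def elems: Seq[T]
  protected def replace(newElems: Seq[T]): Repr
  def filter(pred: T => Boolean): Repr = replace(elems.filter(pred))
  def insert(e: T): Repr = replace(elems :+ e)
}

case class ConcreteIntervalArray[T](elems: Seq[T])
    extends IntervalArrayLike[T, ConcreteIntervalArray[T]] {
  protected def replace(newElems: Seq[T]): ConcreteIntervalArray[T] =
    ConcreteIntervalArray(newElems)
}

// insert and filter now return ConcreteIntervalArray, not the base type.
val arr: ConcreteIntervalArray[Int] = ConcreteIntervalArray(Seq(1, 2, 3)).insert(4)
```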
The release tag name format
<tagNameFormat>utils-parent-${project.version}_2.10</tagNameFormat>
should be
<tagNameFormat>utils-parent_2.10-${project.version}</tagNameFormat>
to more closely match the artifact name, see
http://search.maven.org/#artifactdetails|org.bdgenomics.utils|utils-parent_2.10|0.2.2|pom
The SparkFunSuite in ADAM is general purpose code, and should be migrated out of ADAM.
I need a way to determine the closest non-overlapping region in case there is no overlapping region.
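A minimal sketch of the requested fallback (hypothetical Region type and a simple linear scan; a production version would binary-search a sorted array): when nothing overlaps the query, return the region at the smallest gap distance.

```scala
// Hypothetical half-open region [start, end); distance is 0 on overlap,
// otherwise the size of the gap between the two regions.
case class Region(start: Long, end: Long) {
  def overlaps(o: Region): Boolean = start < o.end && o.start < end
  def distance(o: Region): Long =
    if (overlaps(o)) 0L else math.max(o.start - end, start - o.end)
}

// Returns an overlapping region if one exists (distance 0), otherwise the
// closest non-overlapping region.
def closest(query: Region, regions: Seq[Region]): Option[Region] =
  if (regions.isEmpty) None
  else Some(regions.minBy(_.distance(query)))
```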
If the RDD is not sorted, IntervalPartition does not force sorting of ConcreteIntervalArray. RDDs should not have to be sorted for IntervalRDD to work (one may only want sorted partitions, not an entirely sorted RDD).
When training the Poisson mixture model, if you have a very significant outlier, this can cause all mixture distributions to have log likelihood of 0, which causes a NaN value to get aggregated. This in turn causes the mixture fit to fail.
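One common way to keep such a fit numerically stable (a sketch with hypothetical names, not the library's actual fitter) is to evaluate Poisson likelihoods in log space and normalize responsibilities with log-sum-exp, so an extreme outlier underflows gracefully instead of producing 0/0 = NaN:

```scala
// log P(k) = k*log(lambda) - lambda - log(k!), computed directly in log
// space so large k never underflows the likelihood to exactly 0.
def poissonLogPmf(k: Int, lambda: Double): Double = {
  val logFactorial = (1 to k).map(i => math.log(i.toDouble)).sum
  k * math.log(lambda) - lambda - logFactorial
}

// Per-component responsibilities via log-sum-exp: subtracting the max log
// term before exponentiating keeps the normalizer finite and nonzero.
def responsibilities(k: Int, lambdas: Seq[Double], logWeights: Seq[Double]): Seq[Double] = {
  val logs = lambdas.zip(logWeights).map { case (l, w) => w + poissonLogPmf(k, l) }
  val maxLog = logs.max
  val logNorm = maxLog + math.log(logs.map(l => math.exp(l - maxLog)).sum)
  logs.map(l => math.exp(l - logNorm))
}
```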
Not clear exactly how this happens, but if you call .cache on an InstrumentedRDD, the RDD doesn't get cached.
Version 0.99.2 of the Scoverage plugin was added in commit https://github.com/bigdatagenomics/utils/blob/26eb8bd14ff548f7212317ae83f9d42070ee26de on 24 Sep 2014. Since version 1.0.0 the plugin has been completely rewritten (I'm the author) and it should be used differently. Instead of creating a special profile and adding magic parameters to scala-maven-plugin, you just have to run mvn scoverage:report (or any other plugin goal; there are many of them - read the usage section). And the most important thing - remove scalac-scoverage-plugin from dependencies. BTW, the latest version of the plugin is 1.3.0.
SparkFunSuite is more broadly applicable outside this repo, I think; I'm duplicating logic from it now in pageant and Guacamole already has a dupe of it.
Would publishing a test-specific artifact for it make sense, or putting it in its own repo, a la holdenk/spark-testing-base?
You have a dependency on Apache Spark. You can register your package at
http://spark-packages.org/register.
It's "[a] community index of packages for Apache Spark."
Currently, our MinHashing scheme falls back to a LSH scheme for approximate MinHashing. This provides a reduction in data replication from n to b (where n is the number of elements and b is the number of buckets). However, more efficient approximate LSH schemes can achieve a further reduction. We should add a method like multiprobing:
Lv, Qin, et al. "Multi-probe LSH: efficient indexing for high-dimensional similarity search." Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, 2007.
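For context, the banded scheme the issue describes can be sketched as follows (hypothetical, simplified helpers; multi-probing would additionally query neighboring buckets of each band rather than only the exact one):

```scala
import scala.util.Random

// Simplified MinHash + banding sketch. Each element is hashed into one
// bucket per band, so replication grows with the number of bands b rather
// than the number of elements n.
def minHashSignature(elems: Set[Int], seeds: Seq[Int]): Seq[Int] =
  seeds.map { seed =>
    val rng = new Random(seed)
    val a = rng.nextInt(Int.MaxValue) | 1           // odd multiplier
    val c = rng.nextInt(Int.MaxValue)
    elems.map(x => (a * x + c) & Int.MaxValue).min  // assumes non-empty set
  }

// Split the signature into bands; each band hashes to one bucket key.
def bandBuckets(signature: Seq[Int], bands: Int): Seq[Int] = {
  val rowsPerBand = signature.length / bands
  signature.grouped(rowsPerBand).map(_.hashCode).toSeq
}
```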
If the code being run inside the run function of a BDGCommand throws an exception, this causes the metrics to not print out, which can be a bit frustrating!
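A possible workaround, sketched with hypothetical names rather than the real BDGCommand API: wrap the command body in try/finally so metrics are emitted even when run() throws.

```scala
// Hypothetical wrapper: the finally block invokes whatever call currently
// prints the metrics, whether the body succeeds or throws.
def runWithMetrics[T](printMetrics: () => Unit)(body: => T): T =
  try {
    body
  } finally {
    printMetrics()
  }
```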
We have a test that fails due to a temp output directory existing. We should fix this a la bigdatagenomics/adam@ed882c0#diff-af88c24e738248e4fdec0c4e4ab9d2dd.
The latest version of SparkFunSuite (PR #404 in adam) includes a method, testFile, that should be pulled into the SparkFunSuite class here.
The bdg-utils-misc tests jar includes e.g. SparkFunSuite, but there are no sources for that available in Maven Central, afaict.
Writing things like
val writer = new PrintWriter(new OutputStreamWriter(System.out))
is very cumbersome, and it often fails when you run it in SparkNotebook. Why not just return a Seq[String] that can be printed with mkString("\n") if needed?
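A small illustration of the suggestion (render is a hypothetical name): returning the lines lets the caller decide whether to print, log, or display them in a notebook.

```scala
// Hypothetical alternative: produce the lines, let the caller print them.
def render(lines: Seq[String]): String = lines.mkString("\n")

val table = Seq("name      count", "reads      1234", "variants     56")
println(render(table))  // a plain println works fine in a notebook
```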
Without pointing bdg-utils at Spark 1.2.0, I was getting odd errors using the instrumentation code (on Spark 1.2.0, that is).
Needed for bigdatagenomics/adam#1093.
This trait is very useful for testing and could be re-used in other projects as well.
Being in the tests folder means it is not exported to the Maven jar and is therefore not available to other projects.
Possible solutions:
We need to fit a mixture of Poisson distributions for bigdatagenomics/adam#401.
I ran this code (https://github.com/FusionWorks/jbrowse-adam) from SBT:
scala> AdamConverter.vcfToADAM("file:///home/leonis/jbrowse-adam-alldata/biodtfs/dbsnp_b37_20.vcf", "file:///home/leonis/dbsnp_b37_20.vcf.adam")
and received this error:
java.lang.NoClassDefFoundError: org/bdgenomics/utils/misc/HadoopUtil$
at org.bdgenomics.adam.rdd.ADAMContext.loadVcf(ADAMContext.scala:584)
at md.fusionworks.adam.jbrowse.tools.AdamConverter$.vcfToADAM(AdamConverter.scala:27)
... 43 elided
Caused by: java.lang.ClassNotFoundException: org.bdgenomics.utils.misc.HadoopUtil$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 45 more
Problem - the bdg-utils-misc library is broken (empty, containing no class files): http://mvnrepository.com/artifact/org.bdgenomics.bdg-utils/bdg-utils-misc
For now I work around it like this:
compile bigdatagenomics/utils and manually add utils-misc_2.11-0.2.5-SNAPSHOT.jar to lib/
Please fix the bdg-utils-misc libraries in the Maven repository, or publish a new version.
There are issues in IntervalRDD, specifically in the creation of the IntervalTree. These issues could be more easily resolved if we moved the IntervalTree and IntervalRDD from Interval to ReferenceRegion. This allows us to internally work with the referenceName. Are there any objections to this?
See comment #106 (comment)
Lost a couple of hours on Friday not noticing that trying to move to Spark 2.x from Scala 2.11 doesn't actually accomplish anything:
find . -name "pom.xml" -exec sed -e "/utils-/ s/_2.10/-spark2_2.10/g"
https://github.com/bigdatagenomics/utils/blob/master/scripts/move_to_spark_2.sh#L5
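One possible fix, sketched here as an assumption rather than a patch that has landed: make the substitution accept either Scala suffix, so running the script against 2.11 poms also works.

```shell
# Hypothetical fix: capture the last digit so both _2.10 and _2.11 are rewritten.
echo '<artifactId>utils-misc_2.11</artifactId>' \
  | sed -e '/utils-/ s/_2\.1\([01]\)/-spark2_2.1\1/g'
# -> <artifactId>utils-misc-spark2_2.11</artifactId>
```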
Utils mirror of bigdatagenomics/adam#1225.
As Spark now supports Scala 2.11, it would be great to see support for 2.11 in ADAM; since ADAM depends on bdg-utils, bdg-utils should cross-compile to 2.10/2.11.