isarn / isarn-sketches
Sketching data structures for Scala, including t-digest
License: Apache License 2.0
If I take the monoidal sum over some t-digests, the resulting distribution behavior looks a bit skewed:
scala> import org.isarnproject.sketchesAlgebirdAPI.implicits._
import org.isarnproject.sketchesAlgebirdAPI.implicits._
scala> val data = Vector.fill(100) { Vector.fill(10000) { scala.util.Random.nextGaussian() } }
data: scala.collection.immutable.Vector[scala.collection.immutable.Vector[Double]] = Vector(Vector(0.1778102040514962, ...
scala> val tdvec = data.map(TDigest.sketch(_))
tdvec: scala.collection.immutable.Vector[org.isarnproject.sketches.TDigest] = Vector(TDigest(0.5,75,TDigestMap(-4.215406387806561 -> (1.0, 1.0), ...
scala> val tdsum = com.twitter.algebird.Monoid.sum(tdvec)
tdsum: org.isarnproject.sketches.TDigest = TDigest(0.5,113,TDigestMap(-4.776867999224537 -> (1.0, 1.0), ...
scala> tdsum.cdf(0)
res5: Double = 0.46485982756854377 // should be close to 0.5
scala> tdsum.cdfInverse(0.5)
res6: Double = 0.19202097508054178 // should be close to 0
A theory (unverified) is that as the sum accumulates on the left-hand side, the cluster-mass disparity between the LHS and RHS makes the current ++ logic unstable.
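If that theory holds, one way to probe it would be a pairwise (tree-shaped) reduction, so every merge sees two digests of comparable total mass instead of one huge accumulator and one small digest. This is only an experiment sketch using the ++ merge shown above, not a proposed library change:

import org.isarnproject.sketches.TDigest

// merge digests pairwise until one remains, keeping operand masses balanced at each ++
def treeSum(tds: Vector[TDigest]): TDigest = {
  require(tds.nonEmpty, "need at least one t-digest")
  if (tds.length == 1) tds.head
  else treeSum(
    tds.grouped(2).map {
      case Vector(a, b) => a ++ b
      case Vector(a)    => a
    }.toVector
  )
}

// compare against the left-fold result above, e.g. treeSum(tdvec).cdf(0) vs tdsum.cdf(0)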
Hi Erik,
I'm using the isarn-sketches-spark package to compute t-digests on Spark and would like to persist TDigestSQL or TDigest data into a PostgreSQL database. Potentially I would also like to be able to load the TDigest data back from the database.
What would be the best way of doing that, if it's possible at all? Thanks very much!
Richard
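One possible approach, as a sketch only and not an official isarn API: since TDigest values are shipped around by Spark they should be java.io.Serializable, so they can be round-tripped through a PostgreSQL bytea column as an opaque blob. The table name, column names, and connection string below are placeholders, not anything defined by the library:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.sql.DriverManager
import org.isarnproject.sketches.TDigest

// serialize / deserialize a TDigest as an opaque byte array
def tdigestToBytes(td: TDigest): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(td)
  oos.close()
  bos.toByteArray
}

def tdigestFromBytes(bytes: Array[Byte]): TDigest = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[TDigest] finally ois.close()
}

// hypothetical table: create table sketches (id text primary key, digest bytea)
val td = TDigest.sketch(Vector.fill(1000)(scala.util.Random.nextGaussian()))
val conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")
val ins = conn.prepareStatement("insert into sketches (id, digest) values (?, ?)")
ins.setString(1, "my-metric")
ins.setBytes(2, tdigestToBytes(td))
ins.executeUpdate()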
This extension of prefix-sum trees to multiple dimensions might apply:
https://arxiv.org/pdf/1311.6093.pdf
Pushkar Mishra (2013). "A New Algorithm for Updating and Querying Sub-arrays of Multidimensional Arrays". arXiv:1311.6093 [cs.DS]. doi:10.13140/RG.2.1.2394.2485
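For background, the 1-D structure the paper generalizes is the standard Fenwick (binary indexed) tree; a minimal sketch of that base case, included only to illustrate what "prefix-sum tree" refers to here:

// minimal 1-D Fenwick (binary indexed) tree: point updates and prefix sums in O(log n)
class FenwickTree(n: Int) {
  private val tree = Array.fill(n + 1)(0.0)

  // add delta at 1-based index i
  def update(i: Int, delta: Double): Unit = {
    var j = i
    while (j <= n) { tree(j) += delta; j += j & -j }
  }

  // sum of elements at indices 1..i
  def prefixSum(i: Int): Double = {
    var j = i
    var s = 0.0
    while (j > 0) { s += tree(j); j -= j & -j }
    s
  }
}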
The main triggering scenario for re-clustering is when the distribution of the incoming data begins to move. In the "null-hypothesis" case where the distribution is stationary, the expected number of new clusters (as a function of N, the number of samples seen so far) is 2·log(N). If the actual number of clusters is growing substantially faster than that, it is most likely because new extrema are being encountered faster than they "should" be under a stationary distribution.
The idea would be to work out the details of how to turn this basic idea into a more statistically grounded test for when to trigger a re-clustering.
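As a starting point, a very rough sketch of what such a test could look like; nothing here is in the library, and the slack multiplier is an arbitrary illustrative threshold:

// trigger re-clustering when new-cluster growth outpaces the stationary expectation
def shouldRecluster(samplesSeen: Long, newClustersSeen: Long, slack: Double = 3.0): Boolean = {
  // under a stationary distribution, expect roughly 2 * ln(N) new clusters after N samples
  val expected = 2.0 * math.log(math.max(samplesSeen, 2L).toDouble)
  newClustersSeen > slack * expected
}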
Related: if the sketch needs a re-clustering due to the above scenario, that implies only the tails of the distribution need to be re-clustered; that is, a bunch of singleton clusters are accumulating at one or both ends. Figuring out a good way to re-cluster only these tails would be faster and probably more numerically stable. See for example this. Possibly there is a way to re-insert these tail clusters using some kind of discounted measure of the quantile, so the insert logic is persuaded to make larger clusters closer in to the distribution body.
Greetings, Mr. Erlandson (@erikerlandson),
The team I represent recently started using this solution and we've been trying to tune accuracy and performance via
different C (compression) and M (maxDiscrete) parameters.
In our case, we have a DataFrame called "originDf" with 3 columns ("id": String, "fruit_type": String, "count": Integer):
+---+------------+-----+
|id |fruit_type  |count|
+---+------------+-----+
|001|Apples      |2    |
|002|Apricots    |79   |
|001|Avocados    |4    |
|003|Watermelon  |13   |
|007|Blueberries |5    |
|007|Cherries    |6    |
|007|Clementine  |41   |
|007|Cucumbers   |5    |
|007|Elderberry  |3    |
|007|Eggfruit    |1    |
|008|Eggfruit    |19   |
|012|Clementine  |61   |
|013|Blueberries |21   |
|014|Blueberries |4    |
|...|Lime        |4    |
|...|Rambutan    |3    |
|...|Strawberries|6    |
|...|Watermelon  |5    |
|...|Tangerine   |3    |
|...|Tangerine   |6    |
+---+------------+-----+
This DataFrame has roughly 200 million rows in total.
We then did the following:
// imports below assume a recent isarn-sketches-spark, where TDigestAggregator lives in org.isarnproject.sketches.spark.tdigest
import org.apache.spark.sql.functions.col
import org.isarnproject.sketches.java.TDigest
import org.isarnproject.sketches.spark.tdigest.TDigestAggregator

val t_digest_udf1 = TDigestAggregator.udf[Double](compression = C, maxDiscrete = M)
val groupDf1 = originDf.groupBy(col("fruit_type")).agg(t_digest_udf1(col("count")) as "t_digests")
val row1 = groupDf1.first()
val t1 = row1.getAs[TDigest]("t_digests")
where (C, M) was respectively (100, 100), (0.01, 100), and (100, 10000), and the resulting single TDigest from each setting is t1, t2, and t3:
t1: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
t2: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
t3: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
It turns out that t1, t2, and t3 (all of the org.isarnproject.sketches.java.TDigest class) look exactly the same value-wise.
However, Boolean checks such as t1.equals(t2) all return false, which indicates these are somehow distinct TDigest instances.
Do you see anything off in this usage? Did we use the TDigest correctly, and more importantly, is our understanding of the algorithm correct?
Thank you and we look forward to your responses.
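One hedged way to compare the digests by behavior rather than by object identity, assuming cdfInverse is available on org.isarnproject.sketches.java.TDigest just as it is on the Scala TDigest shown earlier in this thread:

// compare quantile answers instead of relying on equals()
val quantiles = Seq(0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99)
val maxDiff = quantiles.map(q => math.abs(t1.cdfInverse(q) - t2.cdfInverse(q))).max
// a maxDiff near zero means the two sketches answer quantile queries identically,
// even if equals() treats them as distinct objects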
Upgrading to sbt 1.5 leads to a failure to resolve the genjavadoc-plugin transitive dependency.
This is not an issue with the project's functionality, but I am attempting to upgrade and it feels wrong not to be able to resolve dependencies.
Upgrading to sbt 1.5.5 produces this error:
[error] insecure HTTP request is unsupported 'Patterns(ivyPatterns=Vector(http://dl.bintray.com/content/sbt/sbt-plugin-releases/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)([branch]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), artifactPatterns=Vector(http://dl.bintray.com/content/sbt/sbt-plugin-releases/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)([branch]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), isMavenCompatible=false, descriptorOptional=false, skipConsistencyCheck=false)'; switch to HTTPS or opt-in as Resolver.url("bintray-sbt-plugin-releases", url(...)).withAllowInsecureProtocol(true), or by using allowInsecureProtocol in repositories file
[error] java.lang.RuntimeException: insecure protocol is unsupported
[error] at scala.sys.package$.error(package.scala:30)
[error] at sbt.Classpaths$.errorInsecureProtocol(Defaults.scala:3312)
[error] at sbt.Classpaths$.$anonfun$mkIvyConfiguration$1(Defaults.scala:3906)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error] at sbt.std.Transform$$anon$4.work(Transform.scala:68)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:282)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:23)
[error] at sbt.Execute.work(Execute.scala:291)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:282)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:64)
[error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
[error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[error] at java.base/java.lang.Thread.run(Thread.java:829)
Fixing the resolver URLs to use https produces this:
sbt:isarn-sketches> compile
[info] Updating
[info] Resolved dependencies
[warn]
[warn] Note: Unresolved dependencies path:
[error] stack trace is suppressed; run last isarn_sketches_java / update for the full output
[error] (isarn_sketches_java / update) sbt.librarymanagement.ResolveException: Error downloading com.typesafe.genjavadoc:genjavadoc-plugin_2.12.14:0.15
[error] Not found
[error] Not found
[error] not found: /development/projects/10_misc/isarn-sketches/isarn-sketches/project/.ivy/localcom.typesafe.genjavadoc/genjavadoc-plugin_2.12.14/0.15/ivys/ivy.xml
[error] not found: https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.14/0.15/genjavadoc-plugin_2.12.14-0.15.pom
I have tried a number of things to locate this plugin, but I am stuck.
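One more thing that may be worth trying, as a sketch only: it assumes the genjavadoc compiler plugin is pulled in through sbt-unidoc, and that 0.15 simply was never published for Scala 2.12.14, so bumping to a version that does have a 2.12.14 build might resolve it.

// build.sbt sketch; unidocGenjavadocVersion is sbt-unidoc's setting for the genjavadoc version
ThisBuild / unidocGenjavadocVersion := "0.18"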
This is a cross-reference to isarn/isarn-sketches-spark#12
Thanks to @JonathanTaws for reporting!