isarn / isarn-sketches
Sketching data structures for Scala, including t-digest
License: Apache License 2.0
If I take the monoidal sum over some t-digests, the resulting distribution behavior looks a bit skewed:
scala> import org.isarnproject.sketchesAlgebirdAPI.implicits._
import org.isarnproject.sketchesAlgebirdAPI.implicits._
scala> val data = Vector.fill(100) { Vector.fill(10000) { scala.util.Random.nextGaussian() } }
data: scala.collection.immutable.Vector[scala.collection.immutable.Vector[Double]] = Vector(Vector(0.1778102040514962, ...
scala> val tdvec = data.map(TDigest.sketch(_))
tdvec: scala.collection.immutable.Vector[org.isarnproject.sketches.TDigest] = Vector(TDigest(0.5,75,TDigestMap(-4.215406387806561 -> (1.0, 1.0), ...
scala> val tdsum = com.twitter.algebird.Monoid.sum(tdvec)
tdsum: org.isarnproject.sketches.TDigest = TDigest(0.5,113,TDigestMap(-4.776867999224537 -> (1.0, 1.0), ...
scala> tdsum.cdf(0)
res5: Double = 0.46485982756854377 // should be close to 0.5
scala> tdsum.cdfInverse(0.5)
res6: Double = 0.19202097508054178 // should be close to 0
A theory (unverified) is that as the sum accumulates on the left-hand side, the cluster-mass disparity between the LHS and RHS makes the current ++ logic unstable.
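If that theory holds, one way to probe it would be a pairwise (tree-shaped) reduction, so every merge sees two digests of comparable total mass instead of one huge accumulator and one small digest. This is only an experiment sketch using the ++ merge shown above, not a proposed library change:

import org.isarnproject.sketches.TDigest

// merge digests pairwise until one remains, keeping operand masses balanced at each ++
def treeSum(tds: Vector[TDigest]): TDigest = {
  require(tds.nonEmpty, "need at least one t-digest")
  if (tds.length == 1) tds.head
  else treeSum(
    tds.grouped(2).map {
      case Vector(a, b) => a ++ b
      case Vector(a)    => a
    }.toVector
  )
}

// compare against the left-fold result above, e.g. treeSum(tdvec).cdf(0) vs tdsum.cdf(0)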
Hi Erik,
I'm using the isarn-sketches-spark package to compute t-digests on Spark and would like to persist TDigestSQL or TDigest data into a PostgreSQL database. Potentially I would also like to be able to load the TDigest data back from the database.
What would be the best way of doing that, if it's possible at all? Thanks very much!
Richard
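One possible approach, as a sketch only and not an official isarn API: since TDigest values are shipped around by Spark they should be java.io.Serializable, so they can be round-tripped through a PostgreSQL bytea column as an opaque blob. The table name, column names, and connection string below are placeholders, not anything defined by the library:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.sql.DriverManager
import org.isarnproject.sketches.TDigest

// serialize / deserialize a TDigest as an opaque byte array
def tdigestToBytes(td: TDigest): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(td)
  oos.close()
  bos.toByteArray
}

def tdigestFromBytes(bytes: Array[Byte]): TDigest = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[TDigest] finally ois.close()
}

// hypothetical table: create table sketches (id text primary key, digest bytea)
val td = TDigest.sketch(Vector.fill(1000)(scala.util.Random.nextGaussian()))
val conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")
val ins = conn.prepareStatement("insert into sketches (id, digest) values (?, ?)")
ins.setString(1, "my-metric")
ins.setBytes(2, tdigestToBytes(td))
ins.executeUpdate()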
This extension of prefix-sum trees to multiple dimensions might apply:
https://arxiv.org/pdf/1311.6093.pdf
Pushkar Mishra (2013). "A New Algorithm for Updating and Querying Sub-arrays of Multidimensional Arrays". arXiv:1311.6093 [cs.DS]. doi:10.13140/RG.2.1.2394.2485
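For background, the 1-D structure the paper generalizes is the standard Fenwick (binary indexed) tree; a minimal sketch of that base case, included only to illustrate what "prefix-sum tree" refers to here:

// minimal 1-D Fenwick (binary indexed) tree: point updates and prefix sums in O(log n)
class FenwickTree(n: Int) {
  private val tree = Array.fill(n + 1)(0.0)

  // add delta at 1-based index i
  def update(i: Int, delta: Double): Unit = {
    var j = i
    while (j <= n) { tree(j) += delta; j += j & -j }
  }

  // sum of elements at indices 1..i
  def prefixSum(i: Int): Double = {
    var j = i
    var s = 0.0
    while (j > 0) { s += tree(j); j -= j & -j }
    s
  }
}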
The main triggering scenario for re-clustering is when the distribution of the incoming data begins to move. In the "null-hypothesis" case where the distribution is stationary, the expected number of new clusters (as a function of N, the number of samples seen so far) is 2·log(N). If the actual number of clusters is growing substantially faster than that, it is most likely because new extrema are being encountered faster than they "should" be under a stationary distribution.
The idea would be to work out the details of how to turn this basic idea into a more statistically grounded test for when to trigger a re-clustering.
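As a starting point, a very rough sketch of what such a test could look like; nothing here is in the library, and the slack multiplier is an arbitrary illustrative threshold:

// trigger re-clustering when new-cluster growth outpaces the stationary expectation
def shouldRecluster(samplesSeen: Long, newClustersSeen: Long, slack: Double = 3.0): Boolean = {
  // under a stationary distribution, expect roughly 2 * ln(N) new clusters after N samples
  val expected = 2.0 * math.log(math.max(samplesSeen, 2L).toDouble)
  newClustersSeen > slack * expected
}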
Related: if the sketch needs a re-clustering due to the above scenario, that implies only the tails of the distribution need to be re-clustered; that is, a bunch of singleton clusters are accumulating at one or both ends. Figuring out a good way to re-cluster only these tails would be faster and probably more numerically stable. See for example this. Possibly there is a way to re-insert these tail clusters using some kind of discounted measure of the quantile, so the insert logic is persuaded to make larger clusters closer in to the distribution body.
Greetings, Mr. Erlandson (@erikerlandson),
The team I represent recently started using this solution and we've been trying to tune accuracy and performance via
different C (compression) and M (maxDiscrete) parameters.
In our case, we have a DataFrame called "originDf" with 3 columns ("id": String, "fruit_type": String, "count": Integer):
+---+------------+-----+
|id |fruit_type  |count|
+---+------------+-----+
|001|Apples      |2    |
|002|Apricots    |79   |
|001|Avocados    |4    |
|003|Watermelon  |13   |
|007|Blueberries |5    |
|007|Cherries    |6    |
|007|Clementine  |41   |
|007|Cucumbers   |5    |
|007|Elderberry  |3    |
|007|Eggfruit    |1    |
|008|Eggfruit    |19   |
|012|Clementine  |61   |
|013|Blueberries |21   |
|014|Blueberries |4    |
|...|Lime        |4    |
|...|Rambutan    |3    |
|...|Strawberries|6    |
|...|Watermelon  |5    |
|...|Tangerine   |3    |
|...|Tangerine   |6    |
+---+------------+-----+
This DataFrame has roughly 200 million rows in total.
We then did the following:
// imports below assume a recent isarn-sketches-spark, where TDigestAggregator lives in org.isarnproject.sketches.spark.tdigest
import org.apache.spark.sql.functions.col
import org.isarnproject.sketches.java.TDigest
import org.isarnproject.sketches.spark.tdigest.TDigestAggregator

val t_digest_udf1 = TDigestAggregator.udf[Double](compression = C, maxDiscrete = M)
val groupDf1 = originDf.groupBy(col("fruit_type")).agg(t_digest_udf1(col("count")) as "t_digests")
val row1 = groupDf1.first()
val t1 = row1.getAs[TDigest]("t_digests")
where (C, M) was respectively (100, 100), (0.01, 100), and (100, 10000), and the resulting single TDigest from each setting is t1, t2, and t3:
t1: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
t2: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
t3: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
It turns out that t1, t2, and t3 (all of the org.isarnproject.sketches.java.TDigest class) look exactly the same value-wise.
However, Boolean checks such as t1.equals(t2) all return false, which indicates these are somehow distinct TDigest instances.
Do you see anything off in this usage? Did we use the TDigest correctly, and more importantly, is our understanding of the algorithm correct?
Thank you and we look forward to your responses.
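One hedged way to compare the digests by behavior rather than by object identity, assuming cdfInverse is available on org.isarnproject.sketches.java.TDigest just as it is on the Scala TDigest shown earlier in this thread:

// compare quantile answers instead of relying on equals()
val quantiles = Seq(0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99)
val maxDiff = quantiles.map(q => math.abs(t1.cdfInverse(q) - t2.cdfInverse(q))).max
// a maxDiff near zero means the two sketches answer quantile queries identically,
// even if equals() treats them as distinct objects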
Upgrading to sbt 1.5 leads to a failure to resolve the genjavadoc-plugin transitive dependency.
This is not an issue with the project's functionality, but I am attempting to upgrade and it feels wrong not to be able to resolve dependencies.
Upgrading to sbt 1.5.5 produces this error:
[error] insecure HTTP request is unsupported 'Patterns(ivyPatterns=Vector(http://dl.bintray.com/content/sbt/sbt-plugin-releases/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)([branch]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), artifactPatterns=Vector(http://dl.bintray.com/content/sbt/sbt-plugin-releases/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)([branch]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), isMavenCompatible=false, descriptorOptional=false, skipConsistencyCheck=false)'; switch to HTTPS or opt-in as Resolver.url("bintray-sbt-plugin-releases", url(...)).withAllowInsecureProtocol(true), or by using allowInsecureProtocol in repositories file
[error] java.lang.RuntimeException: insecure protocol is unsupported
[error] at scala.sys.package$.error(package.scala:30)
[error] at sbt.Classpaths$.errorInsecureProtocol(Defaults.scala:3312)
[error] at sbt.Classpaths$.$anonfun$mkIvyConfiguration$1(Defaults.scala:3906)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error] at sbt.std.Transform$$anon$4.work(Transform.scala:68)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:282)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:23)
[error] at sbt.Execute.work(Execute.scala:291)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:282)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:64)
[error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
[error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[error] at java.base/java.lang.Thread.run(Thread.java:829)
Fixing the resolver URLs to use https produces this:
sbt:isarn-sketches> compile
[info] Updating
[info] Resolved dependencies
[warn]
[warn] Note: Unresolved dependencies path:
[error] stack trace is suppressed; run last isarn_sketches_java / update for the full output
[error] (isarn_sketches_java / update) sbt.librarymanagement.ResolveException: Error downloading com.typesafe.genjavadoc:genjavadoc-plugin_2.12.14:0.15
[error] Not found
[error] Not found
[error] not found: /development/projects/10_misc/isarn-sketches/isarn-sketches/project/.ivy/localcom.typesafe.genjavadoc/genjavadoc-plugin_2.12.14/0.15/ivys/ivy.xml
[error] not found: https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.14/0.15/genjavadoc-plugin_2.12.14-0.15.pom
I have tried a number of things to locate this plugin, but I am stuck.
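One more thing that may be worth trying, as a sketch only: it assumes the genjavadoc compiler plugin is pulled in through sbt-unidoc, and that 0.15 simply was never published for Scala 2.12.14, so bumping to a version that does have a 2.12.14 build might resolve it.

// build.sbt sketch; unidocGenjavadocVersion is sbt-unidoc's setting for the genjavadoc version
ThisBuild / unidocGenjavadocVersion := "0.18"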
This is a cross-reference to isarn/isarn-sketches-spark#12
Thanks to @JonathanTaws for reporting!