Hello.

Edit: everything here applies only to `Data.TDigest.Tree` (and hence `Data.TDigest`, which uses it by default). So there is a behavioral difference between the two implementations. I only realized this while finishing writing the issue; I tried to reformulate, but it may not be perfect.
First of all, thanks for the implementation and for maintaining `tdigest`. Finding an already available implementation which actually builds with GHC 9.8 is really appreciated.
I'm trying to use `tdigest` at https://www.novadiscovery.com/ to compute quantiles on the timepoints of time series. We have approximately 10^6 samples and we compute between 1000 and 10000 `TDigest`s in parallel; all of that is only possible with `TDigest` (otherwise we would need gigabytes of RAM ;)
We observed an annoying (but understandable) behavior (with `Data.TDigest.Tree`). One of our populations consists of more than half the same value, plus some other values.
For example, the following code:
```haskell
{-# LANGUAGE DataKinds #-}

import Data.TDigest
import Data.List (sort)

vals :: [Double]
vals =
  [ 46.7864511022219,   10
  , 36.92111451489304,  10
  , 36.97800013368023,  10
  , 30.390316982189294, 10
  , 46.0225846217038,   10
  , 34.14061064477964,  10
  , 41.345007276606864, 10
  , 38.118272170954114, 10
  , 38.53237659493007,  10
  , 31.443159990206272, 10
  ]

main :: IO ()
main = do
  let d = tdigest vals :: TDigest 1000
  print $ map (flip quantile d) [0, 0.25, 0.5, 0.75, 1]
```
We have a population of 20 samples, half of which are 10. Here are the samples in order:

```
[10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,30.390316982189294,31.443159990206272,34.14061064477964,36.92111451489304,36.97800013368023,38.118272170954114,38.53237659493007,41.345007276606864,46.0225846217038,46.7864511022219]
```
The result of computing quantiles 0, 0.25, 0.5, 0.75, and 1 is:

```
[Just 10.0,Just 15.097579245547323,Just 20.195158491094645,Just 37.54813615231717,Just 46.7864511022219]
```
As you can see, q0 and q1 are indeed correct. However, all other quantiles are largely inaccurate. For example, q0.25 should be 10 but is actually ≈15.1. In fact, every quantile q with q < 0.5 should have 10 as the correct value.
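For comparison, the exact empirical quantiles can be computed directly from the sorted samples. A quick nearest-rank sketch (`exactQuantile` is a hypothetical helper written for this issue, not part of `tdigest`):

```haskell
import Data.List (sort)

-- Nearest-rank empirical quantile: take the element at index floor(q * n),
-- clamped to the last index so q = 1 stays in bounds.
exactQuantile :: Double -> [Double] -> Double
exactQuantile q xs = sort xs !! min (n - 1) (floor (q * fromIntegral n))
  where
    n = length xs

main :: IO ()
main = do
  -- Ten copies of 10 plus the ten distinct samples (rounded here for brevity).
  let xs = replicate 10 10
        ++ [30.39, 31.44, 34.14, 36.92, 36.98, 38.12, 38.53, 41.35, 46.02, 46.79]
  print $ map (`exactQuantile` xs) [0, 0.25, 0.5, 0.75, 1]
  -- [10.0,10.0,30.39,38.12,46.79]
```

With these samples every quantile strictly below 0.5 lands on 10, which is the value the digest should approximate.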
I understand that `TDigest` is an approximation, and hence I understand the result. However, I'm surprised by the following:

- The fact that changing the compression value (here I set 1000) has little to no impact on the result.
- I've tried the `tdigest` package in R:

```r
> library(tdigest)
> vals = c(46.7864511022219,   10,
           36.92111451489304,  10,
           36.97800013368023,  10,
           30.390316982189294, 10,
           46.0225846217038,   10,
           34.14061064477964,  10,
           41.345007276606864, 10,
           38.118272170954114, 10,
           38.53237659493007,  10,
           31.443159990206272, 10)
> td <- tdigest(vals, 100)
> tquantile(td, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1))
[1] 10.00000 10.00000 10.00000 10.00000 10.00000 20.19516 37.54814 46.78645
> td <- tdigest(vals, 10)
> tquantile(td, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1))
[1] 10.00000 10.00000 10.59762 15.37859 17.76907 21.58438 37.67256 46.78645
```
And it shows the same inaccuracy with a "low" (i.e. 10) compression ratio, but with a compression ratio of 100 it is fine.

I was actually sure that R would fail with a large population, but here are the results with R and Haskell:
```r
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 20), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.14904 11.79275 14.68885 16.33256 16.76587 18.45570
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 30), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 11.38654 14.89294 16.83902 16.99491 18.45382
 [9] 17.70872 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 40), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 11.29545 14.68970 16.62703 17.03191 18.87390
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 50), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.97669 14.51943 16.57941 17.25110 16.98704
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 60), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.00710 10.02187 14.68962 20.00000 20.00000
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 70), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.00900 10.06510 14.47914 20.47230 20.00000
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 80), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.01259 10.11809 15.13858 20.00000 20.00000
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 90), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.00000 10.05731 14.79463 20.00000 20.00000
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 100), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.00000 10.00057 14.81047 20.00000 20.00000
 [9] 20.00000 20.00000 20.00000
```
Here we have a population of 200,000 values, half of them 10 and the other half 20, and I used `sample` to shuffle the insertion order. Except for the median, which is a bit inaccurate, R's `tdigest` gives pretty accurate results for compression factors >= 60.
In comparison, here is the Haskell `TDigest` (`shuffleM` comes from the `random-shuffle` package):

```haskell
ghci> v <- shuffleM $ replicate 100_000 (10 :: Double) ++ replicate 100_000 20
ghci> map (\q -> quantile q (tdigest v :: TDigest 10)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 50)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 100)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 1000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 10000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 100000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 1000000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
```
Actually, it is funny that the median is perfect, but quantiles other than 0, 0.5, and 1 seem linearly interpolated. Note how the compression factor has NO impact on the result.
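The output looks exactly like a straight line in q between the two extreme values. That is what you would get if, hypothetically, the whole population collapsed into just two centroids (mean 10 and mean 20, equal weight) and the quantile were linearly interpolated between their means. A minimal sketch of that interpolation (my assumption about what happens internally, not the library's actual code):

```haskell
-- Hypothetical model: two equally weighted centroids with means 10 and 20.
-- Linear interpolation between the means, as a function of q, reproduces the
-- observed [10, 11, 12, ..., 20] output exactly.
twoCentroidQuantile :: Double -> Double
twoCentroidQuantile q = 10 + (20 - 10) * q

main :: IO ()
main = print $ map twoCentroidQuantile [0, 0.25, 0.5, 0.75, 1]
-- [10.0,12.5,15.0,17.5,20.0]
```

If that model is right, it would also explain why the compression factor has no effect: once the mass is concentrated into the extreme centroids, more centroids do not change the interpolation.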
Just adding one 0 to the input completely shifts the results (see the `(0:)`):

```haskell
ghci> map (\q -> quantile q (tdigest (0:v) :: TDigest 1000000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 0.0,Just 6.99991,Just 8.99992,Just 10.999929999999999,Just 12.999940000000002,Just 14.99995,Just 15.999979999999999,Just 16.999985,Just 17.99999,Just 18.999995,Just 20.0]
```
Tested with:
Edit: just before submitting the issue, I realized that the alternative `Data.TDigest.Vector` implementation exists, which I had NOT tested before. Guess what? The results ARE different and more accurate:

```haskell
ghci> map (\q -> quantile q (tdigest v :: TDigest 10)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 10.0,Just 10.0,Just 10.0,Just 10.0,Just 15.000000000000002,Just 20.0,Just 20.0,Just 20.0,Just 20.0,Just 20.0]
```
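For reference, the two implementations can be compared side by side just by switching the qualified import; as far as I can tell both modules expose the same `tdigest`/`quantile` API (a sketch, assuming the APIs really do match, requires the `tdigest` package):

```haskell
{-# LANGUAGE DataKinds #-}

-- Assumption: Data.TDigest.Tree and Data.TDigest.Vector share the same API,
-- so only the qualified import differs between the two runs.
import qualified Data.TDigest.Tree   as Tree
import qualified Data.TDigest.Vector as Vector

main :: IO ()
main = do
  let xs = replicate 10 (10 :: Double) ++ replicate 10 20
      qs = [0, 0.25, 0.5, 0.75, 1]
  print $ map (\q -> Tree.quantile   q (Tree.tdigest   xs :: Tree.TDigest   10)) qs
  print $ map (\q -> Vector.quantile q (Vector.tdigest xs :: Vector.TDigest 10)) qs
```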
Questions

- Why is `TDigest.Tree` different from `TDigest.Vector`?
- Why does the compression factor on `TDigest.Tree` have NO effect on this problem? (Note that `TDigest.Vector` seems to handle this case perfectly WHATEVER the compression factor; I need to check with a less arbitrary population.)