Hello.

Edit: everything here applies only to `Data.TDigest.Tree` (and hence `Data.TDigest`, which uses it by default). So there is a behavioral difference between the two implementations. I only realized this while finishing writing the issue; I tried to reformulate, but it may not be perfect.
First of all, thanks for the implementation and for maintaining `tdigest`. Finding an already available implementation which actually builds with GHC 9.8 is really appreciated.
I'm trying to use `tdigest` at https://www.novadiscovery.com/ to compute quantiles on the timepoints of time series. We have approximately 10^6 samples and we compute between 1000 and 10000 `TDigest`s in parallel; all of that is only possible with `TDigest` (otherwise we would need gigabytes of RAM ;)
We observed an annoying (but understandable) behavior (with `Data.TDigest.Tree`). One of our populations consists of more than half the same value, plus some other values.
For example, the following code:
```haskell
{-# LANGUAGE DataKinds #-}

import Data.TDigest
import Data.List (sort)

vals :: [Double]
vals =
  [ 46.7864511022219,   10
  , 36.92111451489304,  10
  , 36.97800013368023,  10
  , 30.390316982189294, 10
  , 46.0225846217038,   10
  , 34.14061064477964,  10
  , 41.345007276606864, 10
  , 38.118272170954114, 10
  , 38.53237659493007,  10
  , 31.443159990206272, 10
  ]

main :: IO ()
main = do
  let d = tdigest vals :: TDigest 1000
  print $ map (flip quantile d) [0, 0.25, 0.5, 0.75, 1]
```
We have a population of 20 samples, half of which are 10. Here are the samples in order:

```
[10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,30.390316982189294,31.443159990206272,34.14061064477964,36.92111451489304,36.97800013368023,38.118272170954114,38.53237659493007,41.345007276606864,46.0225846217038,46.7864511022219]
```
The result of computing quantiles 0, 0.25, 0.5, 0.75, and 1 is:

```
[Just 10.0,Just 15.097579245547323,Just 20.195158491094645,Just 37.54813615231717,Just 46.7864511022219]
```
As you can see, q0 and q1 are indeed correct. However, all other quantiles are largely inaccurate. For example, q0.25 should be 10 but is actually ≈15.1. In fact, every quantile q with q < 0.5 should have 10 as the correct value.
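For comparison, the exact empirical quantiles can be computed directly from the sorted samples. A quick nearest-rank sketch (`exactQuantile` is a hypothetical helper written for this issue, not part of `tdigest`):

```haskell
import Data.List (sort)

-- Nearest-rank empirical quantile: take the element at index floor(q * n),
-- clamped to the last index so q = 1 stays in bounds.
exactQuantile :: Double -> [Double] -> Double
exactQuantile q xs = sort xs !! min (n - 1) (floor (q * fromIntegral n))
  where
    n = length xs

main :: IO ()
main = do
  -- Ten copies of 10 plus the ten distinct samples (rounded here for brevity).
  let xs = replicate 10 10
        ++ [30.39, 31.44, 34.14, 36.92, 36.98, 38.12, 38.53, 41.35, 46.02, 46.79]
  print $ map (`exactQuantile` xs) [0, 0.25, 0.5, 0.75, 1]
  -- [10.0,10.0,30.39,38.12,46.79]
```

With these samples every quantile strictly below 0.5 lands on 10, which is the value the digest should approximate.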
I understand that `TDigest` is an approximation, and hence I understand the result. However, I'm surprised by the following:

- The fact that changing the compression value (here I set 1000) has little to no impact on the result.
- I've tried the `tdigest` package in R:

```r
> library(tdigest)
> vals = c(46.7864511022219,   10,
           36.92111451489304,  10,
           36.97800013368023,  10,
           30.390316982189294, 10,
           46.0225846217038,   10,
           34.14061064477964,  10,
           41.345007276606864, 10,
           38.118272170954114, 10,
           38.53237659493007,  10,
           31.443159990206272, 10)
> td <- tdigest(vals, 100)
> tquantile(td, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1))
[1] 10.00000 10.00000 10.00000 10.00000 10.00000 20.19516 37.54814 46.78645
> td <- tdigest(vals, 10)
> tquantile(td, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1))
[1] 10.00000 10.00000 10.59762 15.37859 17.76907 21.58438 37.67256 46.78645
```
And it shows the same inaccuracy with a "low" (i.e. 10) compression ratio, but with a compression ratio of 100 it is fine.

I was actually sure that R would fail with a large population, but here are the results with R and Haskell:
```r
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 20), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.14904 11.79275 14.68885 16.33256 16.76587 18.45570
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 30), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 11.38654 14.89294 16.83902 16.99491 18.45382
 [9] 17.70872 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 40), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 11.29545 14.68970 16.62703 17.03191 18.87390
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 50), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.97669 14.51943 16.57941 17.25110 16.98704
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 60), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.00710 10.02187 14.68962 20.00000 20.00000
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 70), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.00900 10.06510 14.47914 20.47230 20.00000
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 80), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.01259 10.11809 15.13858 20.00000 20.00000
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 90), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.00000 10.05731 14.79463 20.00000 20.00000
 [9] 20.00000 20.00000 20.00000
> tquantile(tdigest(sample(c(rep(10, 100000), rep(20, 100000))), 100), c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
 [1] 10.00000 10.00000 10.00000 10.00000 10.00057 14.81047 20.00000 20.00000
 [9] 20.00000 20.00000 20.00000
```
Here we have a population of 200,000 values, half of them 10 and the other half 20, and I used `sample` to shuffle the insertion order. Except for the median, which is a bit inaccurate, R's `tdigest` gives pretty accurate results for compression factors >= 60.
In comparison, here is the Haskell `TDigest` (`shuffleM` comes from the `random-shuffle` package):

```haskell
ghci> v <- shuffleM $ replicate 100_000 (10 :: Double) ++ replicate 100_000 20
ghci> map (\q -> quantile q (tdigest v :: TDigest 10)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 50)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 100)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 1000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 10000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 100000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
ghci> map (\q -> quantile q (tdigest v :: TDigest 1000000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 11.0,Just 12.0,Just 13.0,Just 14.0,Just 15.0,Just 16.0,Just 17.0,Just 18.0,Just 19.0,Just 20.0]
```
Actually, it is funny that the median is perfect, but quantiles other than 0, 0.5, and 1 seem linearly interpolated. Note how the compression factor has NO impact on the result.
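The output looks exactly like a straight line in q between the two extreme values. That is what you would get if, hypothetically, the whole population collapsed into just two centroids (mean 10 and mean 20, equal weight) and the quantile were linearly interpolated between their means. A minimal sketch of that interpolation (my assumption about what happens internally, not the library's actual code):

```haskell
-- Hypothetical model: two equally weighted centroids with means 10 and 20.
-- Linear interpolation between the means, as a function of q, reproduces the
-- observed [10, 11, 12, ..., 20] output exactly.
twoCentroidQuantile :: Double -> Double
twoCentroidQuantile q = 10 + (20 - 10) * q

main :: IO ()
main = print $ map twoCentroidQuantile [0, 0.25, 0.5, 0.75, 1]
-- [10.0,12.5,15.0,17.5,20.0]
```

If that model is right, it would also explain why the compression factor has no effect: once the mass is concentrated into the extreme centroids, more centroids do not change the interpolation.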
Just adding one 0 to the input completely shifts the results (see the `(0:)`):

```haskell
ghci> map (\q -> quantile q (tdigest (0:v) :: TDigest 1000000)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 0.0,Just 6.99991,Just 8.99992,Just 10.999929999999999,Just 12.999940000000002,Just 14.99995,Just 15.999979999999999,Just 16.999985,Just 17.99999,Just 18.999995,Just 20.0]
```
Tested with:
Edit: just before submitting the issue, I realized that the alternative `Data.TDigest.Vector` implementation exists, which I had NOT tested before. Guess what? The results ARE different and more accurate:

```haskell
ghci> map (\q -> quantile q (tdigest v :: TDigest 10)) [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
[Just 10.0,Just 10.0,Just 10.0,Just 10.0,Just 10.0,Just 15.000000000000002,Just 20.0,Just 20.0,Just 20.0,Just 20.0,Just 20.0]
```
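For reference, the two implementations can be compared side by side just by switching the qualified import; as far as I can tell both modules expose the same `tdigest`/`quantile` API (a sketch, assuming the APIs really do match, requires the `tdigest` package):

```haskell
{-# LANGUAGE DataKinds #-}

-- Assumption: Data.TDigest.Tree and Data.TDigest.Vector share the same API,
-- so only the qualified import differs between the two runs.
import qualified Data.TDigest.Tree   as Tree
import qualified Data.TDigest.Vector as Vector

main :: IO ()
main = do
  let xs = replicate 10 (10 :: Double) ++ replicate 10 20
      qs = [0, 0.25, 0.5, 0.75, 1]
  print $ map (\q -> Tree.quantile   q (Tree.tdigest   xs :: Tree.TDigest   10)) qs
  print $ map (\q -> Vector.quantile q (Vector.tdigest xs :: Vector.TDigest 10)) qs
```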
Questions

- Why is `TDigest.Tree` different from `TDigest.Vector`?
- Why does the compression factor on `TDigest.Tree` have NO effect on this problem? (Note that `TDigest.Vector` seems to handle this case perfectly WHATEVER the compression factor; I need to check with a less arbitrary population.)