
twosamples's Introduction

twosamples


The goal of twosamples is to allow easy two-sample testing using permutations. This package offers quick (O(n^2) worst case) and memory-light (O(n)) code for a number of standard distance metrics between two samples. By using permutations of the observed samples to calculate null distributions for those metrics, the package also provides assumption-free p-values for testing whether the two samples come from the same distribution.

Details about the DTS statistic's calculation and performance can be found here.

Example R code which builds the test functions can be found on GitHub here. This code uses the release v1.0.0 algorithm for simplicity, and it is in R rather than C++, so it is substantially slower.

Installation

You can install the released version of twosamples from CRAN with:

install.packages("twosamples")

Or you can install the most recent (possibly unstable) version from GitHub:

library(remotes)
install_github("cdowd/twosamples")

System Requirements

Going forward (v2.0.0 onwards), twosamples requires C++11 support. This is true of most R platforms and available for all of them. Compilation also requires headers from the cpp11 package.

Example

In this example, we have 100 observations from each of two different Normal distributions.

library(twosamples)
vec1 = rnorm(100)
vec2 = rnorm(100,0.5)
output = two_sample(vec1,vec2)
output
#> Test Stat   P-Value 
#>  11.33121   0.01500
summary(output)
#> DTS Test 
#> =========================
#> Test Statistic: 11.33121 
#>        P-Value: 0.015 *
#> - - - - - - - - - - - - -
#>      n1      n2 n.boots 
#>     100     100    2000 
#> =========================
#> Test stat rejection threshold for alpha = 0.05 is: 9.949129 
#> Null rejected: samples are from different distributions
plot(output)

Metric Example Calculations

This section reviews each of the different tests offered. I'll give a brief mathematical description of each test, then some quick intuition. After that, I'll show R code that builds the test statistic, to provide a full description. That code is not the code underlying this package, but the package's C++ code is a rewrite of it.

Each of these test statistics is based on the ECDFs of the samples. I'll refer to E(x) as the ECDF of sample 1, F(x) as the ECDF of sample 2, and G(x) as the ECDF of the joint sample. The ECDF takes a real input and returns the proportion of the sample observed at or below that point.
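For a quick illustration of that notation, here is a minimal sketch using base R's stats::ecdf() (the names x1, x2, E_hat, F_hat, and G_hat are mine, not package objects):

set.seed(1)
x1 = rnorm(5)              # a small sample 1
x2 = rnorm(5, 0.5)         # a small sample 2
E_hat = ecdf(x1)           # E(x): ECDF of sample 1
F_hat = ecdf(x2)           # F(x): ECDF of sample 2
G_hat = ecdf(c(x1, x2))    # G(x): ECDF of the joint sample
E_hat(0)                   # proportion of sample 1 at or below 0
G_hat(0)                   # proportion of the joint sample at or below 0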

Kolmogorov-Smirnov Test

The KS test finds the largest (vertical) difference between the two ECDFs. See ks_test().

Demonstration R Code Calculating Test Statistic
ks_stat_R = function(vec1,vec2,power=1) {
  n1 = length(vec1)
  n2 = length(vec2)
  n = n1+n2
  
  joint.sample = c(vec1,vec2)
  e = c(rep(1/n1,n1),rep(0,   n2))
  f = c(rep(0,   n1),rep(1/n2,n2))
  
  ind = order(joint.sample)
  d = joint.sample[ind]
  e = e[ind]
  f = f[ind]
  
  out = 0
  Ecur = 0
  Fcur = 0
  height = 0
  for (i in 1:(n-1)) {
    Ecur = Ecur + e[i]
    Fcur = Fcur + f[i]
    if (d[i] != d[i+1]) height = abs(Fcur-Ecur)
    if (height > out) out = height
  }
  out^power
}

In the example plot below, the KS statistic is the height of the vertical black line.
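The plots themselves are generated by the package, but a rough stand-in can be drawn with base R's stats::ecdf() and plot() (a minimal sketch using vec1 and vec2 from the example above, not the package's plot method):

plot(ecdf(vec1), verticals = TRUE, do.points = FALSE, main = "ECDFs of the two samples")
plot(ecdf(vec2), verticals = TRUE, do.points = FALSE, add = TRUE, col = "red")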

Kuiper Test

The Kuiper test is much the same as Kolmogorov-Smirnov, but it sums the largest difference in each direction. That is, it cares about both the largest value of E(x)-F(x) and the largest value of F(x)-E(x). In some sense this trades some power against mean shifts for power against changes in variance. See kuiper_test().

Demonstration R Code Calculating Test Statistic
kuiper_stat_R = function(vec1,vec2,power=1) { 
  n1 = length(vec1)
  n2 = length(vec2)
  n = n1+n2
  
  joint.sample = c(vec1,vec2)
  e = c(rep(1/n1,n1),rep(0,   n2))
  f = c(rep(0,   n1),rep(1/n2,n2))
  
  ind = order(joint.sample)
  d = joint.sample[ind]
  e = e[ind]
  f = f[ind]
  
  up = 0
  down = 0
  Ecur = 0
  Fcur = 0
  height = 0
  for (i in 1:(n-1)) {
    Ecur = Ecur + e[i]
    Fcur = Fcur + f[i]
    if (d[i] != d[i+1]) height = Fcur-Ecur
    if (height > up) up = height
    if (height < down) down = height
  }
  abs(down)^power + abs(up)^power
}

In the example plot below, the Kuiper statistic is the sum of the heights of the vertical black lines.

Cramer-Von Mises

The Cramer-Von Mises criterion further extends the intuition of Kuiper and KS. It sums |F(x)-E(x)|^p across every observation x in the joint sample. This use of the full joint sample gives it a substantial power gain, particularly against shifts in higher moments. See cvm_test().

Demonstration R Code Calculating Test Statistic
cvm_stat_R = function(vec1,vec2,power=2){
  n1 = length(vec1)
  n2 = length(vec2)
  n = n1+n2
  
  joint.sample = c(vec1,vec2)
  e = c(rep(1/n1,n1),rep(0,   n2))
  f = c(rep(0,   n1),rep(1/n2,n2))
  
  ind = order(joint.sample)
  d = joint.sample[ind]
  e = e[ind]
  f = f[ind]
  
  out = 0
  Ecur = 0
  Fcur = 0
  height = 0
  dups = 1
  for (i in 1:(n-1)) {
    Ecur = Ecur + e[i]
    Fcur = Fcur + f[i]
    height = abs(Fcur-Ecur)
    if (d[i] != d[i+1]) {
      out = out + (height^power)*dups
      dups = 1
    } else if (d[i] == d[i+1]) {
      dups = dups+1
    }
  }
  out
}

In the example plot below, the CVM statistic is the sum of the heights of the vertical black lines.

Anderson-Darling

The Anderson-Darling test starts from the Cramer-Von Mises logic. However, it notes that some points on the joint ECDF have higher variance than others. Because there is more noise in those observations, they should receive a lower weight. More than that, we can describe the 'optimal' weighting function, which is closely related to the joint ECDF, G. See ad_test().

Demonstration R Code Calculating Test Statistic
ad_stat_R = function(vec1,vec2,power=2){
  n1 = length(vec1)
  n2 = length(vec2)
  n = n1+n2
  
  joint.sample = c(vec1,vec2)
  e = c(rep(1/n1,n1),rep(0,   n2))
  f = c(rep(0,   n1),rep(1/n2,n2))
  
  ind = order(joint.sample)
  d = joint.sample[ind]
  e = e[ind]
  f = f[ind]
  
  out = 0
  Ecur = 0
  Fcur = 0
  Gcur = 0
  height = 0
  dups = 1
  
  for (i in 1:(n-1)) {
    Ecur = Ecur + e[i]
    Fcur = Fcur + f[i]
    Gcur = Gcur+1/n
    sd = (2*Gcur*(1-Gcur)/n)**0.5
    height = abs(Fcur-Ecur)
    if (d[i] != d[i+1]) {
      out = out + ((height/sd)^power)*dups
      dups = 1
    } else if (d[i] == d[i+1]) {
      dups = dups+1
    }
  }
  out
}

In the example plot below, we see the variance of the joint ECDF over the range of the data. It clearly peaks in the middle of the joint sample.

In the example plot below, the AD statistic is the weighted sum of the heights of the vertical lines, where weights are represented by the shading of the lines.

Wasserstein

The Wasserstein distance is not normally a two-sample distance, but the extension is simple. In terms of the ECDFs it can be described as the area between them. Intuitively, the improvement this offers over CVM is that it allows more extreme values to change our conclusions. That is to say, KS, Kuiper, CVM, and AD all operate on ordinal information, with no sense of the distance between observations of different rank. Wasserstein (and DTS below) requires interval data. By using the extra information encoded in interval data, Wasserstein and DTS improve on CVM and AD. See wass_test().

Demonstration R Code Calculating Test Statistic
wass_stat_R = function(vec1,vec2,power=1) {
  n1 = length(vec1)
  n2 = length(vec2)
  n = n1+n2
  
  joint.sample = c(vec1,vec2)
  e = c(rep(1/n1,n1),rep(0,   n2))
  f = c(rep(0,   n1),rep(1/n2,n2))
  
  ind = order(joint.sample)
  d = joint.sample[ind]
  e = e[ind]
  f = f[ind]
  
  out = 0
  Ecur = 0
  Fcur = 0
  height = 0
  width = 0
  for (i in 1:(n-1)) {
    Ecur = Ecur + e[i]
    Fcur = Fcur + f[i]
    height = abs(Fcur-Ecur)
    width = d[i+1]-d[i]
    out = out + (height^power)*width
  }
  out
}

In the example plot below, the Wasserstein statistic is the shaded area between the ECDFs.

DTS/two_sample

If the Wasserstein metric improves on CVM by moving it into the realm of interval data, then DTS improves on AD by doing the same. Alternatively, DTS offers the same improvement over Wasserstein that AD offers over CVM. See dts_test() (AKA two_sample()).

Demonstration R Code Calculating Test Statistic
dts_stat_R = function(vec1,vec2,power=1) {
  n1 = length(vec1)
  n2 = length(vec2)
  n = n1+n2
  
  joint.sample = c(vec1,vec2)
  e = c(rep(1/n1,n1),rep(0,   n2))
  f = c(rep(0,   n1),rep(1/n2,n2))
  
  ind = order(joint.sample)
  d = joint.sample[ind]
  e = e[ind]
  f = f[ind]
  
  out = 0
  Ecur = 0
  Fcur = 0
  Gcur = 0
  height = 0
  width = 0 
  for (i in 1:(n-1)) {
    Ecur = Ecur + e[i]
    Fcur = Fcur + f[i]
    Gcur = Gcur+1/n
    sd = (2*Gcur*(1-Gcur)/n)**0.5
    height = abs(Fcur-Ecur)
    width = d[i+1]-d[i]
    out = out + ((height/sd)^power)*width
  }
  out
}

In the example plot below, the DTS statistic is the shaded area between the ECDFs, weighted by the variances – shown by the color of the shading.

Permutation Testing

Once we have a metric measuring the distance between two samples, it's easy enough to code up a testing procedure.

Specifically, we want to test the null hypothesis H0: J = K, where Sample 1 is distributed i.i.d. from J and Sample 2 is i.i.d. from K.

Broadly, under this null, observations from sample 1 and sample 2 are exchangeable. That is, by randomly swapping observations between the samples, we can generate a new, equally likely data set from the same (null) DGP. We can then calculate our distance metric on this new data set. By repeating this procedure many times, we generate a null distribution for the distance metric, and we compare the observed distance to that null distribution. Because distances are non-negative, this is a one-sided comparison. (Though if you wanted to test whether a sample was hyper-regular, you could do that by looking at the one-sided test in the other direction.)

Moreover, unlike other, more general two-sample procedures, we do not rely on any properties of the underlying distributions. Some procedures require continuity, derivatives, or smoothness to get their asymptotic and finite-sample guarantees; those are procedures which estimate the underlying density rather than simply working with the ECDF. The testing procedures in this package rely only on the observations being i.i.d. Independence is necessary for the exchangeability statement to hold; identical distribution is necessary for the null hypothesis to be a sensible claim.
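As a rough sketch of the whole procedure (using the ks_stat_R function defined above rather than the package's C++ code; the function and argument names here are mine):

perm_test_R = function(vec1, vec2, stat_fn = ks_stat_R, nboots = 2000) {
  n1 = length(vec1)
  joint.sample = c(vec1, vec2)
  observed = stat_fn(vec1, vec2)
  # Reshuffle the sample labels nboots times to build a null distribution
  null_stats = replicate(nboots, {
    shuffled = sample(joint.sample)
    stat_fn(shuffled[1:n1], shuffled[-(1:n1)])
  })
  # One-sided p-value: how often is the permuted distance at least as large?
  mean(null_stats >= observed)
}

The package's functions handle ties, small-count p-value corrections, and output formatting differently, but the logic is the same.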


twosamples's Issues

Possible Speedups

It looks like speculative execution sometimes slows down code in situations where there is branching.

See e.g. https://unomerite.medium.com/on-summing-integers-608b2063583

Worth doing some testing to see if the "if" blocks inside the main loops are behaving in this manner.
It is possible that the behavior depends on the system, but the proposed multiplication-style path forward looks like it should be fairly universally quick anyhow. In particular, a factor-of-2 speed improvement in exchange for a 20% slowdown on some systems is not crazy.

Specifically, sign reversals are likely to benefit from this substantially.

I am not capable of maintaining different code for different systems. And in many situations, this is possibly a zero-sum improvement.

Priority: medium.

Automated testing needs attention

Partial list:

  • testing of duplicates is inadequate (e.g. should ensure we test that KS/kuiper don't grab a sequential max ruled out because it is a duplicate)
  • testing of kuiper requires abs?

Citation of the package

Hello,

I'm using the package for my research and I want to know how to cite it properly.

Thank you,

best,

Gustavo.

Memory Usage of Test_Stat functions is O(7N) and could be O(4N)

Each of the test stat functions has vectors d, e, f, dd, ee, ff, each of length N. The double-lettered vectors are simply the single-lettered ones after sorting. The single-lettered versions are never used again after sorting, but they remain in memory until the function exits. We should be able to drop the double-lettered versions and replace the single-lettered versions with their sorted counterparts (possibly some kind of pointer issue here -- will test). If that works, we can nearly halve the memory usage.
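In R terms the idea looks like the following (a hypothetical sketch; the actual change would be in the C++ code):

ind = order(joint.sample)
# current pattern: sorted copies live alongside the originals
dd = joint.sample[ind]; ee = e[ind]; ff = f[ind]
# proposed pattern: overwrite in place so only the sorted versions remain
joint.sample = joint.sample[ind]; e = e[ind]; f = f[ind]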

Weighting schemes

Should be able to incorporate observation weights with ease. This has been requested by several users.

Calculation of the AD statistics

Hi,

Thanks for creating the package!

I'm running some tests on the AD test, and have a question on the calculation of the AD statistics.
According to the README.md or ad_test.Rd file, the AD statistic is calculated in the following way:

$AD = \sum_{x \in k} \left(\frac{|E(x)-F(x)|}{\sqrt{2G(x)(1-G(x))/n}}\right)^p$

It seems to me that there may be two issues: 1) the formula assumes the two sample sizes are the same; and 2) the approximation of the integral is not correctly calculated.

Let the sample sizes be n1 and n2, with corresponding ECDFs E and F in your notation; let n = n1 + n2 and let G be the ECDF of the joint sample. When p = 2,
$AD = \frac{n1\times n2}{n} \int (E(x)-F(x))^2 / (G(x)(1-G(x))) d G(x)$
see F. W. Scholz, M. A. Stephens, (1987) K-Sample Anderson-Darling Tests

Let x_i denote the data in the joint sample, then the integral should be approximated by
$\frac{1}{n} \sum_{i \in [n]} \frac{(E(x_i)-F(x_i))^2}{(G(x_i)(1-G(x_i)))}. $
Recall that there is an extra factor of $n1 \times n2 / n$; if you set $n1 = n2 = n/2$,
$AD = \frac{1}{4} \sum_{i \in [n]} \frac{(E(x_i)-F(x_i))^2}{(G(x_i)(1-G(x_i)))},$
which is different from your formula (extra $n$ is multiplied there).

Also, trying some simple datasets, the test statistic returned from ad_test appears to depend on the total sample size.

Please let me know if this makes sense, or if I am wrong.

Thanks!

Plotting defaults?

It would be nice if there was a good plot default we could create. Two obvious options arise -- plotting what the test stat measures, and plotting the test stat values.

Anderson Darling and DTS Weights?

Hi,

Hope all is well. I've found this package and its accompanying paper incredibly helpful, thank you. I implemented your ideas in some Python code to avoid hopping into R just to perform these tests. I'm confused by how you construct weights for the Anderson-Darling and DTS distance metrics. From my point of view, all your written work suggests the weights for both statistics should be: (Gcur*(1-Gcur)). Instead you are using the following weight: (n*Gcur*(1-Gcur))**0.5. I haven't thought through or tested whether the difference matters. I'm curious though why the difference exists. I would love to hear your reasoning, no worries if you don't have the time.

Thanks!

inconsistent behaviour that depends on nboots

For values of nboots > 65346, the functions return an object with various attributes attached to it, including all the bootstrap results. Below this value the functions only return the test score and p-value. I'm guessing this is unintended behaviour, but it would be nice to extend this to all values of nboots, as it makes it easier to make ggplot plots out of the results.

The keep.boots and keep.samples flags don't appear to have any impact on this. It also happens for all the functions included in this library: dts_test, cvm_test, ks_test.


order_cpp needs improvement

Hi,

I was looking into speeding up order_cpp but stumbled upon some seemingly incorrect behaviour: order_cpp() gives different results from order().

> library(twosamples)
> x = c(9, 7, 8)
> order(x)
[1] 2 3 1
> order_cpp(x) + 1              # add one to correct zero-based C++ indexing
[1] 3 1 2                       # expected order: 2 3 1
> 

My suggestion would be to use the STL implementation instead:

#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector order_cpp_stl(NumericVector x) {
  int n = x.size();
  IntegerVector ord(n);
  // Start with the identity permutation 0, 1, ..., n-1
  for (int i = 0; i < n; i++) {
    ord[i] = i;
  }
  // Sort indices using the values from x
  std::sort(ord.begin(), ord.end(),
            [&x](const int& a, const int& b) {
              return x[a] < x[b];
            });
  return ord;
}

which should not only be correct but also faster:

> z <- rnorm(10000)
> stopifnot(all.equal(order_cpp_stl(z) + 1, order(z)))       # see above order_cpp_stl()
> print(microbenchmark(order_cpp(z), order_cpp_stl(z)), unit="ms", signif=3)
Unit: milliseconds
             expr     min      lq        mean  median     uq     max neval
     order_cpp(z) 325.000 325.000 325.7930555 326.000 326.00 329.000   100
 order_cpp_stl(z)   0.547   0.551   0.5578429   0.556   0.56   0.733   100

I am sorry for bugging you again -- I didn't mean to distract you from your dissertation, but it would be great to verify the above behaviour and post a quick fix, if confirmed. After you are done with your thesis, we can revisit this and maybe talk about adding some unit tests with testthat, tinytest, or whatever is appropriate and/or lightweight enough.

Thank you!

Parallelization?

This seems like overkill? But for really large samples it could be useful.

On the other hand, you could let people do that for themselves. pval + nboots tells us exactly how many bootstrap values the observed statistic was bigger than, and we could run that same bit of code a hundred times across cores and calculate the resulting pval easily.

Seems like I could just explain that inside the manual. Let's not add a dependency on parallel or something ridiculous.
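For reference, a hedged sketch of that user-side approach (parallel_pval and its arguments are illustrative names, not package API; this assumes the test functions accept an nboots argument as discussed elsewhere in these issues):

library(parallel)
library(twosamples)

parallel_pval = function(vec1, vec2, n_runs = 8, nboots = 250) {
  # Run several smaller permutation runs in parallel
  # (mclapply forks, so on Windows parLapply would be needed instead)
  runs = mclapply(1:n_runs, function(i) ks_test(vec1, vec2, nboots = nboots),
                  mc.cores = min(n_runs, detectCores()))
  # Each run's p-value is roughly the share of its bootstraps at or above the
  # observed statistic, so averaging over equal-sized runs pools the counts
  mean(sapply(runs, function(r) r["P-Value"]))
}

This ignores the 1/(2*nboots) floor the package applies when no bootstraps exceed the observed statistic, so it is only an approximation.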

Duplicates Mistreated in `ad_stat` and `cvm_stat` -- FIXED

This is a retrospective issue. The problem has been dealt with in release v1.2.0.

This whole issue is derived from comments by Christian Fiedler (University of Toronto Stats). I greatly appreciate that assistance.

Summary

Technical challenges caused a code change that created a mismatch between the sampling routine and the test-statistic such that two test routines were over-powered. This has been fixed in the current release.

Basic problem

The CVM test and AD test both describe themselves as $\sum_{x \in k} |F(x)-E(x)|^p w(x)$ where $w(x)$ is some weighting function and $k$ is the joint sample.

However, in the prior release (v1.1.1), these two test stats were calculated as $\sum_{\text{unique } x \in k} |F(x)-E(x)|^p w(x)$. Adding the restriction to unique values of x was implicit, and quasi-deliberate (see appendix: "Why unique Vals").

Why is this an issue?

In a real sense, this is just a slightly different test statistic, and should not have a big impact. But, because the resampling routine creates many more duplicates (resampling with replacement), the bootstrapped null distribution was implicitly summing over fewer observations -- making for lower output values. This meant that when calculating a p-value by comparing the test statistic to the bootstrapped test statistics, we were comparing the test statistic to artificially deflated values. This in turn made the p-values more significant than they ought to have been.

This was a more substantial issue for smaller samples and for distributions that are non-discrete. Distributions with many pre-existing duplicates already have a limit on the number of summands and were relatively unaffected.

Fix

One proposed fix is to drop the equality check in the code. This causes other problems (see appendix: "Why unique Vals").

Instead, the actual fix creates a new counter that starts at 1 and increments for each duplicated value. When a value is finally added to the output, that value is multiplied by this duplicates counter -- thus counting the gap in ECDFs once for each observation of x.

Appendix: Why unique Vals?

Simply put, the code operates by running through sorted observations and adding to the output for each observation. Observations that are duplicates are tricky because, when the code is looking at them, the actual gap $|F(x)-E(x)|$ is not yet known -- it will only be known once all the duplicated values have been tallied. If we add to the sum at that time then there are two problems: 1) the code still isn't computing the stated test statistic, but a different near relative; 2) the order in which duplicates wind up being sorted can matter, which means that ad_test(vec1,vec2) will not equal ad_test(vec2,vec1). Thus the code deliberately dropped all duplicates but one from the sum. However, this caused the underweighting problem discussed above.

One-sided alternatives of stat functions?

Hi,

I wonder if it would be useful to add an option for one-sided statistic functions? That would be similar in spirit to the alternative parameter of the t.test function, with the two.sided option being the default.

There are about half a dozen places in the code where the sign of the difference between two CDFs is corrected:

    if (height < 0.0) {
      height = -1.0*height;
    }

whereas in the one-sided alternative the negative height would simply be set to zero.
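For instance, a one-sided variant of the README's ks_stat_R (a hypothetical sketch, not package code) might look like:

ks_stat_onesided_R = function(vec1, vec2, power = 1) {
  n1 = length(vec1); n2 = length(vec2); n = n1 + n2
  joint.sample = c(vec1, vec2)
  e = c(rep(1/n1, n1), rep(0, n2))
  f = c(rep(0, n1), rep(1/n2, n2))
  ind = order(joint.sample)
  d = joint.sample[ind]; e = e[ind]; f = f[ind]
  out = 0; Ecur = 0; Fcur = 0; height = 0
  for (i in 1:(n-1)) {
    Ecur = Ecur + e[i]
    Fcur = Fcur + f[i]
    # only count how far E exceeds F; negative gaps are set to zero
    if (d[i] != d[i+1]) height = max(Ecur - Fcur, 0)
    if (height > out) out = height
  }
  out^power
}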

Any thoughts?

Thanks!

combine.twosamples fails tests occasionally

Example:

CRAN Package Check Results for Package twosamples.pdf

This is an intermittent issue.
The cause seems to be the p-value rounding when no bootstraps exceed the observed test statistic. In that case the package defaults to a p-value of 1/(2*nboots).

In testing, this means a value of 1/4000 = 0.00025 =~ 0.0003 for each of the two sets of statistics being combined, which when averaged stays the same -- the 'expected' value above. For the combined output ("actual"), however, with no exceeding bootstraps, we get a value of 1/8000 = 0.000125 =~ 0.0001. These are different, so the test fails.

Possible solutions:

  1. rewrite the test to accommodate this situation: either look at the actual bootstrap values, or ditch the test when one count is 0.
  2. rewrite the underlying implementation so that the 1/(2*nboots) pvalue is what is printed but is not what is stored.
  3. remove the test

Plan: solution 1

Priority: HIGH

  • this is a small issue that is unlikely to affect a single user. In a real sense the behavior is as desired; the test was just overly simple
  • EXCEPT: that it causes CRAN tests to fail occasionally, which could get the package dumped.

Expect to resolve this by end of the month.
