Performs a two-sample multivariate Kolmogorov-Smirnov test as described by Fasano and Franceschini (1987). This test evaluates the null hypothesis that two i.i.d. random samples were drawn from the same underlying probability distribution. The data can be of any dimension and of any type (continuous, discrete, or mixed).
fasano.franceschini.test(
  S1,
  S2,
  nPermute = 100,
  threads = 1,
  seed = NULL,
  verbose = TRUE,
  method = c("r", "b")
)
S1: A matrix or data.frame. Each row represents one observation.
S2: A matrix or data.frame. Each row represents one observation.
nPermute: A nonnegative integer setting the number of permutations to use for performing the permutation test. Default is 100. If set to 0, only the test statistic is computed.
threads: A positive integer or "auto" setting the number of threads to use during the permutation test. If set to "auto", the number of threads is determined by RcppParallel::defaultNumThreads(). Default is 1.
seed: An optional integer to seed the PRNG used for the permutation test. A seed must be passed to reproducibly compute p-values.
verbose: A boolean indicating whether to display a progress bar. Default is TRUE. The progress bar is only available when threads = 1.
method: An optional character indicating which method to use to compute the test statistic. The two methods are 'r' (range tree) and 'b' (brute force). Both methods return the same results but may vary in computation speed. If this argument is not passed, the sample sizes and dimension of the data are used to infer which method is likely faster. See the Details section for more information.
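To make the roles of nPermute and seed more concrete, the following is a minimal base R sketch of a generic two-sample permutation test. It is not the package's implementation: the statistic stat_fun and the helper perm_test are hypothetical stand-ins, the p-value formula shown is one common convention and may differ from the one the package uses, and the package seeds its own PRNG rather than R's. The sketch only illustrates why more permutations give a finer p-value and why a fixed seed makes the result repeatable.

```r
# Generic two-sample permutation test (illustrative only; NOT the package's code).
# stat_fun is a hypothetical stand-in statistic: the summed absolute difference
# of the column means of the two samples.
stat_fun <- function(A, B) sum(abs(colMeans(A) - colMeans(B)))

perm_test <- function(S1, S2, nPermute = 100, seed = NULL) {
  S1 <- as.matrix(S1)
  S2 <- as.matrix(S2)
  # Fixing the seed makes the shuffles, and hence the p-value, reproducible.
  # Note that this resets R's global RNG state.
  if (!is.null(seed)) set.seed(seed)

  observed <- stat_fun(S1, S2)
  # With zero permutations only the statistic is computed (assumption of this
  # sketch: the p-value is reported as NA in that case).
  if (nPermute == 0) return(list(statistic = observed, p.value = NA_real_))

  pooled <- rbind(S1, S2)
  n1 <- nrow(S1)
  n <- nrow(pooled)
  perm_stats <- replicate(nPermute, {
    idx <- sample(n)  # random relabelling of the pooled observations
    stat_fun(pooled[idx[1:n1], , drop = FALSE],
             pooled[idx[(n1 + 1):n], , drop = FALSE])
  })

  # One common permutation p-value convention; never exactly zero.
  p <- (1 + sum(perm_stats >= observed)) / (1 + nPermute)
  list(statistic = observed, p.value = p)
}

set.seed(1)
A <- matrix(rnorm(40), ncol = 2)
B <- matrix(rnorm(80), ncol = 2)

r1 <- perm_test(A, B, nPermute = 200, seed = 0)
r2 <- perm_test(A, B, nPermute = 200, seed = 0)
identical(r1$p.value, r2$p.value)  # same seed, same p-value
```

Without a seed, two calls would generally shuffle differently and return slightly different p-values, which matches the variation seen across unseeded calls in the examples.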
A list of class htest containing the following components:
statistic: The value of the test statistic.
p.value: The permutation test p-value.
method: The name of the test.
data.name: The names of the original data objects.
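As a brief sketch of working with the returned object, the snippet below stores the result and inspects it. It assumes the package is installed. Only the p.value component is confirmed by the examples on this page; the name statistic is assumed to follow the usual htest convention, so names(res) is printed first to show what is actually present.

```r
library(fasano.franceschini.test)

set.seed(0)
S1 <- matrix(rnorm(40), ncol = 2)   # 20 observations in 2 dimensions
S2 <- matrix(rnorm(80), ncol = 2)   # 40 observations in 2 dimensions

# verbose = FALSE suppresses the progress bar; seed fixes the p-value
res <- fasano.franceschini.test(S1, S2, seed = 0, verbose = FALSE)

class(res)      # the class of the returned list
names(res)      # components actually present in the result
res$p.value     # permutation test p-value
res$statistic   # test statistic (component name assumed, see above)
```

Storing the result rather than printing it is convenient when the p-value feeds into later code, for example a multiple-testing correction.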
The test statistic can be computed using two different methods. Both methods return identical results, but have different time complexities:
Range tree method: This method has a time complexity of O(N*log(N)^(d-1)), where N is the size of the larger sample and d is the dimension of the data.
Brute force method: This method has a time complexity of O(N^2).
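As a rough illustration of how the two bounds above compare, the following base R sketch evaluates them for a few sample sizes and dimensions. It assumes natural logarithms and ignores constant factors, so the numbers only indicate trends and say nothing about actual running times; the helper names are hypothetical.

```r
# Evaluate the stated asymptotic bounds, ignoring constant factors.
range_tree_cost  <- function(N, d) N * log(N)^(d - 1)  # O(N*log(N)^(d-1))
brute_force_cost <- function(N) N^2                    # O(N^2)

grid <- expand.grid(N = c(50, 1000, 100000), d = c(2, 5, 10))
grid$range_tree  <- range_tree_cost(grid$N, grid$d)
grid$brute_force <- brute_force_cost(grid$N)
grid$smaller_bound <- ifelse(grid$range_tree < grid$brute_force,
                             "range tree", "brute force")
print(grid)
```

In this table the range tree bound is smaller for d = 2 at every N, while for d = 10 the brute force bound is smaller even at N = 100000, because the log(N)^(d-1) factor grows quickly with the dimension.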
The range tree method tends to be faster for low dimensional data or large sample sizes, while the brute force method tends to be faster for high dimensional data or small sample sizes. When method is not passed, the sample sizes and dimension of the data are used to infer which method will likely be faster. However, as the geometry of the samples can influence computation time, the method inferred to be faster may not actually be faster. To perform more comprehensive benchmarking for a specific dataset, nPermute can be set equal to 0, which bypasses the permutation test and only computes the test statistic.
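The benchmarking approach described above can be sketched as follows, assuming the package is installed. The sample sizes and dimension are arbitrary choices for illustration, and the statistic component name is assumed to follow the usual htest convention. Timings depend on the machine and the data, so no expected values are shown.

```r
library(fasano.franceschini.test)

set.seed(0)
S1 <- matrix(rnorm(2000 * 3), ncol = 3)  # 2000 observations in 3 dimensions
S2 <- matrix(rnorm(2000 * 3), ncol = 3)

# nPermute = 0 skips the permutation test, so only the statistic is timed.
t_range <- system.time(
  res_r <- fasano.franceschini.test(S1, S2, nPermute = 0, method = "r")
)
t_brute <- system.time(
  res_b <- fasano.franceschini.test(S1, S2, nPermute = 0, method = "b")
)

# Elapsed wall-clock time in seconds for each method
c(range_tree = t_range[["elapsed"]], brute_force = t_brute[["elapsed"]])

# Both methods are documented to return the same statistic
all.equal(unname(res_r$statistic), unname(res_b$statistic))
```

A single system.time() call is noisy; repeating each call several times and comparing medians gives a steadier picture before fixing method for a long permutation run.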
Fasano, G. & Franceschini, A. (1987). A multidimensional version of the Kolmogorov-Smirnov test. Monthly Notices of the Royal Astronomical Society, 225:155-170. doi:10.1093/mnras/225.1.155.
set.seed(0)
# create 2-D samples
S1 <- data.frame(x = rnorm(n = 20, mean = 0, sd = 1),
                 y = rnorm(n = 20, mean = 1, sd = 2))
S2 <- data.frame(x = rnorm(n = 40, mean = 0, sd = 1),
                 y = rnorm(n = 40, mean = 1, sd = 2))
# perform test
fasano.franceschini.test(S1, S2)
#>
#> Fasano-Franceschini Test
#>
#> data: S1 and S2
#> D = 280, p-value = 0.961
#>
# perform test with more permutations
fasano.franceschini.test(S1, S2, nPermute = 150)
#>
#> Fasano-Franceschini Test
#>
#> data: S1 and S2
#> D = 280, p-value = 0.9648
#>
# set seed for reproducible p-value
fasano.franceschini.test(S1, S2, seed = 0)$p.value
#> p-value
#> 0.9761668
fasano.franceschini.test(S1, S2, seed = 0)$p.value
#> p-value
#> 0.9761668
# perform test using range tree method
fasano.franceschini.test(S1, S2, method = 'r')
#>
#> Fasano-Franceschini Test
#>
#> data: S1 and S2
#> D = 280, p-value = 0.9291
#>
# perform test using brute force method
fasano.franceschini.test(S1, S2, method = 'b')
#>
#> Fasano-Franceschini Test
#>
#> data: S1 and S2
#> D = 280, p-value = 0.9594
#>
# perform test using multiple threads to speed up p-value computation
if (FALSE) {
  fasano.franceschini.test(S1, S2, threads = 2)
}