SciPy

scipy.stats.ks_2samp

scipy.stats.ks_2samp(data1, data2, alternative='two-sided', mode='auto')

Compute the Kolmogorov-Smirnov statistic on 2 samples.

This tests the null hypothesis that 2 independent samples are drawn from the same continuous distribution. The alternative hypothesis can be ‘two-sided’ (default), ‘less’, or ‘greater’.

Parameters
data1, data2 : array_like, 1-dimensional

Two arrays of sample observations assumed to be drawn from a continuous distribution. The sample sizes can be different.

alternative : {‘two-sided’, ‘less’, ‘greater’}, optional

Defines the alternative hypothesis. The following options are available (default is ‘two-sided’):

  • ‘two-sided’

  • ‘less’: one-sided, see explanation in Notes

  • ‘greater’: one-sided, see explanation in Notes

mode : {‘auto’, ‘exact’, ‘asymp’}, optional

Defines the method used for calculating the p-value. The following options are available (default is ‘auto’):

  • ‘auto’ : use ‘exact’ for small size arrays, ‘asymp’ for large

  • ‘exact’ : use exact distribution of test statistic

  • ‘asymp’ : use asymptotic distribution of test statistic

Returns
statistic : float

KS statistic.

pvalue : float

p-value of the test: two-tailed for ‘two-sided’, one-tailed for ‘less’ or ‘greater’.

Notes

This tests whether 2 samples are drawn from the same distribution. Note that, like in the case of the one-sample KS test, the distribution is assumed to be continuous.
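
As a minimal sketch of what the two-sided statistic measures, the snippet below evaluates both empirical CDFs at every pooled observation and takes the largest absolute gap. The helper name `ks_statistic_by_hand`, the seed, and the sample parameters are illustrative assumptions and not part of SciPy; the result is compared against `ks_2samp` only as a sanity check.

```python
import numpy as np
from scipy import stats


def ks_statistic_by_hand(data1, data2):
    """Largest absolute gap between the two empirical CDFs (hypothetical helper)."""
    x1 = np.sort(np.asarray(data1, dtype=float))
    x2 = np.sort(np.asarray(data2, dtype=float))
    pooled = np.concatenate([x1, x2])
    # Empirical CDF at t = fraction of observations that are <= t.
    ecdf1 = np.searchsorted(x1, pooled, side="right") / x1.size
    ecdf2 = np.searchsorted(x2, pooled, side="right") / x2.size
    return float(np.max(np.abs(ecdf1 - ecdf2)))


rng = np.random.RandomState(0)  # arbitrary seed, for repeatability only
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=0.5, scale=1.5, size=300)

d_manual = ks_statistic_by_hand(sample_a, sample_b)
d_scipy = stats.ks_2samp(sample_a, sample_b).statistic
print(d_manual, d_scipy)  # the two values should agree
```

Evaluating the empirical CDFs only at the observed points is enough, because both are step functions that change value only there.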

In the one-sided test, the alternative is that the empirical cumulative distribution function F(x) of the data1 variable is “less” or “greater” than the empirical cumulative distribution function G(x) of the data2 variable: F(x) <= G(x) or F(x) >= G(x), respectively.
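
The one-sided statistics can be sketched the same way. The example assumes that the ‘less’ statistic is the largest amount by which F falls below G and the ‘greater’ statistic the largest amount by which F exceeds G, consistent with the description above. The variable names, seed, and shift of 1.0 are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(1)  # arbitrary seed, for repeatability only
data1 = rng.normal(loc=1.0, size=100)  # shifted right, so F tends to lie below G
data2 = rng.normal(loc=0.0, size=100)

x1, x2 = np.sort(data1), np.sort(data2)
pooled = np.concatenate([x1, x2])
F = np.searchsorted(x1, pooled, side="right") / x1.size  # empirical CDF of data1
G = np.searchsorted(x2, pooled, side="right") / x2.size  # empirical CDF of data2

d_greater = np.max(F - G)  # largest amount by which F exceeds G
d_less = np.max(G - F)     # largest amount by which F falls below G

res_less = stats.ks_2samp(data1, data2, alternative="less")
res_greater = stats.ks_2samp(data1, data2, alternative="greater")
print("less:   ", res_less.statistic, res_less.pvalue)
print("greater:", res_greater.statistic, res_greater.pvalue)
```

Because data1 is shifted toward larger values, its empirical CDF sits mostly below that of data2, so the ‘less’ alternative should give the smaller p-value here.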

If the KS statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same.

If the mode is ‘auto’, the computation is exact if the sample sizes are less than 10000. For larger sizes, the computation uses the Kolmogorov-Smirnov distributions to compute an approximate value.
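
A minimal sketch comparing the three modes on small samples follows. The sample sizes, seed, and variable names are illustrative assumptions. The arguments are passed positionally in the order of the signature above (data1, data2, alternative, mode) so that the example does not depend on the keyword name.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(2)  # arbitrary seed, for repeatability only
small_a = rng.normal(size=30)
small_b = rng.normal(loc=0.3, size=40)

# Positional order: data1, data2, alternative, mode.
res_auto = stats.ks_2samp(small_a, small_b, "two-sided", "auto")
res_exact = stats.ks_2samp(small_a, small_b, "two-sided", "exact")
res_asymp = stats.ks_2samp(small_a, small_b, "two-sided", "asymp")

# The statistic does not depend on the mode; only the p-value computation does.
print("statistic:", res_exact.statistic)
print("auto  p-value:", res_auto.pvalue)
print("exact p-value:", res_exact.pvalue)
print("asymp p-value:", res_asymp.pvalue)
```

With samples this small, ‘auto’ should select the exact computation, so its p-value is expected to match the ‘exact’ one, while the asymptotic value may differ somewhat.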

The ‘two-sided’ ‘exact’ computation computes the complementary probability and then subtracts from 1. As such, the minimum probability it can return is about 1e-16. While the algorithm itself is exact, numerical errors may accumulate for large sample sizes. It is most suited to situations in which one of the sample sizes is only a few thousand.
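
The floor of about 1e-16 follows from double-precision arithmetic rather than from the test itself. The sketch below is not SciPy's code; it only illustrates, with a hypothetical p-value of 1e-20, how forming the complement first and then subtracting from 1 discards anything much smaller than machine epsilon.

```python
import numpy as np

eps = np.finfo(np.float64).eps  # spacing between 1.0 and the next larger double
true_p = 1e-20                  # hypothetical, very small p-value
complement = 1.0 - true_p       # rounds to exactly 1.0 in double precision
recovered_p = 1.0 - complement  # the small value has been lost

print(eps, complement == 1.0, recovered_p)
```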

We generally follow Hodges’ treatment of Drion/Gnedenko/Korolyuk [1].

References

[1] Hodges, J.L. Jr., “The Significance Probability of the Smirnov Two-Sample Test,” Arkiv för Matematik, 3, No. 43 (1958), 469-86.

Examples

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)  # fix random seed to get the same result
>>> n1 = 200  # size of first sample
>>> n2 = 300  # size of second sample

For a different distribution, we can reject the null hypothesis since the p-value is below 1%:

>>> rvs1 = stats.norm.rvs(size=n1, loc=0., scale=1)
>>> rvs2 = stats.norm.rvs(size=n2, loc=0.5, scale=1.5)
>>> stats.ks_2samp(rvs1, rvs2)
(0.20833333333333334, 5.129279597781977e-05)

For a slightly different distribution, we cannot reject the null hypothesis at a 10% or lower alpha since the p-value, about 0.147, is higher than 10%:

>>> rvs3 = stats.norm.rvs(size=n2, loc=0.01, scale=1.0)
>>> stats.ks_2samp(rvs1, rvs3)
(0.10333333333333333, 0.14691437867433876)

For an identical distribution, we cannot reject the null hypothesis since the p-value is high, 41%:

>>> rvs4 = stats.norm.rvs(size=n2, loc=0.0, scale=1.0)
>>> stats.ks_2samp(rvs1, rvs4)
(0.07999999999999996, 0.41126949729859719)
