scipy.stats.bootstrap#
- scipy.stats.bootstrap(data, statistic, *, n_resamples=9999, batch=None, vectorized=True, paired=False, axis=0, confidence_level=0.95, method='BCa', random_state=None)[source]#
- Compute a two-sided bootstrap confidence interval of a statistic. - When method is - 'percentile', a bootstrap confidence interval is computed according to the following procedure.- Resample the data: for each sample in data and for each of n_resamples, take a random sample of the original sample (with replacement) of the same size as the original sample. 
- Compute the bootstrap distribution of the statistic: for each set of resamples, compute the test statistic. 
- Determine the confidence interval: find the interval of the bootstrap distribution that is - symmetric about the median and 
- contains confidence_level of the resampled statistic values. 
 
 - While the - 'percentile'method is the most intuitive, it is rarely used in practice. Two more common methods are available,- 'basic'(‘reverse percentile’) and- 'BCa'(‘bias-corrected and accelerated’); they differ in how step 3 is performed.- If the samples in data are taken at random from their respective distributions \(n\) times, the confidence interval returned by - bootstrapwill contain the true value of the statistic for those distributions approximately confidence_level\(\, \times \, n\) times.- Parameters
- datasequence of array-like
- Each element of data is a sample from an underlying distribution. 
- statisticcallable
- Statistic for which the confidence interval is to be calculated. statistic must be a callable that accepts - len(data)samples as separate arguments and returns the resulting statistic. If vectorized is set- True, statistic must also accept a keyword argument axis and be vectorized to compute the statistic along the provided axis.
- n_resamplesint, default: 9999
- The number of resamples performed to form the bootstrap distribution of the statistic. 
- batchint, optional
- The number of resamples to process in each vectorized call to statistic. Memory usage is O(batch`*``n`), where - nis the sample size. Default is- None, in which case- batch = n_resamples(or- batch = max(n_resamples, n)for- method='BCa').
- vectorizedbool, default: True
- If vectorized is set - False, statistic will not be passed keyword argument axis, and is assumed to calculate the statistic only for 1D samples.
- pairedbool, default: False
- Whether the statistic treats corresponding elements of the samples in data as paired. 
- axisint, default: 0
- The axis of the samples in data along which the statistic is calculated. 
- confidence_levelfloat, default: 0.95
- The confidence level of the confidence interval. 
- method{‘percentile’, ‘basic’, ‘bca’}, default: 'BCa'
- Whether to return the ‘percentile’ bootstrap confidence interval ( - 'percentile'), the ‘reverse’ or the bias-corrected and accelerated bootstrap confidence interval (- 'BCa'). Note that only- 'percentile'and- 'basic'support multi-sample statistics at this time.
- random_state{None, int, numpy.random.Generator,
- numpy.random.RandomState}, optional- Pseudorandom number generator state used to generate resamples. - If random_state is - None(or np.random), the- numpy.random.RandomStatesingleton is used. If random_state is an int, a new- RandomStateinstance is used, seeded with random_state. If random_state is already a- Generatoror- RandomStateinstance then that instance is used.
 
- Returns
- resBootstrapResult
- An object with attributes: - confidence_intervalConfidenceInterval
- The bootstrap confidence interval as an instance of - collections.namedtuplewith attributes low and high.
- standard_errorfloat or ndarray
- The bootstrap standard error, that is, the sample standard deviation of the bootstrap distribution 
 
 
- Warns
- DegenerateDataWarning
- Generated when - method='BCa'and the bootstrap distribution is degenerate (e.g. all elements are identical).
 
 - Notes - Elements of the confidence interval may be NaN for - method='BCa'if the bootstrap distribution is degenerate (e.g. all elements are identical). In this case, consider using another method or inspecting data for indications that other analysis may be more appropriate (e.g. all observations are identical).- References - 1
- B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall/CRC, Boca Raton, FL, USA (1993) 
- 2
- Nathaniel E. Helwig, “Bootstrap Confidence Intervals”, http://users.stat.umn.edu/~helwig/notes/bootci-Notes.pdf 
- 3
- Bootstrapping (statistics), Wikipedia, https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29 
 - Examples - Suppose we have sampled data from an unknown distribution. - >>> import numpy as np >>> rng = np.random.default_rng() >>> from scipy.stats import norm >>> dist = norm(loc=2, scale=4) # our "unknown" distribution >>> data = dist.rvs(size=100, random_state=rng) - We are interested int the standard deviation of the distribution. - >>> std_true = dist.std() # the true value of the statistic >>> print(std_true) 4.0 >>> std_sample = np.std(data) # the sample statistic >>> print(std_sample) 3.9460644295563863 - We can calculate a 90% confidence interval of the statistic using - bootstrap.- >>> from scipy.stats import bootstrap >>> data = (data,) # samples must be in a sequence >>> res = bootstrap(data, np.std, confidence_level=0.9, ... random_state=rng) >>> print(res.confidence_interval) ConfidenceInterval(low=3.57655333533867, high=4.382043696342881) - If we sample from the distribution 1000 times and form a bootstrap confidence interval for each sample, the confidence interval contains the true value of the statistic approximately 900 times. - >>> n_trials = 1000 >>> ci_contains_true_std = 0 >>> for i in range(n_trials): ... data = (dist.rvs(size=100, random_state=rng),) ... ci = bootstrap(data, np.std, confidence_level=0.9, n_resamples=1000, ... random_state=rng).confidence_interval ... if ci[0] < std_true < ci[1]: ... ci_contains_true_std += 1 >>> print(ci_contains_true_std) 875 - Rather than writing a loop, we can also determine the confidence intervals for all 1000 samples at once. - >>> data = (dist.rvs(size=(n_trials, 100), random_state=rng),) >>> res = bootstrap(data, np.std, axis=-1, confidence_level=0.9, ... n_resamples=1000, random_state=rng) >>> ci_l, ci_u = res.confidence_interval - Here, ci_l and ci_u contain the confidence interval for each of the - n_trials = 1000samples.- >>> print(ci_l[995:]) [3.77729695 3.75090233 3.45829131 3.34078217 3.48072829] >>> print(ci_u[995:]) [4.88316666 4.86924034 4.32032996 4.2822427 4.59360598] - And again, approximately 90% contain the true value, - std_true = 4.- >>> print(np.sum((ci_l < std_true) & (std_true < ci_u))) 900 - bootstrapcan also be used to estimate confidence intervals of multi-sample statistics, including those calculated by hypothesis tests.- scipy.stats.moodperform’s Mood’s test for equal scale parameters, and it returns two outputs: a statistic, and a p-value. To get a confidence interval for the test statistic, we first wrap- scipy.stats.moodin a function that accepts two sample arguments, accepts an axis keyword argument, and returns only the statistic.- >>> from scipy.stats import mood >>> def my_statistic(sample1, sample2, axis): ... statistic, _ = mood(sample1, sample2, axis=-1) ... return statistic - Here, we use the ‘percentile’ method with the default 95% confidence level. - >>> sample1 = norm.rvs(scale=1, size=100, random_state=rng) >>> sample2 = norm.rvs(scale=2, size=100, random_state=rng) >>> data = (sample1, sample2) >>> res = bootstrap(data, my_statistic, method='basic', random_state=rng) >>> print(mood(sample1, sample2)[0]) # element 0 is the statistic -5.521109549096542 >>> print(res.confidence_interval) ConfidenceInterval(low=-7.255994487314675, high=-4.016202624747605) - The bootstrap estimate of the standard error is also available. - >>> print(res.standard_error) 0.8344963846318795 - Paired-sample statistics work, too. For example, consider the Pearson correlation coefficient. - >>> from scipy.stats import pearsonr >>> n = 100 >>> x = np.linspace(0, 10, n) >>> y = x + rng.uniform(size=n) >>> print(pearsonr(x, y)[0]) # element 0 is the statistic 0.9962357936065914 - We wrap - pearsonrso that it returns only the statistic.- >>> def my_statistic(x, y): ... return pearsonr(x, y)[0] - We call - bootstrapusing- paired=True. Also, since- my_statisticisn’t vectorized to calculate the statistic along a given axis, we pass in- vectorized=False.- >>> res = bootstrap((x, y), my_statistic, vectorized=False, paired=True, ... random_state=rng) >>> print(res.confidence_interval) ConfidenceInterval(low=0.9950085825848624, high=0.9971212407917498)