scipy.stats.pearsonr

scipy.stats.pearsonr(x, y)
Pearson correlation coefficient and p-value for testing non-correlation.
The Pearson correlation coefficient [1] measures the linear relationship between two datasets. The calculation of the p-value relies on the assumption that each dataset is normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.
 Parameters
 x : (N,) array_like
Input array.
 y : (N,) array_like
Input array.
 Returns
 r : float
Pearson’s correlation coefficient.
 p-value : float
Two-tailed p-value.
 Warns
 PearsonRConstantInputWarning
Raised if an input is a constant array. The correlation coefficient is not defined in this case, so np.nan is returned.
 PearsonRNearConstantInputWarning
Raised if an input is “nearly” constant. The array x is considered nearly constant if norm(x - mean(x)) < 1e-13 * abs(mean(x)). Numerical errors in the calculation x - mean(x) in this case might result in an inaccurate calculation of r.
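The near-constant criterion stated above can be sketched in plain Python. This is only an illustration of the documented inequality, not SciPy's internal code: the helper name is_nearly_constant and its rel_tol parameter are hypothetical, and norm is assumed to mean the Euclidean norm.

```python
import math

def is_nearly_constant(x, rel_tol=1e-13):
    """Apply the documented test norm(x - mean(x)) < rel_tol * abs(mean(x)).

    Hypothetical helper for illustration; assumes the Euclidean norm.
    """
    mean = math.fsum(x) / len(x)
    # Euclidean norm of the centered data.
    norm = math.sqrt(math.fsum((xi - mean) ** 2 for xi in x))
    return norm < rel_tol * abs(mean)

print(is_nearly_constant([1.0, 1.0 + 1e-15, 1.0]))  # True
print(is_nearly_constant([1.0, 2.0, 3.0]))          # False
```

In the first call the spread of the data is on the order of floating-point rounding error, which is the situation in which the subtraction x - mean(x) loses most of its significant digits.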
See also
spearmanr
Spearman rank-order correlation coefficient.
kendalltau
Kendall’s tau, a correlation measure for ordinal data.
Notes
The correlation coefficient is calculated as follows:
\[r = \frac{\sum (x - m_x) (y - m_y)} {\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}\]
where \(m_x\) is the mean of the vector \(x\) and \(m_y\) is the mean of the vector \(y\).
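As a cross-check of the formula, here is a minimal standard-library sketch that evaluates it term by term. The function name pearson_r is an illustrative choice, and the code is a direct transcription of the formula rather than the routine SciPy uses internally.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation computed directly from the formula."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    m_x = math.fsum(x) / len(x)
    m_y = math.fsum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    num = math.fsum((xi - m_x) * (yi - m_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squares.
    ss_x = math.fsum((xi - m_x) ** 2 for xi in x)
    ss_y = math.fsum((yi - m_y) ** 2 for yi in y)
    if ss_x == 0 or ss_y == 0:
        return math.nan  # r is undefined for constant input
    return num / math.sqrt(ss_x * ss_y)

r = pearson_r([1, 2, 3, 4, 5], [10, 9, 2.5, 6, 4])
print(round(r, 6))  # -0.742611
```

For this data the deviations give a numerator of -15 and sums of squares of 10 and 40.8, so r = -15 / sqrt(408).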
Under the assumption that \(x\) and \(y\) are drawn from independent normal distributions (so the population correlation coefficient is 0), the probability density function of the sample correlation coefficient \(r\) is ([1], [2]):
\[f(r) = \frac{{(1-r^2)}^{n/2-2}}{\mathrm{B}(\frac{1}{2},\frac{n}{2}-1)}\]
where n is the number of samples, and B is the beta function. This is sometimes referred to as the exact distribution of r. This is the distribution that is used in pearsonr to compute the p-value. The distribution is a beta distribution on the interval [-1, 1], with equal shape parameters a = b = n/2 - 1. In terms of SciPy’s implementation of the beta distribution, the distribution of r is:
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
The p-value returned by pearsonr is a two-sided p-value. For a given sample with correlation coefficient r, the p-value is the probability that abs(r’) of a random sample x’ and y’ drawn from the population with zero correlation would be greater than or equal to abs(r). In terms of the object dist shown above, the p-value for a given r and length n can be computed as:
p = 2*dist.cdf(-abs(r))
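The two-sided p-value can also be approximated without SciPy by integrating the null density over the upper tail and doubling the result, since the density is symmetric about 0. The sketch below uses a midpoint rule as a stand-in for the beta CDF. The function name pearson_p_value and the steps parameter are hypothetical, and the approach is only meant for small n (accuracy degrades at n = 3, where the density is unbounded at the endpoints, and math.gamma overflows for very large n).

```python
import math

def pearson_p_value(r, n, steps=100_000):
    """Two-sided p-value from numerical integration of the null density.

    Hypothetical helper: a midpoint-rule substitute for
    2 * dist.cdf(-abs(r)), not the method SciPy itself uses.
    """
    if n < 3:
        raise ValueError("the continuous density requires n >= 3")
    # Beta function B(1/2, n/2 - 1) expressed through the gamma function.
    a, b = 0.5, n / 2 - 1
    beta_fn = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    lo = abs(r)
    h = (1.0 - lo) / steps
    tail = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * h  # midpoint of the i-th subinterval
        tail += (1.0 - t * t) ** (n / 2 - 2)
    tail *= h / beta_fn
    # By symmetry, the lower tail has the same probability as the upper one.
    return 2.0 * tail

print(round(pearson_p_value(0.8660254037844386, 7), 6))  # 0.011725
```

The printed value agrees with the p-value shown for the first example in the Examples section, up to the integration error.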
When n is 2, the above continuous distribution is not well-defined. One can interpret the limit of the beta distribution as the shape parameters a and b approach a = b = 0 as a discrete distribution with equal probability masses at r = 1 and r = -1. More directly, one can observe that, given the data x = [x1, x2] and y = [y1, y2], and assuming x1 != x2 and y1 != y2, the only possible values for r are 1 and -1. Because abs(r’) for any sample x’ and y’ with length 2 will be 1, the two-sided p-value for a sample of length 2 is always 1.
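The length-2 case is easy to check numerically. The sketch below evaluates the correlation formula on a few arbitrary two-point samples (the data and the helper name pearson_r_two_points are made up for illustration) and confirms that abs(r) is 1 each time, up to rounding.

```python
import math

def pearson_r_two_points(x, y):
    """Correlation formula specialised to samples of length 2."""
    m_x = (x[0] + x[1]) / 2
    m_y = (y[0] + y[1]) / 2
    num = sum((xi - m_x) * (yi - m_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - m_x) ** 2 for xi in x)
    ss_y = sum((yi - m_y) ** 2 for yi in y)
    return num / math.sqrt(ss_x * ss_y)

# Arbitrary samples with x1 != x2 and y1 != y2.
samples = [([1, 2], [3, 7]), ([1, 2], [7, 3]), ([0.5, -4], [10, 11])]
for x, y in samples:
    r = pearson_r_two_points(x, y)
    print(x, y, round(r))  # r rounds to 1 or -1 in every case
```

Since every length-2 sample yields abs(r’) = 1, the event abs(r’) >= abs(r) always occurs, which is why the p-value is 1.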
References
[1] “Pearson correlation coefficient”, Wikipedia, https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
[2] Student, “Probable error of a correlation coefficient”, Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.
[3] C. J. Kowalski, “On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient”, Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 21, No. 1 (1972), pp. 1-12.
Examples
>>> import numpy as np
>>> from scipy import stats
>>> a = np.array([0, 0, 0, 1, 1, 1, 1])
>>> b = np.arange(7)
>>> stats.pearsonr(a, b)
(0.8660254037844386, 0.011724811003954649)

>>> stats.pearsonr([1, 2, 3, 4, 5], [10, 9, 2.5, 6, 4])
(-0.7426106572325057, 0.1505558088534455)