scipy.stats.pearsonr

scipy.stats.pearsonr(x, y)
Calculate a Pearson correlation coefficient and the p-value for testing non-correlation.
The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed, though not necessarily zero-mean. Like other correlation coefficients, this one varies between -1 and +1, with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
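The meaning of the p-value can be illustrated with a small simulation. The following is a minimal sketch, not part of the library: it assumes both samples come from independent standard normal distributions, and the seed, the trial count `n_trials`, and all variable names are illustrative choices. It estimates how often uncorrelated data of the same length produce a correlation at least as extreme as the observed one.

```python
import numpy as np
from scipy import stats

# Observed data (same values as the second example in the Examples section).
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([5, 6, 7, 8, 7], dtype=float)
r_obs, p_exact = stats.pearsonr(x, y)

# Draw many pairs of independent (hence uncorrelated) normal samples.
rng = np.random.default_rng(12345)  # arbitrary seed, for repeatability
n = len(x)
n_trials = 100_000                  # illustrative choice
xs = rng.standard_normal((n_trials, n))
ys = rng.standard_normal((n_trials, n))

# Row-wise Pearson correlation of each simulated pair.
xs -= xs.mean(axis=1, keepdims=True)
ys -= ys.mean(axis=1, keepdims=True)
r_sim = (xs * ys).sum(axis=1) / np.sqrt((xs**2).sum(axis=1) * (ys**2).sum(axis=1))

# Fraction of simulated correlations at least as extreme as the observed one.
p_sim = np.mean(np.abs(r_sim) >= abs(r_obs))
print(p_sim, p_exact)
```

The two printed numbers should agree to within Monte Carlo error, which shrinks as `n_trials` grows.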
Parameters
    x : (N,) array_like
        Input
    y : (N,) array_like
        Input
Returns
    r : float
        Pearson’s correlation coefficient
    p-value : float
        2-tailed p-value
Notes
The correlation coefficient is calculated as follows:
\[r = \frac{\sum (x - m_x) (y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}\]
where \(m_x\) is the mean of the vector \(x\) and \(m_y\) is the mean of the vector \(y\).
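The formula can be checked directly with NumPy. This is a sketch for illustration only: the helper name `pearson_r` is hypothetical and this is not necessarily how the library computes the coefficient internally. Each deviation is taken from its sample mean, i.e. `x - m_x` and `y - m_y`.

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """Sample Pearson correlation, computed term by term from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm = x - x.mean()   # deviations of x from its mean m_x
    ym = y - y.mean()   # deviations of y from its mean m_y
    return np.sum(xm * ym) / np.sqrt(np.sum(xm**2) * np.sum(ym**2))

x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 7]

r_manual = pearson_r(x, y)
r_scipy, _ = stats.pearsonr(x, y)
print(r_manual, r_scipy)   # both approximately 0.83205
```

For these data the numerator is 6 and the two sums of squares are 10 and 5.2, so r = 6 / sqrt(52).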
Under the assumption that x and y are drawn from independent normal distributions (so the population correlation coefficient is 0), the probability density function of the sample correlation coefficient r is ([1], [2]):
\[f(r) = \frac{(1 - r^2)^{n/2 - 2}}{B\left(\frac{1}{2}, \frac{n}{2} - 1\right)}\]
where n is the number of samples, and B is the beta function. This is sometimes referred to as the exact distribution of r. This is the distribution that is used in pearsonr to compute the p-value. The distribution is a beta distribution on the interval [-1, 1], with equal shape parameters a = b = n/2 - 1. In terms of SciPy’s implementation of the beta distribution, the distribution of r is:
    dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
The p-value returned by pearsonr is a two-sided p-value. For a given sample with correlation coefficient r, the p-value is the probability that abs(r’) of a random sample x’ and y’ drawn from the population with zero correlation would be greater than or equal to abs(r). In terms of the object dist shown above, the p-value for a given r and length n can be computed as:
    p = 2*dist.cdf(-abs(r))
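As a check on the description above, the p-value can be recomputed by hand from the beta distribution. This is a minimal sketch with illustrative variable names. The distribution is shifted and scaled to the interval [-1, 1] with `loc=-1, scale=2`, and the two-sided p-value is twice the lower-tail probability at `-abs(r)`.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([5, 6, 7, 8, 7], dtype=float)
n = len(x)

# Correlation coefficient and p-value as returned by pearsonr.
r, p = stats.pearsonr(x, y)

# Exact null distribution of r: a beta distribution on [-1, 1]
# with equal shape parameters a = b = n/2 - 1.
dist = stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)

# Two-sided p-value: twice the probability mass below -|r|.
p_manual = 2 * dist.cdf(-abs(r))
print(p, p_manual)
```

Because the distribution is symmetric about zero, `2 * dist.sf(abs(r))` gives the same result.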
When n is 2, the above continuous distribution is not well-defined. One can interpret the limit of the beta distribution as the shape parameters a and b approach a = b = 0 as a discrete distribution with equal probability masses at r = 1 and r = -1. More directly, one can observe that, given the data x = [x1, x2] and y = [y1, y2], and assuming x1 != x2 and y1 != y2, the only possible values for r are 1 and -1. Because abs(r’) for any sample x’ and y’ with length 2 will be 1, the two-sided p-value for a sample of length 2 is always 1.
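The length-2 case is easy to try out. In this small sketch the data values are arbitrary; any two points with distinct x and distinct y values should behave the same way, with only the sign of r depending on the data.

```python
from scipy import stats

# Two points with decreasing y: the line through them has negative slope.
r, p = stats.pearsonr([1.0, 2.0], [3.0, 1.0])
print(r, p)   # r is -1 and the p-value is 1
```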
References
[1] “Pearson correlation coefficient”, Wikipedia, https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
[2] Student, “Probable error of a correlation coefficient”, Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.
Examples
>>> import numpy as np
>>> from scipy import stats
>>> a = np.array([0, 0, 0, 1, 1, 1, 1])
>>> b = np.arange(7)
>>> stats.pearsonr(a, b)
(0.8660254037844386, 0.011724811003954654)

>>> stats.pearsonr([1,2,3,4,5], [5,6,7,8,7])
(0.83205029433784372, 0.080509573298498519)