scipy.stats.pearsonr¶
-
scipy.stats.
pearsonr
(x, y)[source]¶ Calculate a Pearson correlation coefficient and the p-value for testing non-correlation.
The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed, and not necessarily zero-mean. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
- Parameters
- x(N,) array_like
Input
- y(N,) array_like
Input
- Returns
- rfloat
Pearson’s correlation coefficient
- p-valuefloat
2-tailed p-value
Notes
The correlation coefficient is calculated as follows:
\[r_{pb} = \frac{\sum (x - m_x) (y - m_y)} {\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}\]where \(m_x\) is the mean of the vector \(x\) and \(m_y\) is the mean of the vector \(y\).
Under the assumption that x and y are drawn from independent normal distributions (so the population correlation coefficient is 0), the probability density function of the sample correlation coefficient r is ([1], [2]):
(1 - r**2)**(n/2 - 2) f(r) = --------------------- B(1/2, n/2 - 1)
where n is the number of samples, and B is the beta function. This is sometimes referred to as the exact distribution of r. This is the distribution that is used in
pearsonr
to compute the p-value. The distribution is a beta distribution on the interval [-1, 1], with equal shape parameters a = b = n/2 - 1. In terms of SciPy’s implementation of the beta distribution, the distribution of r is:dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
The p-value returned by
pearsonr
is a two-sided p-value. For a given sample with correlation coefficient r, the p-value is the probability that abs(r’) of a random sample x’ and y’ drawn from the population with zero correlation would be greater than or equal to abs(r). In terms of the objectdist
shown above, the p-value for a given r and length n can be computed as:p = 2*dist.cdf(-abs(r))
When n is 2, the above continuous distribution is not well-defined. One can interpret the limit of the beta distribution as the shape parameters a and b approach a = b = 0 as a discrete distribution with equal probability masses at r = 1 and r = -1. More directly, one can observe that, given the data x = [x1, x2] and y = [y1, y2], and assuming x1 != x2 and y1 != y2, the only possible values for r are 1 and -1. Because abs(r’) for any sample x’ and y’ with length 2 will be 1, the two-sided p-value for a sample of length 2 is always 1.
References
- 1
“Pearson correlation coefficient”, Wikipedia, https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
- 2
Student, “Probable error of a correlation coefficient”, Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.
Examples
>>> from scipy import stats >>> a = np.array([0, 0, 0, 1, 1, 1, 1]) >>> b = np.arange(7) >>> stats.pearsonr(a, b) (0.8660254037844386, 0.011724811003954654)
>>> stats.pearsonr([1,2,3,4,5], [5,6,7,8,7]) (0.83205029433784372, 0.080509573298498519)