scipy.linalg.clarkson_woodruff_transform

scipy.linalg.clarkson_woodruff_transform(input_matrix, sketch_size, seed=None)

Applies a Clarkson-Woodruff Transform/sketch to the input matrix.
Given an input_matrix A of size (n, d), compute a matrix A' of size (sketch_size, d) so that

\[\|Ax\| \approx \|A'x\|\]

with high probability via the Clarkson-Woodruff Transform, otherwise known as the CountSketch matrix.
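For intuition, a CountSketch can be realized explicitly as a sparse (sketch_size, n) matrix S with exactly one nonzero entry, ±1, per column, so that the sketch is A' = S @ A. A minimal illustrative construction (the names S, rows, and signs are hypothetical and not part of the scipy API):

>>> import numpy as np
>>> from scipy import sparse
>>> n, d, k = 1000, 20, 100
>>> rows = np.random.randint(0, k, size=n)     # hash each of the n columns of S to one row
>>> signs = np.random.choice([-1, 1], size=n)  # independent random sign per column
>>> S = sparse.csc_matrix((signs, (rows, np.arange(n))), shape=(k, n))
>>> A = np.random.randn(n, d)
>>> A_prime = S @ A   # has the same distribution as a CountSketch sketch of A
>>> A_prime.shape
(100, 20)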
Parameters

input_matrix : array_like
    Input matrix, of shape (n, d).
sketch_size : int
    Number of rows for the sketch.
seed : None or int or numpy.random.mtrand.RandomState instance, optional
    This parameter defines the RandomState object to use for drawing random
    variates. If None (or np.random), the global np.random state is used. If
    integer, it is used to seed the local RandomState instance. Default is None.
Returns

A' : array_like
    Sketch of the input matrix A, of size (sketch_size, d).
Notes

To make the statement

\[\|Ax\| \approx \|A'x\|\]

precise, observe the following result, which is adapted from the proof of Theorem 14 of [2] via Markov's Inequality. If we have a sketch size sketch_size = k which is at least

\[k \geq \frac{2}{\epsilon^2\delta}\]

then for any fixed vector x,

\[\|Ax\| = (1\pm\epsilon)\|A'x\|\]

with probability at least \(1 - \delta\).
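For example, with accuracy ε = 0.1 and failure probability δ = 0.05 (values chosen here purely for illustration), the bound above gives the required sketch size:

>>> eps, delta = 0.1, 0.05
>>> k = 2 / (eps ** 2 * delta)   # minimum sketch size from the bound above
>>> int(round(k))
4000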
This implementation takes advantage of sparsity: computing a sketch takes time proportional to A.nnz. Data A which is in scipy.sparse.csc_matrix format gives the quickest computation time for sparse input.

>>> import numpy as np
>>> from scipy import linalg
>>> from scipy import sparse
>>> n_rows, n_columns, density, sketch_n_rows = 15000, 100, 0.01, 200
>>> A = sparse.rand(n_rows, n_columns, density=density, format='csc')
>>> B = sparse.rand(n_rows, n_columns, density=density, format='csr')
>>> C = sparse.rand(n_rows, n_columns, density=density, format='coo')
>>> D = np.random.randn(n_rows, n_columns)
>>> SA = linalg.clarkson_woodruff_transform(A, sketch_n_rows)  # fastest
>>> SB = linalg.clarkson_woodruff_transform(B, sketch_n_rows)  # fast
>>> SC = linalg.clarkson_woodruff_transform(C, sketch_n_rows)  # slower
>>> SD = linalg.clarkson_woodruff_transform(D, sketch_n_rows)  # slowest
That said, this method does perform well on dense inputs; it is simply slower in relative terms.
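To see the gap on a given machine, one can time the sparse and dense calls directly; a rough sketch using timeit, continuing from the variables above (actual timings depend on hardware and SciPy version):

>>> import timeit
>>> t_sparse = timeit.timeit(
...     lambda: linalg.clarkson_woodruff_transform(A, sketch_n_rows), number=10)
>>> t_dense = timeit.timeit(
...     lambda: linalg.clarkson_woodruff_transform(D, sketch_n_rows), number=10)
>>> faster = t_sparse < t_dense  # usually True, since time scales with A.nnz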
References

[1] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC, 2013.

[2] David P. Woodruff. Sketching as a tool for numerical linear algebra. In Foundations and Trends in Theoretical Computer Science, 2014.
Examples
Given a big dense matrix A:

>>> import numpy as np
>>> from scipy import linalg
>>> n_rows, n_columns, sketch_n_rows = 15000, 100, 200
>>> A = np.random.randn(n_rows, n_columns)
>>> sketch = linalg.clarkson_woodruff_transform(A, sketch_n_rows)
>>> sketch.shape
(200, 100)
>>> norm_A = np.linalg.norm(A)
>>> norm_sketch = np.linalg.norm(sketch)
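The relative discrepancy between the two norms can be inspected directly (a quick check; the exact value varies from run to run):

>>> rel_err = abs(norm_A - norm_sketch) / norm_A  # typically only a few percent
>>> assert rel_err < 1.0  # a very loose check that holds with high probability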
Now with high probability, the true norm norm_A is close to the sketched norm norm_sketch in absolute value.

Similarly, applying our sketch preserves the solution to a linear regression of \(\min \|Ax - b\|\).
>>> import numpy as np
>>> from scipy import linalg
>>> n_rows, n_columns, sketch_n_rows = 15000, 100, 200
>>> A = np.random.randn(n_rows, n_columns)
>>> b = np.random.randn(n_rows)
>>> x = np.linalg.lstsq(A, b, rcond=None)[0]  # [0] extracts the solution vector
>>> Ab = np.hstack((A, b.reshape(-1, 1)))     # sketch A and b together
>>> SAb = linalg.clarkson_woodruff_transform(Ab, sketch_n_rows)
>>> SA, Sb = SAb[:, :-1], SAb[:, -1]
>>> x_sketched = np.linalg.lstsq(SA, Sb, rcond=None)[0]
As with the matrix norm example, np.linalg.norm(A @ x - b) is close to np.linalg.norm(A @ x_sketched - b) with high probability.
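Continuing the example, the two residual norms can be compared directly (values vary from run to run):

>>> resid = np.linalg.norm(A @ x - b)
>>> resid_sketched = np.linalg.norm(A @ x_sketched - b)
>>> rel_gap = (resid_sketched - resid) / resid  # small with high probability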