6.6. Random Projection

The sklearn.random_projection module implements a simple andcomputationally efficient way to reduce the dimensionality of the data bytrading a controlled amount of accuracy (as additional variance) for fasterprocessing times and smaller model sizes. This module implements two types ofunstructured random matrix:Gaussian random matrix andsparse random matrix.

The dimensions and distribution of random projections matrices arecontrolled so as to preserve the pairwise distances between any twosamples of the dataset. Thus random projection is a suitable approximationtechnique for distance based method.

References:

6.6.1. The Johnson-Lindenstrauss lemma

The main theoretical result behind the efficiency of random projection is theJohnson-Lindenstrauss lemma (quoting Wikipedia):

In mathematics, the Johnson-Lindenstrauss lemma is a resultconcerning low-distortion embeddings of points from high-dimensionalinto low-dimensional Euclidean space. The lemma states that a small setof points in a high-dimensional space can be embedded into a space ofmuch lower dimension in such a way that distances between the points arenearly preserved. The map used for the embedding is at least Lipschitz,and can even be taken to be an orthogonal projection.

Knowing only the number of samples, thesklearn.random_projection.johnson_lindenstrauss_min_dim estimatesconservatively the minimal size of the random subspace to guarantee abounded distortion introduced by the random projection:

>>>

  1. >>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
  2. >>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5)
  3. 663
  4. >>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01])
  5. array([ 663, 11841, 1112658])
  6. >>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
  7. array([ 7894, 9868, 11841])

../_images/sphx_glr_plot_johnson_lindenstrauss_bound_0011.png

../_images/sphx_glr_plot_johnson_lindenstrauss_bound_0021.png

Example:

References:

6.6.2. Gaussian random projection

The sklearn.random_projection.GaussianRandomProjection reduces thedimensionality by projecting the original input space on a randomly generatedmatrix where components are drawn from the following distribution

6.6. Random Projection - 图3.

Here a small excerpt which illustrates how to use the Gaussian randomprojection transformer:

>>>

  1. >>> import numpy as np
  2. >>> from sklearn import random_projection
  3. >>> X = np.random.rand(100, 10000)
  4. >>> transformer = random_projection.GaussianRandomProjection()
  5. >>> X_new = transformer.fit_transform(X)
  6. >>> X_new.shape
  7. (100, 3947)

6.6.3. Sparse random projection

The sklearn.random_projection.SparseRandomProjection reduces thedimensionality by projecting the original input space using a sparserandom matrix.

Sparse random matrices are an alternative to dense Gaussian randomprojection matrix that guarantees similar embedding quality while being muchmore memory efficient and allowing faster computation of the projected data.

If we define s = 1 / density, the elements of the random matrixare drawn from

6.6. Random Projection - 图4

where

6.6. Random Projection - 图5 is the size of the projected subspace.By default the density of non zero elements is set to the minimum density asrecommended by Ping Li et al.:6.6. Random Projection - 图6.

Here a small excerpt which illustrates how to use the sparse randomprojection transformer:

>>>

  1. >>> import numpy as np
  2. >>> from sklearn import random_projection
  3. >>> X = np.random.rand(100, 10000)
  4. >>> transformer = random_projection.SparseRandomProjection()
  5. >>> X_new = transformer.fit_transform(X)
  6. >>> X_new.shape
  7. (100, 3947)

References: