6.7. Kernel Approximation

This submodule contains functions that approximate the feature mappings that correspond to certain kernels, as they are used for example in support vector machines (see Support Vector Machines). The following feature functions perform non-linear transformations of the input, which can serve as a basis for linear classification or other algorithms.

The advantage of using approximate explicit feature maps compared to the kernel trick, which makes use of feature maps implicitly, is that explicit mappings can be better suited for online learning and can significantly reduce the cost of learning with very large datasets. Standard kernelized SVMs do not scale well to large datasets, but using an approximate kernel map it is possible to use much more efficient linear SVMs. In particular, the combination of kernel map approximations with SGDClassifier can make non-linear learning on large datasets possible.

Since there has not been much empirical work using approximate embeddings, it is advisable to compare results against exact kernel methods when possible.

See also

Polynomial regression: extending linear models with basis functions for an exact polynomial transformation.

6.7.1. Nystroem Method for Kernel Approximation

The Nystroem method, as implemented in Nystroem, is a general method for low-rank approximations of kernels. It achieves this by essentially subsampling the data on which the kernel is evaluated. By default Nystroem uses the rbf kernel, but it can use any kernel function or a precomputed kernel matrix. The number of samples used - which is also the dimensionality of the features computed - is given by the parameter n_components.
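
A minimal sketch of using Nystroem as a drop-in feature map in front of a linear model; the dataset and the gamma and n_components values are illustrative choices, not prescribed settings:

  from sklearn.datasets import load_digits
  from sklearn.kernel_approximation import Nystroem
  from sklearn.svm import LinearSVC

  X, y = load_digits(return_X_y=True)

  # n_components controls both the number of subsampled training points
  # and the dimensionality of the approximate feature space.
  feature_map = Nystroem(kernel="rbf", gamma=0.03, n_components=300, random_state=0)
  X_transformed = feature_map.fit_transform(X)

  # A linear SVM on the transformed data stands in for a kernelized SVM.
  clf = LinearSVC(dual=False)
  clf.fit(X_transformed, y)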

6.7.2. Radial Basis Function Kernel

The RBFSampler constructs an approximate mapping for the radial basis function kernel, also known as Random Kitchen Sinks [RR2007]. This transformation can be used to explicitly model a kernel map, prior to applying a linear algorithm, for example a linear SVM:

  >>> from sklearn.kernel_approximation import RBFSampler
  >>> from sklearn.linear_model import SGDClassifier
  >>> X = [[0, 0], [1, 1], [1, 0], [0, 1]]
  >>> y = [0, 0, 1, 1]
  >>> rbf_feature = RBFSampler(gamma=1, random_state=1)
  >>> X_features = rbf_feature.fit_transform(X)
  >>> clf = SGDClassifier(max_iter=5)
  >>> clf.fit(X_features, y)
  SGDClassifier(max_iter=5)
  >>> clf.score(X_features, y)
  1.0

The mapping relies on a Monte Carlo approximation to the kernel values. The fit function performs the Monte Carlo sampling, whereas the transform method performs the mapping of the data. Because of the inherent randomness of the process, results may vary between different calls to the fit function.

RBFSampler takes two parameters: n_components, which is the target dimensionality of the feature transform, and gamma, the parameter of the RBF kernel. A higher n_components will result in a better approximation of the kernel and will yield results more similar to those produced by a kernel SVM. Note that "fitting" the feature function does not actually depend on the data given to the fit function. Only the dimensionality of the data is used. Details on the method can be found in [RR2007].

For a given value of n_components RBFSampler is often less accurate than Nystroem. RBFSampler is cheaper to compute, though, making use of larger feature spaces more efficient.
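
As a rough illustration of this trade-off, one can compare exact RBF kernel values with the inner products of the mapped features for increasing n_components; the data, gamma and component counts below are arbitrary choices:

  import numpy as np
  from sklearn.kernel_approximation import RBFSampler
  from sklearn.metrics.pairwise import rbf_kernel

  rng = np.random.RandomState(0)
  X = rng.rand(50, 10)

  # Exact kernel values to compare against.
  K_exact = rbf_kernel(X, gamma=0.5)

  for n_components in (10, 100, 1000):
      sampler = RBFSampler(gamma=0.5, n_components=n_components, random_state=0)
      X_map = sampler.fit_transform(X)
      K_approx = X_map @ X_map.T  # inner products in the mapped space
      print(n_components, np.abs(K_exact - K_approx).mean())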

Figure: comparing an exact RBF kernel (left) with the approximation (right).


6.7.3. Additive Chi Squared Kernel

The additive chi squared kernel is a kernel on histograms, often used in computer vision.

The additive chi squared kernel as used here is given by

  k(x, y) = \sum_i \frac{2 x_i y_i}{x_i + y_i}

This is not exactly the same as sklearn.metrics.pairwise.additive_chi2_kernel. The authors of [VZ2010] prefer the version above as it is always positive definite. Since the kernel is additive, it is possible to treat all components $x_i$ separately for embedding. This makes it possible to sample the Fourier transform in regular intervals, instead of approximating using Monte Carlo sampling.

The class AdditiveChi2Sampler implements this component-wise deterministic sampling. Each component is sampled $n$ times, yielding $2n+1$ dimensions per input dimension (the multiple of two stems from the real and complex part of the Fourier transform). In the literature, $n$ is usually chosen to be 1 or 2, transforming the dataset to size n_samples × 5 × n_features (in the case of $n=2$).

The approximate feature map provided by AdditiveChi2Sampler can be combined with the approximate feature map provided by RBFSampler to yield an approximate feature map for the exponentiated chi squared kernel. See [VZ2010] for details and [VVZ2010] for the combination with the RBFSampler.
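
One possible sketch of this chaining, using a pipeline on non-negative, histogram-like data; the sample_steps and gamma values here are illustrative assumptions rather than recommendations from the cited papers:

  from sklearn.datasets import load_digits
  from sklearn.kernel_approximation import AdditiveChi2Sampler, RBFSampler
  from sklearn.linear_model import SGDClassifier
  from sklearn.pipeline import make_pipeline

  X, y = load_digits(return_X_y=True)  # non-negative, histogram-like features

  clf = make_pipeline(
      AdditiveChi2Sampler(sample_steps=2),    # approximate additive chi^2 map
      RBFSampler(gamma=0.1, random_state=0),  # approximate exponentiation via the RBF map
      SGDClassifier(max_iter=1000, tol=1e-3),
  )
  clf.fit(X, y)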

6.7.4. Skewed Chi Squared Kernel

The skewed chi squared kernel is given by:

  k(x, y) = \prod_i \frac{2 \sqrt{x_i + c} \, \sqrt{y_i + c}}{x_i + y_i + 2c}

It has properties that are similar to the exponentiated chi squared kernel often used in computer vision, but allows for a simple Monte Carlo approximation of the feature map.

The usage of the SkewedChi2Sampler is the same as the usage described above for the RBFSampler. The only difference is in the free parameter, which is called $c$. For a motivation for this mapping and the mathematical details see [LS2010].
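
A small usage sketch mirroring the RBFSampler example above; the skewedness value (which plays the role of the parameter $c$) and n_components are arbitrary choices:

  from sklearn.kernel_approximation import SkewedChi2Sampler
  from sklearn.linear_model import SGDClassifier

  X = [[0, 0], [1, 1], [1, 0], [0, 1]]
  y = [0, 0, 1, 1]

  # skewedness is the free parameter c of the skewed chi squared kernel.
  chi2_feature = SkewedChi2Sampler(skewedness=0.01, n_components=10, random_state=0)
  X_features = chi2_feature.fit_transform(X)

  clf = SGDClassifier(max_iter=10, tol=1e-3)
  clf.fit(X_features, y)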

6.7.5. Mathematical Details

Kernel methods like support vector machines or kernelized PCA rely on a property of reproducing kernel Hilbert spaces. For any positive definite kernel function $k$ (a so-called Mercer kernel), it is guaranteed that there exists a mapping $\phi$ into a Hilbert space $\mathcal{H}$, such that

  k(x, y) = \langle \phi(x), \phi(y) \rangle

where $\langle \cdot, \cdot \rangle$ denotes the inner product in the Hilbert space.

If an algorithm, such as a linear support vector machine or PCA, relies only on the scalar product of data points $x_i$, one may use the value of $k(x_i, x_j)$, which corresponds to applying the algorithm to the mapped data points $\phi(x_i)$. The advantage of using $k$ is that the mapping $\phi$ never has to be calculated explicitly, allowing for arbitrarily large features (even infinite).

One drawback of kernel methods is that it might be necessary to store many kernel values $k(x_i, x_j)$ during optimization. If a kernelized classifier is applied to new data $y_j$, $k(x_i, y_j)$ needs to be computed to make predictions, possibly for many different $x_i$ in the training set.

The classes in this submodule make it possible to approximate the embedding $\phi$, thereby working explicitly with the representations $\phi(x_i)$, which obviates the need to apply the kernel or store training examples.
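
Put compactly, each class constructs a finite-dimensional map $\hat\phi: \mathcal{X} \to \mathbb{R}^{D}$ (with $D$ typically set by n_components) such that

  k(x, y) \approx \langle \hat\phi(x), \hat\phi(y) \rangle

so a linear algorithm applied to the explicit features $\hat\phi(x_i)$ behaves approximately like the corresponding kernelized algorithm applied to the $x_i$.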

References:
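
[RR2007] Rahimi, A. and Recht, B. "Random features for large-scale kernel machines." Advances in Neural Information Processing Systems, 2007.

[LS2010] Li, F., Ionescu, C. and Sminchisescu, C. "Random Fourier approximations for skewed multiplicative histogram kernels." Pattern Recognition (DAGM), 2010.

[VZ2010] Vedaldi, A. and Zisserman, A. "Efficient additive kernels via explicit feature maps." Computer Vision and Pattern Recognition (CVPR), 2010.

[VVZ2010] Vempati, S., Vedaldi, A., Zisserman, A. and Jawahar, C. V. "Generalized RBF feature maps for efficient detection." British Machine Vision Conference (BMVC), 2010.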