1.3. Kernel ridge regression

Kernel ridge regression (KRR) [M2012] combines Ridge regression and classification (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.
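As a minimal sketch of this point, the snippet below fits KernelRidge once with a linear kernel and once with an RBF kernel on a noise-free sinusoid. The data, the random seed, and the values of alpha and gamma are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Toy data (assumed for illustration): a sinusoid sampled on [0, 5].
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel()

# Linear kernel: the learned function is linear in the original space.
linear_krr = KernelRidge(kernel="linear", alpha=1.0).fit(X, y)

# RBF kernel: linear in the kernel-induced space, non-linear in the original one.
rbf_krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5).fit(X, y)

linear_mse = np.mean((linear_krr.predict(X) - y) ** 2)
rbf_mse = np.mean((rbf_krr.predict(X) - y) ** 2)
print(f"training MSE, linear kernel: {linear_mse:.4f}")
print(f"training MSE, RBF kernel:    {rbf_mse:.4f}")
```

On a target like this, the RBF model should follow the curve much more closely than the linear one, since only the kernel differs between the two estimators.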

The form of the model learned by KernelRidge is identical to support vector regression (SVR). However, different loss functions are used: KRR uses squared error loss while support vector regression uses $\epsilon$-insensitive loss, both combined with l2 regularization. In contrast to SVR, fitting KernelRidge can be done in closed form and is typically faster for medium-sized datasets. On the other hand, the learned model is non-sparse and thus slower at prediction time than SVR, which learns a sparse model for $\epsilon > 0$.
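The closed-form fit can be sketched with NumPy alone: with kernel matrix K and regularization strength lam, the dual coefficients solve (K + lam * I) a = y, and a prediction is a kernel-weighted sum over all training points. The helper names rbf_kernel, krr_fit and krr_predict are hypothetical and written for this illustration; this is not the library's implementation.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def krr_fit(X, y, lam, gamma):
    """Solve (K + lam * I) a = y for the dual coefficients a."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, dual_coef, X_new, gamma):
    """Predict as a kernel-weighted sum over every training point."""
    return rbf_kernel(X_new, X_train, gamma) @ dual_coef

# Toy data (assumed for illustration).
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel()

lam, gamma = 0.1, 0.5
dual_coef = krr_fit(X, y, lam, gamma)
y_pred = krr_predict(X, dual_coef, X, gamma)
train_mse = np.mean((y_pred - y) ** 2)
print("non-zero dual coefficients:", np.count_nonzero(dual_coef), "of", len(X))
```

Every training point generally receives a non-zero coefficient, which is why prediction cost grows with the full training set size rather than with a subset of support vectors.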

The following figure compares KernelRidge and SVR on an artificial dataset, which consists of a sinusoidal target function and strong noise added to every fifth datapoint. The learned model of KernelRidge and SVR is plotted, where both complexity/regularization and bandwidth of the RBF kernel have been optimized using grid-search. The learned functions are very similar; however, fitting KernelRidge is approx. seven times faster than fitting SVR (both with grid-search). However, prediction of 100000 target values is more than three times faster with SVR since it has learned a sparse model using only approx. 1/3 of the 100 training datapoints as support vectors.
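A comparison of this kind can be set up roughly as follows. This is a sketch under assumptions: the random seed, noise scale, and parameter grids are illustrative choices, and timings and support-vector counts will vary with the machine and library version.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Sinusoidal target with strong noise added to every fifth datapoint.
rng = np.random.RandomState(42)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(X.shape[0] // 5))

# Grid-search over regularization strength and RBF bandwidth for both models.
gammas = np.logspace(-2, 2, 5)
svr = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1e0, 1e1, 1e2, 1e3], "gamma": gammas},
)
kr = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": [1e0, 1e-1, 1e-2, 1e-3], "gamma": gammas},
)
svr.fit(X, y)
kr.fit(X, y)

# SVR keeps only its support vectors; KernelRidge keeps every training point.
n_support = len(svr.best_estimator_.support_)
print("SVR support vectors:", n_support, "of", len(X))
print("KRR dual coefficients:", kr.best_estimator_.dual_coef_.shape[0])

# Evaluate both learned functions on a dense grid for plotting or comparison.
X_plot = np.linspace(0, 5, 1000)[:, None]
y_svr = svr.predict(X_plot)
y_kr = kr.predict(X_plot)
```

Wrapping the fit and predict calls in time.perf_counter() measurements is one way to reproduce the timing comparison on a given machine.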

The next figure compares the time for fitting and prediction of KernelRidge and SVR for different sizes of the training set. Fitting KernelRidge is faster than SVR for medium-sized training sets (fewer than 1000 samples); however, for larger training sets SVR scales better. With regard to prediction time, SVR is faster than KernelRidge for all sizes of the training set because of the learned sparse solution. Note that the degree of sparsity and thus the prediction time depends on the parameters $\epsilon$ and $C$ of the SVR; $\epsilon = 0$ would correspond to a dense model.
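The effect of the SVR parameter epsilon on sparsity can be checked by counting support vectors for a few tube widths. The sketch below assumes the same kind of noisy sinusoid as above; the fixed values of C and gamma and the epsilon grid are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sinusoid (assumed for illustration), noise on every fifth datapoint.
rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(X.shape[0] // 5))

# Count support vectors for increasing widths of the epsilon-insensitive tube.
n_support = {}
for epsilon in (0.0, 0.1, 0.5):
    model = SVR(kernel="rbf", C=10.0, gamma=0.5, epsilon=epsilon).fit(X, y)
    n_support[epsilon] = len(model.support_)
    print(f"epsilon={epsilon}: {n_support[epsilon]} support vectors of {len(X)}")
```

Points that fall strictly inside the tube do not become support vectors, so a wider tube tends to give a sparser model, while epsilon = 0 leaves essentially every training point in the solution.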

References:

[M2012] “Machine Learning: A Probabilistic Perspective”, Murphy, K. P., chapter 14.4.3, pp. 492-493, The MIT Press, 2012