2.8. Density Estimation

Density estimation walks the line between unsupervised learning, feature engineering, and data modeling. Some of the most popular and useful density estimation techniques are mixture models such as Gaussian Mixtures (sklearn.mixture.GaussianMixture), and neighbor-based approaches such as the kernel density estimate (sklearn.neighbors.KernelDensity). Gaussian Mixtures are discussed more fully in the context of clustering, because the technique is also useful as an unsupervised clustering scheme.

Density estimation is a very simple concept, and most people are already familiar with one common density estimation technique: the histogram.

2.8.1. Density Estimation: Histograms

A histogram is a simple visualization of data where bins are defined, and the number of data points within each bin is tallied. An example of a histogram can be seen in the upper-left panel of the following figure:

hist_to_kde

A major problem with histograms, however, is that the choice of binning can have a disproportionate effect on the resulting visualization. Consider the upper-right panel of the above figure. It shows a histogram over the same data, with the bins shifted right. The results of the two visualizations look entirely different, and might lead to different interpretations of the data.
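The sensitivity to bin placement can be reproduced numerically. The sketch below is only illustrative: the bimodal sample, the bin width of 1.5, and the shift of 0.75 are assumptions, not the data behind the figure.

```python
import numpy as np

# Synthetic bimodal sample (an assumption; the figure's data is not given here).
rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(0, 1, 30), rng.normal(5, 1, 70)])

# Two binnings with the same bin width; the second is shifted right by 0.75.
bins = np.arange(-6, 12, 1.5)
counts, _ = np.histogram(x, bins=bins)
counts_shifted, _ = np.histogram(x, bins=bins + 0.75)

# Same data, same bin width, yet the tallies per bin differ.
print(counts)
print(counts_shifted)
```

Both arrays tally the same 100 points; only the bin edges moved.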

Intuitively, one can also think of a histogram as a stack of blocks, one block per point. By stacking the blocks in the appropriate grid space, we recover the histogram. But what if, instead of stacking the blocks on a regular grid, we center each block on the point it represents, and sum the total height at each location? This idea leads to the lower-left visualization. It is perhaps not as clean as a histogram, but the fact that the data drive the block locations means that it is a much better representation of the underlying data.

This visualization is an example of kernel density estimation, in this case with a top-hat kernel (i.e. a square block at each point). We can recover a smoother distribution by using a smoother kernel. The bottom-right plot shows a Gaussian kernel density estimate, in which each point contributes a Gaussian curve to the total. The result is a smooth density estimate which is derived from the data, and functions as a powerful non-parametric model of the distribution of points.
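The "one block per point" picture can be written out directly with NumPy. This is a minimal sketch under assumed values: the six data points, the evaluation grid, and the width `h` are made up for illustration.

```python
import numpy as np

x = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])  # illustrative data points
grid = np.linspace(-5, 10, 301)                   # locations where heights are summed
h = 1.0  # block half-width / Gaussian standard deviation (assumed value)

# Distance from every grid location to every data point, shape (301, 6).
dist = np.abs(grid[:, None] - x[None, :])

# Top-hat: each point contributes a block of width 2h and height 1 / (2h).
tophat = (dist < h).sum(axis=1) / (2 * h * x.size)

# Gaussian: each point contributes a unit-area Gaussian bump of width h.
gauss = np.exp(-0.5 * (dist / h) ** 2).sum(axis=1) / (np.sqrt(2 * np.pi) * h * x.size)
```

Dividing by the number of points makes each curve integrate to roughly one, so both can be read as density estimates.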

2.8.2. Kernel Density Estimation

Kernel density estimation in scikit-learn is implemented in the sklearn.neighbors.KernelDensity estimator, which uses the Ball Tree or KD Tree for efficient queries (see Nearest Neighbors for a discussion of these). The above example uses a 1D data set for simplicity, but kernel density estimation can be performed in any number of dimensions, though in practice the curse of dimensionality causes its performance to degrade in high dimensions.
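As a small illustration of both points, the sketch below fits the estimator on synthetic 3D data once with each tree type through the `algorithm` parameter. The data, bandwidth, and query points are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))  # synthetic 3D data, purely illustrative

# The tree type only affects how the query is computed, not the estimate itself.
kde_kd = KernelDensity(kernel='gaussian', bandwidth=0.5, algorithm='kd_tree').fit(X)
kde_ball = KernelDensity(kernel='gaussian', bandwidth=0.5, algorithm='ball_tree').fit(X)

query = rng.normal(size=(5, 3))
log_kd = kde_kd.score_samples(query)      # log-density at each query point
log_ball = kde_ball.score_samples(query)
print(np.allclose(log_kd, log_ball))
```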

In the following figure, 100 points are drawn from a bimodal distribution, and the kernel density estimates are shown for three choices of kernels:

kde_1d_distribution

It’s clear how the kernel shape affects the smoothness of the resulting distribution. The scikit-learn kernel density estimator can be used as follows:

    >>> from sklearn.neighbors import KernelDensity
    >>> import numpy as np
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
    >>> kde.score_samples(X)
    array([-0.41075698, -0.41075698, -0.41076071, -0.41075698, -0.41075698,
           -0.41076071])

Here we have used kernel='gaussian', as seen above. Mathematically, a kernel is a positive function $K(x; h)$ which is controlled by the bandwidth parameter $h$. Given this kernel form, the density estimate at a point $y$ within a group of points $x_i;\ i = 1, \dots, N$ is given by:

$$\rho_K(y) = \sum_{i=1}^{N} K(y - x_i; h)$$
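The sum over kernels can be checked by hand against the estimator. The sketch below reuses the six points and the bandwidth of 0.2 from the example above, and assumes the usual normalisation of a $d$-dimensional Gaussian kernel together with division by the number of points $N$; `score_samples` returns the logarithm of the density, so the comparison is made on the log scale.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
h = 0.2
n, d = X.shape

# Pairwise squared Euclidean distances between query points (X itself) and data.
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# Normalised Gaussian kernel in d dimensions, averaged over the n data points.
norm = (2 * np.pi * h ** 2) ** (d / 2)
density = np.exp(-sq_dist / (2 * h ** 2)).sum(axis=1) / (n * norm)

kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(X)
print(np.allclose(np.log(density), kde.score_samples(X)))
```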

The bandwidth here acts as a smoothing parameter, controlling the tradeoff between bias and variance in the result. A large bandwidth leads to a very smooth (i.e. high-bias) density distribution. A small bandwidth leads to an unsmooth (i.e. high-variance) density distribution.
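One way to see this tradeoff is to fit on one sample and score a second, independent sample from the same distribution. The sketch below is an illustration under assumptions: the helper `draw`, the bimodal mixture, and the three bandwidths are all hypothetical choices, not part of the library or of the figures in this section.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def draw(rng, n):
    """Draw n points from a bimodal mixture (hypothetical helper, illustrative data)."""
    left = rng.normal(0, 1, n * 3 // 10)
    right = rng.normal(5, 1, n - left.size)
    return np.concatenate([left, right])[:, np.newaxis]

rng = np.random.RandomState(0)
X_train, X_test = draw(rng, 300), draw(rng, 300)

# Mean held-out log-likelihood for a very small, a moderate, and a very large bandwidth.
scores = {}
for h in (0.01, 0.5, 10.0):
    kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(X_train)
    scores[h] = kde.score(X_test) / len(X_test)  # score() is the total log-likelihood
    print(h, scores[h])
```

The very small bandwidth follows the training points too closely and the very large one washes out the two modes, so the moderate value should score best on the held-out sample.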

sklearn.neighbors.KernelDensity implements several common kernel forms, which are shown in the following figure:

kde_kernels

The form of these kernels is as follows:

  • Gaussian kernel (kernel = 'gaussian')

    $K(x; h) \propto \exp\left(-\frac{x^2}{2h^2}\right)$

  • Tophat kernel (kernel = 'tophat')

    $K(x; h) \propto 1$ if $x < h$

  • Epanechnikov kernel (kernel = 'epanechnikov')

    $K(x; h) \propto 1 - \frac{x^2}{h^2}$

  • Exponential kernel (kernel = 'exponential')

    $K(x; h) \propto \exp\left(-\frac{x}{h}\right)$

  • Linear kernel (kernel = 'linear')

    $K(x; h) \propto 1 - \frac{x}{h}$ if $x < h$

  • Cosine kernel (kernel = 'cosine')

    $K(x; h) \propto \cos\left(\frac{\pi x}{2h}\right)$ if $x < h$
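The kernel shapes listed above can be written as plain NumPy functions of the distance $x \ge 0$. This is an unnormalised sketch for comparing shapes, not the library's implementation; `kernel_profile` is a hypothetical helper, and it assumes every compact-support kernel (including Epanechnikov) is zero for $x \ge h$.

```python
import numpy as np

def kernel_profile(x, h, kernel):
    """Unnormalised kernel value for distance x >= 0 and bandwidth h."""
    x = np.asarray(x, dtype=float)
    inside = x < h  # compact-support kernels are taken to vanish outside the bandwidth
    if kernel == 'gaussian':
        return np.exp(-x ** 2 / (2 * h ** 2))
    if kernel == 'tophat':
        return np.where(inside, 1.0, 0.0)
    if kernel == 'epanechnikov':
        return np.where(inside, 1 - x ** 2 / h ** 2, 0.0)
    if kernel == 'exponential':
        return np.exp(-x / h)
    if kernel == 'linear':
        return np.where(inside, 1 - x / h, 0.0)
    if kernel == 'cosine':
        return np.where(inside, np.cos(np.pi * x / (2 * h)), 0.0)
    raise ValueError(f"unknown kernel: {kernel!r}")

# Evaluate every profile at a few distances for a bandwidth of 1.
distances = np.array([0.0, 0.5, 2.0])
for name in ('gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'):
    print(name, kernel_profile(distances, 1.0, name))
```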

The kernel density estimator can be used with any of the valid distance metrics (see sklearn.neighbors.DistanceMetric for a list of available metrics), though the results are properly normalized only for the Euclidean metric. One particularly useful metric is the Haversine distance, which measures the angular distance between points on a sphere. Here is an example of using a kernel density estimate for a visualization of geospatial data, in this case the distribution of observations of two different species on the South American continent:

species_kde
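A minimal sketch of the Haversine metric in use follows. The coordinates are made-up observation sites rather than the species data in the figure, and the bandwidth of 0.05 radians is an arbitrary choice. The Haversine metric expects each row as (latitude, longitude) in radians, and the Ball Tree is requested explicitly because it supports this metric.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical observation sites as (latitude, longitude) in degrees.
latlon_deg = np.array([[-3.1, -60.0], [-12.0, -77.0], [-23.5, -46.6], [-34.6, -58.4]])
latlon_rad = np.radians(latlon_deg)  # haversine works on radians

kde = KernelDensity(kernel='gaussian', bandwidth=0.05,
                    metric='haversine', algorithm='ball_tree')
kde.fit(latlon_rad)

# Log-density at a query location, also given in radians.
query = np.radians([[-15.8, -47.9]])
log_density = kde.score_samples(query)
print(log_density)
```

As noted above, the result is not a properly normalized density for this metric, so it is best read as a relative score.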

One other useful application of kernel density estimation is to learn a non-parametric generative model of a dataset in order to efficiently draw new samples from this generative model. Here is an example of using this process to create a new set of hand-written digits, using a Gaussian kernel learned on a PCA projection of the data:

digits_kde

The “new” data consists of linear combinations of the input data, with weights probabilistically drawn given the KDE model.
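The generative use can be sketched as follows. The number of PCA components (15) and the bandwidth (3.0) are assumptions picked for illustration; in practice the bandwidth would be tuned.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

# Project the 64-dimensional digit images onto 15 principal components.
digits = load_digits()
pca = PCA(n_components=15, whiten=False)
data = pca.fit_transform(digits.data)

# Fit a Gaussian KDE in the reduced space (bandwidth chosen arbitrarily here).
kde = KernelDensity(kernel='gaussian', bandwidth=3.0).fit(data)

# Draw new points from the fitted model and map them back to pixel space.
new_data = kde.sample(10, random_state=0)
new_digits = pca.inverse_transform(new_data)
print(new_digits.shape)
```

Each row of `new_digits` can be reshaped to 8 by 8 and displayed as an image.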

Examples: