1.2. Linear and Quadratic Discriminant Analysis

Linear Discriminant Analysis (discriminant_analysis.LinearDiscriminantAnalysis) and Quadratic Discriminant Analysis (discriminant_analysis.QuadraticDiscriminantAnalysis) are two classic classifiers, with, as their names suggest, a linear and a quadratic decision surface, respectively.

These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently multiclass, have proven to work well in practice, and have no hyperparameters to tune.
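Both estimators follow the usual scikit-learn fit/predict API. The following is a minimal sketch, not part of the formal guide, using the Iris dataset purely for illustration:

```python
# A minimal sketch: fitting both classifiers on the Iris dataset
# and comparing their mean accuracy on the training data.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis().fit(X, y)     # linear decision surface
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # quadratic decision surface

print(lda.score(X, y), qda.score(X, y))
```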

[Plot: decision boundaries of Linear Discriminant Analysis and Quadratic Discriminant Analysis]

The plot shows decision boundaries for Linear Discriminant Analysis and Quadratic Discriminant Analysis. The bottom row demonstrates that Linear Discriminant Analysis can only learn linear boundaries, while Quadratic Discriminant Analysis can learn quadratic boundaries and is therefore more flexible.

Examples:

Linear and Quadratic Discriminant Analysis with covariance ellipsoid: Comparison of LDA and QDA on synthetic data.

1.2.1. Dimensionality reduction using Linear Discriminant Analysis

discriminant_analysis.LinearDiscriminantAnalysis can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes (in a precise sense discussed in the mathematics section below). The dimension of the output is necessarily less than the number of classes, so this is, in general, a rather strong dimensionality reduction, and only makes sense in a multiclass setting.

This is implemented in discriminant_analysis.LinearDiscriminantAnalysis.transform. The desired dimensionality can be set using the n_components constructor parameter. This parameter has no influence on discriminant_analysis.LinearDiscriminantAnalysis.fit or discriminant_analysis.LinearDiscriminantAnalysis.predict.
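For example, the following sketch (on the Iris dataset, chosen here only for illustration) projects 4-dimensional data with 3 classes onto a single discriminant axis by setting n_components=1:

```python
# A short sketch: supervised dimensionality reduction with LDA.
# n_components only affects transform, not fit or predict.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
X_1d = lda.transform(X)
print(X_1d.shape)  # (150, 1): data projected onto one discriminant axis
```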

Examples:

Comparison of LDA and PCA 2D projection of Iris dataset: Comparison of LDA and PCA for dimensionality reduction of the Iris dataset.

1.2.2. Mathematical formulation of the LDA and QDA classifiers

Both LDA and QDA can be derived from simple probabilistic models which model the class conditional distribution of the data $P(X|y=k)$ for each class $k$. Predictions can then be obtained by using Bayes' rule:

$$P(y=k | X) = \frac{P(X | y=k) P(y=k)}{P(X)} = \frac{P(X | y=k) P(y=k)}{\sum_{l} P(X | y=l) \cdot P(y=l)}$$

and we select the class $k$ which maximizes this conditional probability.

More specifically, for linear and quadratic discriminant analysis, $P(X|y)$ is modeled as a multivariate Gaussian distribution with density:

$$P(X | y=k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (X-\mu_k)^t \Sigma_k^{-1} (X-\mu_k)\right)$$

where $d$ is the number of features.

To use this model as a classifier, we just need to estimate from the training data the class priors $P(y=k)$ (by the proportion of instances of class $k$), the class means $\mu_k$ (by the empirical sample class means) and the covariance matrices (either by the empirical sample class covariance matrices, or by a regularized estimator: see the section on shrinkage below).
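As an illustrative sanity check (a sketch, not part of the formal description), the fitted priors_ and means_ attributes can be compared against these empirical quantities, assuming the default behaviour of estimating the priors from the training data:

```python
# Sketch: the fitted class priors and class means coincide with the
# empirical class proportions and the empirical sample class means.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)

# Class priors estimated as the proportion of samples in each class ...
np.testing.assert_allclose(lda.priors_, np.bincount(y) / len(y))
# ... and class means estimated as the empirical sample class means.
np.testing.assert_allclose(
    lda.means_, [X[y == k].mean(axis=0) for k in np.unique(y)]
)
```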

In the case of LDA, the Gaussians for each class are assumed to share the same covariance matrix: $\Sigma_k = \Sigma$ for all $k$. This leads to linear decision surfaces, which can be seen by comparing the log-probability ratios $\log[P(y=k|X) / P(y=l|X)]$:

$$\log\left(\frac{P(y=k|X)}{P(y=l|X)}\right) = \log\left(\frac{P(X|y=k)\,P(y=k)}{P(X|y=l)\,P(y=l)}\right) = 0 \;\Leftrightarrow\; (\mu_k - \mu_l)^t \Sigma^{-1} X = \frac{1}{2} \left(\mu_k^t \Sigma^{-1} \mu_k - \mu_l^t \Sigma^{-1} \mu_l\right) - \log\frac{P(y=k)}{P(y=l)}$$
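This linearity can also be verified numerically: the per-class scores returned by decision_function are an affine function of the input, as in the sketch below (an illustration on the Iris data, not part of the derivation):

```python
# Sketch: LDA is a linear classifier, so its per-class scores are affine
# in X and decision_function(X) coincides with X @ coef_.T + intercept_.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)

np.testing.assert_allclose(
    lda.decision_function(X), X @ lda.coef_.T + lda.intercept_
)
```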

In the case of QDA, there are no assumptions on the covariance matrices $\Sigma_k$ of the Gaussians, leading to quadratic decision surfaces. See [3] for more details.

Note

Relation with Gaussian Naive Bayes

If in the QDA model one assumes that the covariance matrices are diagonal, then the inputs are assumed to be conditionally independent in each class, and the resulting classifier is equivalent to the Gaussian Naive Bayes classifier naive_bayes.GaussianNB.
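This equivalence can be illustrated with a short sketch: plugging the per-class means and variances fitted by naive_bayes.GaussianNB into the diagonal-covariance Gaussian model above reproduces its predictions. The attribute names theta_, var_ and class_prior_ assume a recent scikit-learn release (older versions expose sigma_ instead of var_):

```python
# Sketch: QDA with diagonal covariances reproduces Gaussian Naive Bayes.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X, y)  # one mean and one variance per class and feature

# log of P(x | y=k) P(y=k) for a Gaussian with diagonal covariance,
# i.e. the QDA model restricted to diagonal covariance matrices.
log_joint = np.column_stack([
    np.log(gnb.class_prior_[k])
    - 0.5 * np.sum(np.log(2 * np.pi * gnb.var_[k])
                   + (X - gnb.theta_[k]) ** 2 / gnb.var_[k], axis=1)
    for k in range(len(gnb.classes_))
])

# Selecting the class that maximizes this joint log-probability gives
# the same predictions as Gaussian Naive Bayes.
assert np.array_equal(gnb.classes_[log_joint.argmax(axis=1)], gnb.predict(X))
```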

1.2.3. Mathematical formulation of LDA dimensionality reduction

To understand the use of LDA in dimensionality reduction, it is useful to start with a geometric reformulation of the LDA classification rule explained above. We write $K$ for the total number of target classes. Since in LDA we assume that all classes have the same estimated covariance $\Sigma$, we can rescale the data so that this covariance is the identity:

$$X^* = D^{-1/2} U^t X \quad \text{with} \quad \Sigma = U D U^t$$

Then one can show that to classify a data point after scaling is equivalent to finding the estimated class mean $\mu^*_k$ which is closest to the data point in the Euclidean distance. But this can be done just as well after projecting onto the $K-1$ dimensional affine subspace $H_K$ generated by all the $\mu^*_k$ for all classes. This shows that, implicit in the LDA classifier, there is a dimensionality reduction by linear projection onto a $K-1$ dimensional space.

We can reduce the dimension even more, to a chosen $L$, by projecting onto the linear subspace $H_L$ which maximizes the variance of the $\mu^*_k$ after projection (in effect, we are doing a form of PCA for the transformed class means $\mu^*_k$). This $L$ corresponds to the n_components parameter used in the discriminant_analysis.LinearDiscriminantAnalysis.transform method. See [3] for more details.
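The $K-1$ bound can be observed directly in the transform output: with the default n_components, the Iris data ($K = 3$ classes, 4 features) is projected onto min(K-1, n_features) = 2 discriminant axes. A short illustrative sketch:

```python
# Sketch: with the default n_components, LDA projects onto at most
# K - 1 discriminant axes (here K = 3 classes, so 2 axes).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

X_proj = LinearDiscriminantAnalysis().fit(X, y).transform(X)
print(X_proj.shape)  # (150, 2)
```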

1.2.4. Shrinkage

Shrinkage is a tool to improve estimation of covariance matrices in situations where the number of training samples is small compared to the number of features. In this scenario, the empirical sample covariance is a poor estimator. Shrinkage LDA can be used by setting the shrinkage parameter of the discriminant_analysis.LinearDiscriminantAnalysis class to ‘auto’. This automatically determines the optimal shrinkage parameter in an analytic way following the lemma introduced by Ledoit and Wolf [4]. Note that currently shrinkage only works when setting the solver parameter to ‘lsqr’ or ‘eigen’.

The shrinkage parameter can also be manually set between 0 and 1. In particular, a value of 0 corresponds to no shrinkage (which means the empirical covariance matrix will be used) and a value of 1 corresponds to complete shrinkage (which means that the diagonal matrix of variances will be used as an estimate for the covariance matrix). Setting this parameter to a value between these two extrema will estimate a shrunk version of the covariance matrix.
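As a hedged illustration of these settings (on a synthetic dataset generated with make_classification, chosen only to create a small-sample, high-dimensional regime), the different shrinkage values can be compared by cross-validation:

```python
# Sketch: comparing no shrinkage, automatic (Ledoit-Wolf) shrinkage and a
# manually chosen shrinkage value with the 'lsqr' solver.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Few samples relative to the number of features: the regime where the
# empirical covariance is a poor estimator and shrinkage typically helps.
X, y = make_classification(n_samples=40, n_features=100, n_informative=2,
                           n_redundant=0, random_state=0)

for shrinkage in [None, "auto", 0.5]:
    lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=shrinkage)
    print(shrinkage, cross_val_score(lda, X, y, cv=5).mean())
```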

[Plot: classification accuracy of LDA with and without shrinkage]

1.2.5. Estimation algorithms

The default solver is ‘svd’. It can perform both classification and transform, and it does not rely on the calculation of the covariance matrix. This can be an advantage in situations where the number of features is large. However, the ‘svd’ solver cannot be used with shrinkage.

The ‘lsqr’ solver is an efficient algorithm that only works for classification. It supports shrinkage.

The ‘eigen’ solver is based on the optimization of the between class scatter to within class scatter ratio. It can be used for both classification and transform, and it supports shrinkage. However, the ‘eigen’ solver needs to compute the covariance matrix, so it might not be suitable for situations with a high number of features.
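As a brief sketch (on the Iris data, for illustration only), the solver is selected through the solver parameter; all three fit the same underlying model and, on well-behaved data, typically produce the same predictions:

```python
# Sketch: the three solvers fit the same model and differ mainly in what
# they support ('lsqr' has no transform, 'svd' no shrinkage) and in how
# they scale with the number of features.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

for solver in ["svd", "lsqr", "eigen"]:
    lda = LinearDiscriminantAnalysis(solver=solver).fit(X, y)
    print(solver, lda.score(X, y))
```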

Examples:

Normal and Shrinkage Linear Discriminant Analysis for classification: Comparison of LDA classifiers with and without shrinkage.

References:

  • [3] “The Elements of Statistical Learning”, Hastie T., Tibshirani R., Friedman J., Section 4.3, p. 106-119, 2008.
  • [4] Ledoit O., Wolf M. “Honey, I Shrunk the Sample Covariance Matrix”, The Journal of Portfolio Management 30(4), 110-119, 2004.