1.16. Probability calibration

When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some do not even support probability prediction. The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.

Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly interpreted as a confidence level. For instance, a well calibrated (binary) classifier should classify the samples such that among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the positive class. The following plot compares how well the probabilistic predictions of different classifiers are calibrated:
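
As an illustration of this definition, the following sketch bins the predict_proba output of a logistic regression and compares the mean predicted probability around 0.8 with the observed fraction of positives. The dataset (from make_classification) and the bin edges are illustrative choices, not the data of the plot below:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Illustrative data; any binary classification dataset works here.
    X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]

    # Among the samples that received a predicted probability close to 0.8,
    # roughly 80% should actually belong to the positive class.
    mask = (proba >= 0.75) & (proba <= 0.85)
    print("mean predicted probability:", proba[mask].mean())
    print("observed fraction of positives:", y_test[mask].mean())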

../_images/sphx_glr_plot_compare_calibration_0011.png

LogisticRegression returns well calibrated predictions by default as it directly optimizes log-loss. In contrast, the other methods return biased probabilities, with different biases per method (a code sketch illustrating such a comparison is given after the list below):

  • GaussianNB tends to push probabilities to 0 or 1 (note the counts in the histograms). This is mainly because it makes the assumption that features are conditionally independent given the class, which is not the case in this dataset, which contains 2 redundant features.
  • RandomForestClassifier shows the opposite behavior: the histograms show peaks at approximately 0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil and Caruana [4]: “Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subsetting.” As a result, the calibration curve, also referred to as the reliability diagram (Wilks 1995 [5]), shows a characteristic sigmoid shape, indicating that the classifier could trust its “intuition” more and typically return probabilities closer to 0 or 1.
  • Linear Support Vector Classification (LinearSVC) shows an even more sigmoid curve than the RandomForestClassifier, which is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [4]), which focus on hard samples that are close to the decision boundary (the support vectors).
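
The comparison in the plot can be reproduced along the following lines with calibration_curve; the synthetic dataset, the training-set size and the number of bins are illustrative, and LinearSVC is omitted because it does not expose predict_proba (its decision_function output would have to be rescaled first):

    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=100000, n_features=20, random_state=42)
    # Train on a small fraction so that miscalibration is clearly visible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=1000, random_state=42)

    for clf in [LogisticRegression(), GaussianNB(),
                RandomForestClassifier(n_estimators=100)]:
        clf.fit(X_train, y_train)
        prob_pos = clf.predict_proba(X_test)[:, 1]
        # For a perfectly calibrated classifier, the fraction of positives in
        # each bin equals the mean predicted probability of that bin.
        frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
        print(type(clf).__name__)
        print(list(zip(mean_pred.round(2), frac_pos.round(2))))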

Two approaches for performing calibration of probabilistic predictions are provided: a parametric approach based on Platt’s sigmoid model and a non-parametric approach based on isotonic regression (sklearn.isotonic). Probability calibration should be done on new data not used for model fitting. The class CalibratedClassifierCV uses a cross-validation generator and estimates for each split the model parameters on the train samples and the calibration on the test samples. The probabilities predicted for the folds are then averaged. Already fitted classifiers can be calibrated by CalibratedClassifierCV via the parameter cv="prefit". In this case, the user has to take care manually that data for model fitting and calibration are disjoint.
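
For instance, calibrating a LinearSVC (which only provides decision_function, not predict_proba) could look like the following sketch; the dataset is illustrative, and the exact constructor arguments (e.g. whether the prefit option is spelled cv="prefit") may differ between scikit-learn versions:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=5000, random_state=0)
    X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

    # Variant 1: let CalibratedClassifierCV handle the splitting itself.
    clf_cv = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
    clf_cv.fit(X_train, y_train)

    # Variant 2: calibrate an already fitted classifier on held-out data.
    svc = LinearSVC().fit(X_train, y_train)
    clf_prefit = CalibratedClassifierCV(svc, method="sigmoid", cv="prefit")
    clf_prefit.fit(X_calib, y_calib)

    print(clf_prefit.predict_proba(X_calib[:5]))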

The following images demonstrate the benefit of probability calibration. The first image presents a dataset with 2 classes and 3 blobs of data. The blob in the middle contains random samples of each class. The probability for the samples in this blob should be 0.5.

../_images/sphx_glr_plot_calibration_0011.png

The following image shows the estimated probabilities on the data above, using a Gaussian naive Bayes classifier without calibration, with sigmoid calibration, and with non-parametric isotonic calibration. One can observe that the non-parametric model provides the most accurate probability estimates for samples in the middle, i.e., 0.5.

../_images/sphx_glr_plot_calibration_0021.png
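
The setup of this experiment can be sketched roughly as follows; the blob centers, the radius used to pick out the middle blob and the cross-validation settings are illustrative choices, not the exact parameters used for the figures above:

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    n_samples = 30000
    X, y = make_blobs(n_samples=n_samples, centers=[(-5, -5), (0, 0), (5, 5)],
                      shuffle=False, random_state=42)
    # Relabel into two classes so that the middle blob is a 50/50 mixture.
    y[:n_samples // 2] = 0
    y[n_samples // 2:] = 1

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    middle = np.linalg.norm(X_test, axis=1) < 2  # samples from the middle blob

    for name, clf in [
        ("uncalibrated GNB", GaussianNB()),
        ("sigmoid calibration", CalibratedClassifierCV(GaussianNB(),
                                                       method="sigmoid", cv=3)),
        ("isotonic calibration", CalibratedClassifierCV(GaussianNB(),
                                                        method="isotonic", cv=3)),
    ]:
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_test)[:, 1]
        # Mean deviation from the ideal probability of 0.5 for the middle blob;
        # smaller is better.
        print(name, round(float(np.abs(proba[middle] - 0.5).mean()), 3))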

The following experiment is performed on an artificial dataset for binary classification with 100,000 samples (1,000 of them are used for model fitting) and 20 features. Of the 20 features, only 2 are informative and 10 are redundant. The figure shows the estimated probabilities obtained with logistic regression, a linear support-vector classifier (SVC), and linear SVC with both isotonic calibration and sigmoid calibration. Calibration performance is evaluated with the Brier score (brier_score_loss), reported in the legend (the smaller the better). The Brier score is a combination of calibration loss and refinement loss: calibration loss is defined as the mean squared deviation from empirical probabilities derived from the slope of ROC segments, while refinement loss can be defined as the expected optimal loss as measured by the area under the optimal cost curve.

../_images/sphx_glr_plot_calibration_curve_0021.png

One can observe here that logistic regression is well calibrated as its curve is nearly diagonal. Linear SVC’s calibration curve or reliability diagram has a sigmoid curve, which is typical for an under-confident classifier. In the case of LinearSVC, this is caused by the margin property of the hinge loss, which lets the model focus on hard samples that are close to the decision boundary (the support vectors). Both kinds of calibration can fix this issue and yield nearly identical results. The next figure shows the calibration curve of Gaussian naive Bayes on the same data, with both kinds of calibration and also without calibration.

../_images/sphx_glr_plot_calibration_curve_0011.png

One can see that Gaussian naive Bayes performs very badly but does so in a different way than linear SVC: while linear SVC exhibited a sigmoid calibration curve, Gaussian naive Bayes’ calibration curve has a transposed-sigmoid shape. This is typical for an over-confident classifier. In this case, the classifier’s overconfidence is caused by the redundant features which violate the naive Bayes assumption of feature independence.

Calibration of the probabilities of Gaussian naive Bayes with isotonic regression can fix this issue as can be seen from the nearly diagonal calibration curve. Sigmoid calibration also improves the Brier score slightly, albeit not as strongly as the non-parametric isotonic calibration. This is an intrinsic limitation of sigmoid calibration, whose parametric form assumes a sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic calibration model, however, makes no such strong assumptions and can deal with either shape, provided that there is sufficient calibration data. In general, sigmoid calibration is preferable in cases where the calibration curve is sigmoid and where there is limited calibration data, while isotonic calibration is preferable for non-sigmoid calibration curves and in situations where large amounts of data are available for calibration.
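
A sketch of this comparison, on a dataset generated as described above, could look as follows; the split sizes and the internal cross-validation of CalibratedClassifierCV are illustrative:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.metrics import brier_score_loss
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=100000, n_features=20, n_informative=2,
                               n_redundant=10, random_state=42)
    # Keep the training set small, as in the experiment above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=1000, random_state=42)

    for name, clf in [
        ("GNB", GaussianNB()),
        ("GNB + sigmoid", CalibratedClassifierCV(GaussianNB(),
                                                 method="sigmoid", cv=2)),
        ("GNB + isotonic", CalibratedClassifierCV(GaussianNB(),
                                                  method="isotonic", cv=2)),
    ]:
        clf.fit(X_train, y_train)
        prob_pos = clf.predict_proba(X_test)[:, 1]
        # The smaller the Brier score, the better.
        print(name, "Brier score: %.3f" % brier_score_loss(y_test, prob_pos))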

CalibratedClassifierCV can also deal with classification tasks that involve more than two classes if the base estimator can do so. In this case, the classifier is calibrated first for each class separately in a one-vs-rest fashion. When predicting probabilities for unseen data, the calibrated probabilities for each class are predicted separately. As those probabilities do not necessarily sum to one, a postprocessing step is performed to normalize them.
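
A minimal sketch of multiclass calibration, assuming an illustrative 3-class blob dataset and a small random forest as base estimator:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_blobs(n_samples=2000, centers=3, random_state=42)
    clf = CalibratedClassifierCV(RandomForestClassifier(n_estimators=25),
                                 method="sigmoid", cv=3)
    clf.fit(X, y)

    proba = clf.predict_proba(X[:5])
    print(proba)               # one calibrated probability per class
    print(proba.sum(axis=1))   # rows sum to one after the normalization step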

The next image illustrates how sigmoid calibration changes predicted probabilities for a 3-class classification problem. Illustrated is the standard 2-simplex, where the three corners correspond to the three classes. Arrows point from the probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier after sigmoid calibration on a hold-out validation set. Colors indicate the true class of an instance (red: class 1, green: class 2, blue: class 3).

../_images/sphx_glr_plot_calibration_multiclass_0011.png

The base classifier is a random forest classifier with 25 base estimators (trees). If this classifier is trained on all 800 training datapoints, it is overly confident in its predictions and thus incurs a large log-loss. Calibrating an identical classifier, which was trained on 600 datapoints, with method="sigmoid" on the remaining 200 datapoints reduces the confidence of the predictions, i.e., moves the probability vectors from the edges of the simplex towards the center:

../_images/sphx_glr_plot_calibration_multiclass_0021.png

This calibration results in a lower log-loss. Note that an alternative would have been to increase the number of base estimators which would have resulted in a similar decrease in log-loss.
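
The comparison can be sketched as follows; the dataset here is an illustrative blob mixture rather than the data used for the figures above, and as before the cv="prefit" spelling may vary between scikit-learn versions:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    # Illustrative 3-class data with heavily overlapping classes.
    X, y = make_blobs(n_samples=2000, centers=3, cluster_std=5.0, random_state=42)
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, train_size=800, random_state=42)
    X_train, X_calib = X_dev[:600], X_dev[600:]
    y_train, y_calib = y_dev[:600], y_dev[600:]

    # Uncalibrated forest trained on all 800 development points.
    uncal = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_dev, y_dev)

    # Identical forest trained on 600 points, sigmoid-calibrated on the other 200.
    base = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_train, y_train)
    cal = CalibratedClassifierCV(base, method="sigmoid", cv="prefit").fit(X_calib, y_calib)

    print("uncalibrated log-loss:", log_loss(y_test, uncal.predict_proba(X_test)))
    print("calibrated log-loss:  ", log_loss(y_test, cal.predict_proba(X_test)))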

References:

  • Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001

  • Transforming Classifier Scores into Accurate Multiclass Probability Estimates, B. Zadrozny & C. Elkan, KDD 2002

  • Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, J. Platt, 1999

  • [4] Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005

  • [5] On the combination of forecast probabilities for consecutive precipitation periods, Wea. Forecasting, 5, 640–650, Wilks, D. S., 1990a