1.9. Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Bayes' theorem states the following relationship, given class variable $y$ and dependent feature vector $x_1$ through $x_n$:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

Using the naive conditional independence assumption that

$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$$

for all $i$, this relationship is simplified to

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$

Since $P(x_1, \dots, x_n)$ is constant given the input, we can use the following classification rule:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

$$\Downarrow$$

$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),$$

and we can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.
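
As a small worked illustration of this rule (not part of the scikit-learn API; the priors and likelihoods below are invented numbers), the MAP decision simply picks the class maximizing $P(y) \prod_i P(x_i \mid y)$:

>>> import numpy as np
>>> priors = np.array([0.6, 0.4])                    # hypothetical P(y) for two classes
>>> likelihood = np.array([[0.8, 0.1, 0.3],          # hypothetical P(x_i = 1 | y) for
...                        [0.2, 0.7, 0.9]])         # three binary features
>>> x = np.array([1, 0, 1])                          # observed feature vector
>>> per_feature = np.where(x == 1, likelihood, 1 - likelihood)   # P(x_i | y)
>>> scores = priors * per_feature.prod(axis=1)       # P(y) * prod_i P(x_i | y)
>>> int(scores.argmax())                             # MAP estimate: class 0 wins here
0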

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid y)$.

In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters. (For theoretical reasons why naive Bayes works well, and on which types of data it does, see the references below.)

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.

On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.

References:

H. Zhang (2004). The optimality of Naive Bayes. Proc. FLAIRS 2004.

1.9.1. Gaussian Naive Bayes

GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$

The parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.naive_bayes import GaussianNB
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
>>> gnb = GaussianNB()
>>> y_pred = gnb.fit(X_train, y_train).predict(X_test)
>>> print("Number of mislabeled points out of a total %d points : %d"
...       % (X_test.shape[0], (y_test != y_pred).sum()))
Number of mislabeled points out of a total 75 points : 4
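
After fitting, the estimated per-class means and variances are available as attributes of the estimator; the attribute names shown below are those of recent scikit-learn releases (theta_ for the means, var_ for the variances; older versions exposed the variances as sigma_):

>>> gnb.theta_.shape   # per-class feature means, shape (n_classes, n_features)
(3, 4)
>>> gnb.var_.shape     # per-class feature variances
(3, 4)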

1.9.2. Multinomial Naive Bayes

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors $\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.

The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_{y} = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.

The smoothing prior $\alpha \ge 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations. Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
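
A minimal usage sketch on a made-up toy corpus (the documents, labels and spam/ham interpretation below are purely illustrative):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.naive_bayes import MultinomialNB
>>> docs = ["free offer win money", "meeting agenda for monday",
...         "win a free prize now", "project status meeting notes"]
>>> y = [1, 0, 1, 0]                              # 1 = "spam", 0 = "ham" (invented labels)
>>> X = CountVectorizer().fit_transform(docs)     # word count vectors
>>> clf = MultinomialNB(alpha=1.0).fit(X, y)      # alpha=1.0 corresponds to Laplace smoothing
>>> clf.classes_
array([0, 1])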

1.9.3. Complement Naive Bayes

ComplementNB implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets. Specifically, CNB uses statistics from the complement of each class to compute the model's weights. The inventors of CNB show empirically that the parameter estimates for CNB are more stable than those for MNB. Further, CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks. The procedure for calculating the weights is as follows:

$$\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}}$$

$$w_{ci} = \log \hat{\theta}_{ci}$$

$$w_{ci} = \frac{w_{ci}}{\sum_{j} |w_{cj}|}$$

where the summations are over all documents $j$ not in class $c$, $d_{ij}$ is either the count or tf-idf value of term $i$ in document $j$, $\alpha_i$ is a smoothing hyperparameter like that found in MNB, and $\alpha = \sum_{i} \alpha_i$. The second normalization addresses the tendency for longer documents to dominate parameter estimates in MNB. The classification rule is:

$$\hat{c} = \arg\min_c \sum_{i} t_i w_{ci}$$

i.e., a document is assigned to the class that is the poorest complement match.
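
ComplementNB is used like MultinomialNB and expects non-negative count (or tf-idf) features; a short sketch on random count data (illustrative only):

>>> import numpy as np
>>> from sklearn.naive_bayes import ComplementNB
>>> rng = np.random.RandomState(0)
>>> X = rng.randint(5, size=(6, 100))            # non-negative count features
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> clf = ComplementNB(norm=True).fit(X, y)      # norm=True enables the second weight normalization
>>> clf.feature_log_prob_.shape                  # one weight vector per class
(6, 100)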

References:

Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive Bayes text classifiers. Proc. ICML 2003.

1.9.4. Bernoulli Naive Bayes

BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).

The decision rule for Bernoulli naive Bayes is based on

$$P(x_i \mid y) = P(x_i = 1 \mid y) x_i + (1 - P(x_i = 1 \mid y)) (1 - x_i)$$

which differs from multinomial NB's rule in that it explicitly penalizes the non-occurrence of a feature $i$ that is an indicator for class $y$, where the multinomial variant would simply ignore a non-occurring feature.

In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents. It is advisable to evaluate both models, if time permits.
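
Reusing the toy corpus from the MultinomialNB sketch above, but with binary occurrence features (again purely illustrative data):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.naive_bayes import BernoulliNB
>>> docs = ["free offer win money", "meeting agenda for monday",
...         "win a free prize now", "project status meeting notes"]
>>> X = CountVectorizer(binary=True).fit_transform(docs)   # word occurrence (0/1) vectors
>>> clf = BernoulliNB().fit(X, [1, 0, 1, 0])               # binarize=0.0 (default) thresholds features at 0
>>> clf.classes_
array([0, 1])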

References:

C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press.
A. McCallum and K. Nigam (1998). A comparison of event models for Naive Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization.
V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with Naive Bayes – Which Naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).

1.9.5. Categorical Naive Bayes

CategoricalNB implements the categorical naive Bayes algorithm for categorically distributed data. It assumes that each feature, which is described by the index $i$, has its own categorical distribution.

For each feature $i$ in the training set $X$, CategoricalNB estimates a categorical distribution of that feature conditioned on the class $y$. The index set of the samples is defined as $J = \{1, \dots, m\}$, with $m$ as the number of samples.

The probability of category $t$ in feature $i$ given class $c$ is estimated as:

$$P(x_i = t \mid y = c \,;\, \alpha) = \frac{N_{tic} + \alpha}{N_{c} + \alpha n_i}$$

where $N_{tic} = |\{j \in J \mid x_{ij} = t, y_j = c\}|$ is the number of times category $t$ appears in the samples $x_i$, which belong to class $c$, $N_{c} = |\{j \in J \mid y_j = c\}|$ is the number of samples with class $c$, $\alpha$ is a smoothing parameter and $n_i$ is the number of available categories of feature $i$.

CategoricalNB assumes that the sample matrix $X$ is encoded (for instance with the help of OrdinalEncoder) such that all categories for each feature $i$ are represented with numbers $0, \dots, n_i - 1$, where $n_i$ is the number of available categories of feature $i$.
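
A minimal sketch combining OrdinalEncoder with CategoricalNB (the feature values and labels below are invented for illustration):

>>> import numpy as np
>>> from sklearn.preprocessing import OrdinalEncoder
>>> from sklearn.naive_bayes import CategoricalNB
>>> X_raw = np.array([["red", "small"], ["red", "large"],
...                   ["blue", "small"], ["blue", "large"]])
>>> y = [0, 0, 1, 1]
>>> X = OrdinalEncoder().fit_transform(X_raw)   # each feature mapped to 0 .. n_i - 1
>>> clf = CategoricalNB(alpha=1.0).fit(X, y)    # alpha is the smoothing parameter
>>> clf.predict(X[:1])                          # the first sample is closest to class 0
array([0])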

1.9.6. Out-of-core naive Bayes model fitting

Naive Bayes models can be used to tackle large scale classification problems for which the full training set might not fit in memory. To handle this case, MultinomialNB, BernoulliNB, and GaussianNB expose a partial_fit method that can be used incrementally, as done with other classifiers, as demonstrated in Out-of-core classification of text documents. All naive Bayes classifiers support sample weighting.

In contrast to the fit method, the first call to partial_fit needs to be passed the list of all the expected class labels.
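
A minimal sketch of incremental fitting on simulated data split into two chunks; note that the full list of class labels is passed only on the first call (the data here is random and purely illustrative):

>>> import numpy as np
>>> from sklearn.naive_bayes import GaussianNB
>>> rng = np.random.RandomState(0)
>>> X = rng.normal(size=(1000, 10))              # simulated features
>>> y = rng.randint(3, size=1000)                # three classes: 0, 1, 2
>>> clf = GaussianNB()
>>> clf.partial_fit(X[:500], y[:500], classes=np.unique(y))   # first chunk: pass all classes
GaussianNB()
>>> clf.partial_fit(X[500:], y[500:])            # later chunks: data only
GaussianNB()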

For an overview of available strategies in scikit-learn, see also the out-of-core learning documentation.

Note

The partial_fit method call of naive Bayes models introduces some computational overhead. It is recommended to use data chunk sizes that are as large as possible, that is, as large as the available RAM allows.