Compare cross validation values of Bernoulli NB and Multinomial NB

Question

I'm testing Multinomial NB and Bernoulli NB on my dataset, and I'm using the cross-validation score to better understand which of the two algorithms works better. This is the first classifier:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

clf_multinomial = MultinomialNB()
clf_multinomial.fit(X_train, y_train)
y_predicted = clf_multinomial.predict(X_test)
# Accuracy on the held-out test set
score = clf_multinomial.score(X_test, y_test)
# 5-fold cross-validation accuracy on the training set
scores = cross_val_score(clf_multinomial, X_train, y_train, cv=5)
print(scores)
print(score)

And these are the scores:
[0.75       0.875      0.66666667 0.95833333 0.86956522]
0.8637666498061035

This is the second classifier:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

clf_multivariate = BernoulliNB()
clf_multivariate.fit(X_train, y_train)
y_predicted = clf_multivariate.predict(X_test)
# Accuracy on the held-out test set
score = clf_multivariate.score(X_test, y_test)
# 5-fold cross-validation accuracy on the training set
scores = cross_val_score(clf_multivariate, X_train, y_train, cv=5)
print(scores)
print(score)

And these are the scores:
[0.5        0.5        0.54166667 0.54166667 0.52173913]
0.5

From what I understood from the answer posted here, the first classifier works better because my dataset has many features (11k) instead of just one. However, it seems strange that the second classifier scores 0.5, which looks like a high value considering the number of features. What are the differences between the two classifiers?

Erwan · Answer

The difference is explained in the documentation:

Like MultinomialNB, this classifier is suitable for discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.

So it's not a matter of the number of features; it's about how the feature values are interpreted: Multinomial can deal with multiple discrete values, whereas Bernoulli deals only with binary variables.
The doc also mentions this option:

binarize: float or None, default=0.0
Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.

Since you didn't provide a value for this option in your code, the default 0.0 applies. This means that when using Bernoulli, all the variables are converted to binary variables: anything less than or equal to 0 becomes 0, anything greater becomes 1. This explains why Bernoulli works with your data, albeit not as well as Multinomial: for many features in your data, whether the value is zero or not is probably a good indication of the label.
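To make the effect of the default threshold concrete, here is a minimal sketch on synthetic data. The Poisson count matrix, its shape, and all variable names are assumptions made up for illustration, not the asker's dataset. It checks that BernoulliNB with the default binarize=0.0 behaves like BernoulliNB(binarize=None) fitted on features thresholded by hand, and then runs the same 5-fold cross-validation for both classifiers.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 50 count features, two classes.
# Most entries are non-zero in both classes; the class mainly changes
# the magnitude of the first 10 features, not whether they are present.
y = rng.integers(0, 2, size=200)
rates = np.full((200, 50), 3.0)
rates[:, :10] += 3.0 * y[:, None]
X = rng.poisson(rates)

# Default binarize=0.0: values > 0 become 1, everything else becomes 0.
default_nb = BernoulliNB().fit(X, y)

# The same thresholding done by hand, with internal binarization disabled.
X_bin = (X > 0).astype(int)
manual_nb = BernoulliNB(binarize=None).fit(X_bin, y)

same_predictions = np.array_equal(default_nb.predict(X), manual_nb.predict(X_bin))
print("identical predictions:", same_predictions)

# Same cross-validation setup for both models on the raw counts.
multinomial_cv = cross_val_score(MultinomialNB(), X, y, cv=5).mean()
bernoulli_cv = cross_val_score(BernoulliNB(), X, y, cv=5).mean()
print("MultinomialNB mean CV accuracy:", multinomial_cv)
print("BernoulliNB mean CV accuracy:", bernoulli_cv)
```

How large the gap between the two mean scores is depends on how much of the signal sits in the magnitudes rather than in the zero/non-zero pattern, so the numbers from this toy data should not be read as typical.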
