Data Science Asked on January 2, 2022
I have been trying to write a Naive Bayes classifier from scratch that is supposed to predict the class label of the nominal car.arff dataset. However, the classifier always predicts the most common class. I have tried log probabilities and Laplace correction, both to no avail. I have also noticed that the conditional probabilities for any attribute are always greatest for the most common label. Is this because of my dataset? What can be done about it?
Here is my code:
import numpy as np
import pandas as pd
from scipy.io import arff

def parser(path):
    """
    function which parses the data from an arff file
    @param path: string containing the path to file
    @return array containing the data
    @raise FileNotFoundError exception in case if the path does not point to a valid file
    """
    start = 0  # check if data really occurred
    # Declaratives as constant to avoid misspelling in code
    RELATION = 'relation'
    ATTRIBUTE = 'attribute'
    DATA = 'data'
    # Create dictionary holding the arff information
    data = {RELATION: [],
            ATTRIBUTE: [],
            DATA: []}
    # Read the file and analyse the data
    with open(path) as file:
        for line in file.readlines():
            # Check if line is empty
            if line.strip() == '':
                continue
            # Check if line contains the relation
            elif '@' + RELATION in line:
                data[RELATION].append(line.replace('@' + RELATION, '').strip())
            # Check if line contains an attribute
            elif line.startswith('@attribute'):
                tmp = line.replace("{", "").replace("}", "").replace("\n", "").replace("'", "")
                # checks if whitespaces between commas in attributes occur
                if (len(tmp.split(" ")) > 3):
                    values = tmp.replace(",", "").split(" ")[2:]
                else:
                    values = tmp.split(" ")[2].split(",")
                data[ATTRIBUTE].append({'name': tmp.split(" ")[1], 'values': values})
            # check if @data exists
            elif '@' + DATA in line:
                start = 1
            # If the line is not one of the others, it has to be data
            elif '@' + DATA not in line and start:
                line = line.split(',')
                # strip each element of the line
                for i in range(len(line)):
                    line[i] = line[i].strip()
                # Add data to dictionary
                data[DATA].append(line)
    attributes = np.array(data['attribute'])
    out = []
    for i in range(len(data['data'])):
        data_dict = {}
        for j in range(len(attributes)):
            data_dict.update({attributes[j]['name']: data['data'][i][j]})
        out.append(data_dict)
    out = np.array(out)
    return out, data[ATTRIBUTE]

class NaiveBayes():
    def __init__(self, data, atts, class_label):
        self.data = data
        self.atts = atts
        self.class_label = class_label

    def prior(self):
        prior_probabilities = [0,0,0,0]
        for i in range(len(self.data)):
            if self.data[i]['class'] == 'unacc': prior_probabilities[0] += 1
            if self.data[i]['class'] == 'acc': prior_probabilities[1] += 1
            if self.data[i]['class'] == 'good': prior_probabilities[2] += 1
            if self.data[i]['class'] == 'vgood': prior_probabilities[3] += 1
        prior_probabilities = [x/len(self.data) for x in prior_probabilities]
        return prior_probabilities

    def conditionalProbability(self,key,value,length):
        #returns (in our case) 4 vector for one attribute with probabilities for each outcome
        conditional_probabilities = [0]*length
        #definitely not the most efficient way
        for i in range(len(self.data)):
            if self.data[i][key] == value:
                if self.data[i]['class'] == 'unacc': conditional_probabilities[0] += 1
                if self.data[i]['class'] == 'acc': conditional_probabilities[1] += 1
                if self.data[i]['class'] == 'good': conditional_probabilities[2] += 1
                if self.data[i]['class'] == 'vgood': conditional_probabilities[3] += 1
        s = np.sum(conditional_probabilities)
        conditional_probabilities = [x/s for x in conditional_probabilities]
        return conditional_probabilities

    def classification(self, instance):
        cprobs = []
        probs = self.prior()
        for key in instance.keys():
            cprobs.append(self.conditionalProbability(key,instance[key],4))
        print(cprobs)
        #get probabilities
        predicted_class = "unacc"
        for i in range(len(cprobs)-1):
            for j in range(4):
                probs[j]*=cprobs[i][j]
        #print(instance)
        print(probs)
        return probs.index(max(probs))

raw,atts = parser('car.arff')
class_attribute = 'class'
classifier = NaiveBayes(raw,atts,class_attribute)
print(classifier.data[1])
print(classifier.prior())
print(classifier.conditionalProbability('buying','vhigh',4))
print(classifier.classification(classifier.data[0]))
'''
results = [0,0,0,0]
for i in range(len(classifier.data)):
    results[classifier.classification(classifier.data[i])]+=1
print(results)
'''
This is the class distribution and some more information:
% 5. Number of Instances: 1728
% (instances completely cover the attribute space)
%
% 6. Number of Attributes: 6
%
% 7. Attribute Values:
%
%    buying       v-high, high, med, low
%    maint        v-high, high, med, low
%    doors        2, 3, 4, 5-more
%    persons      2, 4, more
%    lug_boot     small, med, big
%    safety       low, med, high
%
% 8. Missing Attribute Values: none
%
% 9. Class Distribution (number of instances per class)
%
%    class      N          N[%]
%    -----------------------------
%    unacc     1210     (70.023 %)
%    acc        384     (22.222 %)
%    good        69     ( 3.993 %)
%    v-good      65     ( 3.762 %)
and here is some sample data:
low,low,5more,more,small,low,unacc
low,low,5more,more,small,med,acc
low,low,5more,more,small,high,good
low,low,5more,more,med,low,unacc
low,low,5more,more,med,med,good
low,low,5more,more,med,high,vgood
low,low,5more,more,big,low,unacc
low,low,5more,more,big,med,good
low,low,5more,more,big,high,vgood
The complete dataset can be found here
Your class distribution is heavily imbalanced, and this can skew the model toward predicting the majority class, which in this case is 'unacc'. One recommendation is to balance the classes, typically by oversampling the minority classes until each matches the majority class in size.
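As a rough illustration of that kind of balancing, here is a minimal sketch of random oversampling with pandas. The helper name `oversample_minority`, the `seed` parameter and the toy frame are assumptions made up for this example; they are not part of the question's code or of car.arff.

```python
import pandas as pd

def oversample_minority(df, label_col="class", seed=0):
    """Duplicate random rows of each smaller class until every class
    has as many rows as the largest one (hypothetical helper)."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        extra = target - len(group)
        if extra > 0:
            # Draw the missing rows from the class itself, with replacement.
            resampled = group.sample(n=extra, replace=True, random_state=seed)
            group = pd.concat([group, resampled])
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy frame with a similar kind of imbalance (illustrative values only).
toy = pd.DataFrame({
    "safety": ["low"] * 6 + ["med"] * 2 + ["high"] * 1,
    "class": ["unacc"] * 6 + ["acc"] * 2 + ["good"] * 1,
})
balanced = oversample_minority(toy)
print(balanced["class"].value_counts())
```

If you try this, resample only the training split, so that duplicated rows do not end up in the data you evaluate on.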
Also, in your sample data there is little to no variation in buying, maint, doors and persons, so within that sample these features would not affect the classification decision.
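A quick way to check this is to count the distinct values per column. The sketch below does that for the nine sample rows quoted in the question; the column names are taken from the dataset description, and loading the rows through `io.StringIO` is just a convenience for this example.

```python
import io
import pandas as pd

# The nine sample rows quoted in the question.
sample = """low,low,5more,more,small,low,unacc
low,low,5more,more,small,med,acc
low,low,5more,more,small,high,good
low,low,5more,more,med,low,unacc
low,low,5more,more,med,med,good
low,low,5more,more,med,high,vgood
low,low,5more,more,big,low,unacc
low,low,5more,more,big,med,good
low,low,5more,more,big,high,vgood
"""
cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv(io.StringIO(sample), header=None, names=cols)

# Number of distinct values in each column of the sample.
distinct = df.nunique()
print(distinct)

# Columns that never change within the sample.
constant = distinct[distinct == 1].index.tolist()
print(constant)
```

Keep in mind that this only describes the nine rows shown. The dataset description says the 1728 instances completely cover the attribute space, so over the full file every feature takes all of its values.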
In this case, I would go back to exploring the data to see which features could affect the classification decision. This can be done with bar plots and histograms. When doing this, split the data by class and plot the distribution of each feature per class, so you can see whether the distributions differ noticeably between classes.
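The per-class comparison could be sketched with `pd.crosstab`, as below. The helper name `feature_by_class` and the toy rows are assumptions for illustration; with the real data you would pass a DataFrame built from car.arff instead.

```python
import pandas as pd

def feature_by_class(df, feature, label_col="class"):
    """Share of each feature value within every class.
    Each column (one per class) sums to 1."""
    return pd.crosstab(df[feature], df[label_col], normalize="columns")

# Toy rows (illustrative values only, not taken from car.arff).
toy = pd.DataFrame({
    "safety": ["low", "low", "low", "med", "med", "high", "high", "high"],
    "class": ["unacc", "unacc", "unacc", "unacc", "acc", "acc", "good", "good"],
})
table = feature_by_class(toy, "safety")
print(table)
# table.plot(kind="bar") would draw the same numbers as grouped bars,
# assuming matplotlib is installed.
```

With `normalize="columns"`, each column is the distribution of the feature within one class, which is the quantity P(value | class) that Naive Bayes multiplies for that class. A feature whose columns look nearly identical carries little information. Normalizing by rows instead gives the share of each class among the rows with a given value, and on imbalanced data that tends to be largest for the majority class whatever the value is.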
Answered by shepan6 on January 2, 2022