Classifiers and accuracy

Question

I would like to ask how to use classifiers and determine the accuracy of models.
I have my dataset and have already cleaned the text (removed stopwords, punctuation, empty rows, ...).
Then I split it into train and test.
Since I want to determine whether an email is spam or not, I used the common classifiers, i.e. Naive Bayes, SVM and logistic regression.
Here I just included my train and test datasets: nothing else!
I am using Python to run this analysis.
My question is: is that enough, or should I implement new algorithms?
If you could provide an example of how an existing algorithm was improved, that would also be helpful.
I have read a lot of literature on the accuracy of text classification, and in all the papers the authors use SVM, Naïve Bayes and logistic regression to classify spam.
But I do not know whether they built their own classifiers or just used the existing ones in Python.
Any experience on this?

Ashwin Geet D'Sa · Accepted Answer

The question mixes two different notions: models (or algorithms) and accuracy. Let me clarify them.

A model (or algorithm) is a classification technique, and 'accuracy' is one of the ways to evaluate a model's performance.

You can choose any model (Naive Bayes, SVM, or deep learning techniques) to implement your classifier. The choice of model is independent of 'accuracy', 'F1', or any other measure you use to test its performance.

First, pre-process the text (remove stopwords, punctuation, etc.). Pre-processing is a choice about what the data should look like before it goes into the model. These choices do influence the model's performance, but not to a great extent when done right. Usually, the same pre-processing is applied to both the train and test sets.
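As a rough illustration of applying one pre-processing routine to both sets, here is a minimal sketch. The `preprocess` helper, the tiny `STOPWORDS` set and the example emails are all hypothetical; a real project would likely use a fuller stopword list from a library.

```python
import string

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "you"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    cleaned = text.lower().translate(PUNCT_TABLE)
    return " ".join(word for word in cleaned.split() if word not in STOPWORDS)


# Made-up example emails.
train_texts = ["Win a FREE prize now!!!", "Are you coming to the meeting?"]
test_texts = ["Claim the free prize, now."]

# The same function is applied to both sets, so the model sees
# test data in the same form as the data it was trained on.
train_clean = [preprocess(t) for t in train_texts]
test_clean = [preprocess(t) for t in test_texts]

print(train_clean)
print(test_clean)
```

Keeping the cleaning logic in one function makes it harder to accidentally treat the two sets differently.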

Model performance: Once you implement your model, you may want to see how well it generalises (its performance on unseen data). So, split the dataset into two parts: a training set and a test set. (Many authors split into three portions: a training set, a validation set (to avoid over-fitting) and a test set.) Train the model on the training set, and use the test set to evaluate its performance.
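The split-then-train step could look roughly like the sketch below, assuming scikit-learn is installed and using its existing classifiers rather than hand-written ones. The toy emails, the 75/25 split and the TF-IDF features are assumptions made for illustration, not something taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up, already cleaned emails; 1 = spam, 0 = not spam.
texts = [
    "win free prize now", "claim free money today", "cheap pills online offer",
    "free lottery winner claim", "urgent offer win cash", "exclusive deal free gift",
    "meeting moved monday morning", "please review attached report",
    "lunch tomorrow noon", "project deadline next week",
    "notes from team call", "can we reschedule meeting",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out 25% of the data as the test set; stratify keeps both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

models = {
    "naive_bayes": MultinomialNB(),
    "svm": LinearSVC(),
    "logistic_regression": LogisticRegression(),
}

predictions = {}
for name, classifier in models.items():
    # The vectorizer is fitted on the training texts only, inside the pipeline.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(X_train, y_train)
    # Predicted labels for the unseen test emails.
    predictions[name] = pipeline.predict(X_test)

for name, predicted in predictions.items():
    print(name, list(predicted), "ground truth:", y_test)
```

A validation set, if wanted, can be carved out by calling `train_test_split` a second time on the training portion.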

Model evaluation: Once the model is trained on the training data, predict the labels on the test set. You now have two sets of labels for the test set: (1) the ground truth (the actual labels given in the test set) and (2) the predicted labels (the labels predicted by the model).
Now, use the evaluation metric of your choice (let's assume that you choose 'accuracy').
Accuracy can be calculated simply as: (number of correctly predicted samples / total number of samples) * 100, where the number of correctly predicted samples is the count of samples for which the ground truth label and the predicted label are the same.
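The accuracy formula can be sketched in a few lines of plain Python. The function name `accuracy_percent` and the two label lists are made up for illustration.

```python
def accuracy_percent(ground_truth, predicted):
    """Return the percentage of samples whose predicted label matches the ground truth."""
    if len(ground_truth) != len(predicted):
        raise ValueError("Both label lists must have the same length.")
    if not ground_truth:
        raise ValueError("At least one sample is required.")
    # Count the positions where the two labels agree.
    correct = sum(1 for truth, guess in zip(ground_truth, predicted) if truth == guess)
    return 100.0 * correct / len(ground_truth)


# Hypothetical test-set labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# 4 of the 5 predictions match the ground truth.
accuracy = accuracy_percent(y_true, y_pred)
print(accuracy)  # 80.0
```

If scikit-learn is available, `sklearn.metrics.accuracy_score` computes the same quantity as a fraction between 0 and 1 rather than a percentage.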
