Cross Validated Asked on December 15, 2021
I have 7000 SMS messages, 6000 ham, 1000 spam. Typical messages are:
Ham: Yo, any way we could pick something up tonight?
Spam: Great News! Call FREEFONE 08006344447 to claim your guaranteed £1000 CASH or £2000 gift.
I want to implement a supervised classifier that would predict the ham/spam label given a new SMS.
The two classifiers I have tried are as follows:
Simple predictor, where I count how many of the following keywords
[
"!", "click", "visit", "reply", "subscribe", "free", "price", "offer",
"claim code", "charge", "stop", "unlimited", "expires", "£",
"new voicemail", "cash prize", "special-call"
]
are substrings of the (lowercased) SMS message, and predict spam if the count is greater than 1, ham otherwise. The method achieves
accuracy (correct guesses ratio): 0.9742822966507177
sensitivity (correct spam guesses ratio): 0.8452380952380952
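A minimal sketch of this keyword predictor, assuming plain substring matching on the lowercased message (the function name is mine):

```python
KEYWORDS = [
    "!", "click", "visit", "reply", "subscribe", "free", "price", "offer",
    "claim code", "charge", "stop", "unlimited", "expires", "£",
    "new voicemail", "cash prize", "special-call",
]

def predict_simple(sms: str) -> str:
    """Predict 'spam' if more than one keyword is a substring of the lowercased SMS."""
    text = sms.lower()
    hits = sum(1 for kw in KEYWORDS if kw in text)
    return "spam" if hits > 1 else "ham"
```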
Bayes (unigram) predictor, where I split the SMS into a token list $L = [t_1, t_2, \ldots, t_n]$ (e.g. for the ham message above, $L$ would be ['yo', 'any', 'way', ..., 'tonight']) and compare the quantities:
$s = P(\text{spam}) \cdot P(t_1 \mid \text{spam}) \cdot \ldots \cdot P(t_n \mid \text{spam})$,
$h = P(\text{ham}) \cdot P(t_1 \mid \text{ham}) \cdot \ldots \cdot P(t_n \mid \text{ham})$,
and predict spam if $s > h$, ham otherwise.
$P(\text{spam})$, $P(\text{ham})$, $P(\text{token} \mid \text{spam})$, and $P(\text{token} \mid \text{ham})$ are estimated from the training data.
This method achieves
accuracy: 0.9881889763779528
sensitivity: 0.9312977099236641
when trained on 4000 messages and tested on the other 3000 messages.
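A minimal sketch of this unigram Bayes predictor; the whitespace tokenizer and the Laplace (add-one) smoothing are my assumptions, since the question does not say how unseen tokens are handled. Summing log-probabilities instead of multiplying raw probabilities avoids floating-point underflow on longer messages:

```python
import math
from collections import Counter

def train_bayes(messages, labels):
    """Collect per-class token counts and per-class document counts."""
    counts = {"spam": Counter(), "ham": Counter()}
    n_docs = Counter(labels)
    for text, label in zip(messages, labels):
        counts[label].update(text.lower().split())  # whitespace tokenizer (assumption)
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, n_docs, vocab

def predict_bayes(sms, counts, n_docs, vocab):
    tokens = sms.lower().split()

    def log_score(label):
        total = sum(counts[label].values())
        # log P(label) + sum of log P(token | label), add-one smoothed (assumption)
        s = math.log(n_docs[label] / sum(n_docs.values()))
        for t in tokens:
            s += math.log((counts[label][t] + 1) / (total + len(vocab)))
        return s

    # Predict spam iff s > h, as in the question.
    return "spam" if log_score("spam") > log_score("ham") else "ham"
```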
What new idea could I try to obtain a classifier with better prediction scores?
Note that I have already tried ‘tuning’ both the Simple predictor (e.g. trying different keyword lists, changing the count threshold, etc.) and the Bayes predictor (e.g. a bigram predictor performs worse due to the limited training set size) to achieve these scores. Now I am looking for a new idea.
Basically, any text classification method can be applied here. If you want to stick with classical ML methods, you can try:
A different model (logistic regression, SVM); a TF-IDF plus logistic regression pipeline is sketched after this list,
Feature engineering (e.g., replacing all phone numbers with a special token, removing stop words, adding n-gram features; for discriminative models, you can weight the input with TF-IDF scores),
Word embeddings (such as GloVe or FastText) as input to a discriminative model.
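A minimal sketch combining the first two suggestions with scikit-learn: TF-IDF-weighted unigram and bigram features feeding a logistic regression, plus the phone-number normalization mentioned above. The variable names (train_messages, train_labels, test_messages, test_labels) stand for the 4000/3000 split from the question and are my own; class_weight="balanced" is one way to compensate for the roughly 6:1 ham/spam imbalance:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.pipeline import make_pipeline

def normalize(text):
    # Feature-engineering idea from the list above: collapse phone numbers
    # (here, any run of 5+ digits) into a single special token.
    return re.sub(r"\d{5,}", " phonenum ", text.lower())

# train_messages, train_labels, test_messages, test_labels are hypothetical
# names for the 4000/3000 split described in the question.
model = make_pipeline(
    TfidfVectorizer(preprocessor=normalize, ngram_range=(1, 2)),  # unigrams + bigrams
    LogisticRegression(max_iter=1000, class_weight="balanced"),   # offset class imbalance
)
model.fit(train_messages, train_labels)
pred = model.predict(test_messages)
print("accuracy:   ", accuracy_score(test_labels, pred))
print("sensitivity:", recall_score(test_labels, pred, pos_label="spam"))
```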
If you do not care about inference time, you can try some neural models. 7k messages should be enough to train a small LSTM classifier, and it is definitely enough to fine-tune BERT or RoBERTa; a rough fine-tuning sketch follows.
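A rough sketch of fine-tuning BERT with the Hugging Face Transformers Trainer; the checkpoint, hyperparameters, and variable names (train_messages, train_labels) are my assumptions, not anything this answer prescribes:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # label 0 = ham, 1 = spam

def tokenize(batch):
    # SMS messages are short, so a small max_length keeps training cheap.
    return tokenizer(batch["text"], truncation=True, max_length=64,
                     padding="max_length")

# train_messages / train_labels are hypothetical names for your training split.
train_ds = Dataset.from_dict({
    "text": train_messages,
    "label": [int(y == "spam") for y in train_labels],
}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sms-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()
```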
Answered by Jindřich on December 15, 2021