TransWikia.com

What type of consideration can be made using clustering?

Data Science Asked on February 7, 2021

I am clustering my data to see what the information looks like and which groups can be identified.
Since clustering is unsupervised, I cannot test the accuracy of the resulting grouping the way I would for a classifier.
So I was wondering what kind of conclusions I can draw after clustering.
For example, if I had many emails with no spam/not spam flag or label, how could I use clustering to split them into two groups and test the ‘accuracy’ of the clustering?

To give more context on what I am trying to do: I have several CSV files with fields such as date, users, email subjects and email bodies.
I would like to run some analysis but, in order to do this, I need to classify the emails as spam/not spam.
I have 23000 emails, so it is very difficult to do this manually. I already put together a list of common words used as spam flags (ads, buy, offer, porn, promotion, …) but, since most of the emails do not contain these words in the title or the body, this first step assigns the ‘spam’ flag to only around 100 emails. Very low! I have tried topic classification (LDA), but it is not very accurate. I then thought of using k-means clustering to assign these labels, after manually labelling around 300 emails.
I do not know if this is the right way to proceed for assigning labels, so comments and answers would be greatly appreciated.

One Answer

This is the basic architecture of a spam filter:

[Image: spam filter architecture diagram, not reproduced here]

Statistically, spam has lower entropy (i.e., higher similarity) than legitimate email.
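One possible way to look at the entropy claim is to pool the words of a group of messages and compute the Shannon entropy of that word distribution. The sketch below is only an illustration of that reading: the function name `token_entropy` and the toy messages are made up for this example and are not taken from the question's data.

```python
import math
from collections import Counter


def token_entropy(messages):
    """Shannon entropy (in bits) of the pooled word distribution of a group of messages."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data: repetitive spam versus varied legitimate mail.
spam = ["buy cheap pills now", "buy cheap pills today", "cheap pills buy now"]
ham = [
    "meeting moved to friday",
    "please review the attached report",
    "lunch tomorrow with the client",
]

print(f"spam entropy: {token_entropy(spam):.2f} bits")
print(f"ham entropy:  {token_entropy(ham):.2f} bits")
```

On this toy data the repetitive spam group scores lower than the varied group. Whether the same holds on real mail depends on the corpus and on the tokenisation, so treat it as a diagnostic rather than a classifier.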

You could use bisecting k-means clustering after topic modelling. With plain k-means you have to specify k up front; the choice of k can change the results drastically, and it can also produce empty clusters.
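To make the idea concrete, here is a minimal from-scratch sketch of bisecting k-means: start with one cluster and repeatedly split the cluster with the largest within-cluster squared error using 2-means. It assumes each email has already been turned into a numeric vector, for example per-document topic weights from a topic model; the function names and the toy vectors are hypothetical. In practice a library implementation such as scikit-learn's `BisectingKMeans` (available from version 1.1) would be the more usual choice.

```python
def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, iters=50):
    """Split one cluster in two with 2-means, using deterministic seeding."""
    # Seeds: the point farthest from the centroid, then the point farthest from that seed.
    c = mean(points)
    a = max(points, key=lambda p: sq_dist(p, c))
    b = max(points, key=lambda p: sq_dist(p, a))
    left, right = list(points), []
    for _ in range(iters):
        new_left = [p for p in points if sq_dist(p, a) <= sq_dist(p, b)]
        new_right = [p for p in points if sq_dist(p, a) > sq_dist(p, b)]
        # Stop if the cluster cannot be split or the assignment no longer changes.
        if not new_right or (new_left == left and new_right == right):
            break
        left, right = new_left, new_right
        a, b = mean(left), mean(right)
    return left, right


def bisecting_kmeans(points, k):
    """Return up to k clusters by repeatedly bisecting the highest-error cluster."""
    clusters = [list(points)]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        left, right = two_means(clusters[idx])
        if not right:  # only duplicates or single points remain, nothing left to split
            break
        clusters[idx:idx + 1] = [left, right]
    return clusters


# Hypothetical document-topic weights (topic 0, topic 1) for six emails.
docs = [
    (0.90, 0.10), (0.85, 0.15), (0.80, 0.20),
    (0.10, 0.90), (0.15, 0.85), (0.20, 0.80),
]
for cluster in bisecting_kmeans(docs, k=2):
    print(cluster)
```

Because every split divides an existing non-empty cluster, this procedure does not leave empty clusters behind, and the intermediate splits give a hierarchy that can be inspected before settling on a final k.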

I would recommend going through this paper, as it highlights this approach.
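On the ‘accuracy’ part of the question: one common check, sketched here as an assumption rather than something taken from the paper, is to compare the cluster assignments of the roughly 300 hand-labelled emails with their labels. Cluster purity assigns each cluster its majority label and reports the fraction of labelled emails that agree with it. The function name and the toy labels below are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Fraction of labelled items that carry the majority label of their cluster."""
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster[cid].append(label)
    # For each cluster, count how many items match its most common label.
    majority_hits = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return majority_hits / len(true_labels)


# Hypothetical cluster ids and manual labels for eight emails.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
true_labels = ["spam", "spam", "spam", "ham", "ham", "ham", "spam", "ham"]

print(cluster_purity(cluster_ids, true_labels))  # 0.75
```

Purity rises trivially as the number of clusters grows (one cluster per email gives 1.0), so it is only meaningful when comparing runs with the same number of clusters.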

Answered by prashant0598 on February 7, 2021

