KMeans clusterization on documents

Question

I am new to data science, so I cannot judge whether my approach is correct.

I have applied k-means to a corpus to which some random documents (very short sentences) have been added.
The documents have been vectorized to make them suitable for clustering.

With the clustering results at hand, I was expecting each vector (keyword) to fall into exactly one cluster.
This is not the case.

In some cases a vector falls into two clusters, and I wonder why.

Is this because k-means is inappropriate for vectors built from documents?
Is this normal given how k-means works (moving the centroids, but in effect assigning each object to the nearest cluster by distance)?
Is this overlap due to the fact that, when analysing my results, I assess every item within a cluster and not just (say) the top X nearest to the center?

-- 
Example:

corpus = [
'The car is driven on the road.',
'The truck is driven on the highway.',
'The train run on the tracks.',
'The bycicle is run on the pavement.',
'The flight is conducted in the air.',
'The baloon is conducted in the air.',
'The bird is flying in the air.',
'The man is walking in the street.',
'The pedestrian is crossing the zebra.',
'The pilot flights the plane].',
'On the route, the car is driven.',
'On the road, the truck is moved.',
'The train is running on the tracks.',
'The bike is running on the pavement.',
'The flight takes place in the sky.',
'Birds don''t fly when is dark',
'The baloon is in the water.',
'The bird flies in the sky.',
'In the road, the guy walks.',
'The pedestrian is passing through the zebra.',
'The pilot is flying the plane.',    
'This is a Japanese doll.',
'I really want to go to work, but I am too sick to drive.',
'Christmas is coming.',
'With the daylight saving time turned off it''s getting dark soon.',
'The body fat may compensates for the loss of nutrients.',
'Mary plays the piano.',
'She always speaks to him in a loud voice.',
'Wow, does that work?',
'I don''t like walking when it is dark',
'Last Friday in three week’s time I saw a spotted striped blue worm shake hands with a legless lizard.',
'My Mum tries to be cool by saying that she likes all the same things that I do.',
'Mummy is saying that she loves me being a pilot when in reality she is scared all the time I take off.',    
'Where do random thoughts come from?',
'A glittering gem is not enough.',
'We need to rent a room for our party.',
'A purple pig and a green donkey flew a kite in the middle of the night and ended up sunburnt.',
'If I don’t like something, I’ll stay away from it.',
'The body may perhaps compensates for the loss of a true metaphysics.',
'Don''t step on the broken glass.',
'It was getting dark, and we weren''t there yet.', 
'Playing an instrument like the guitar takes out the stress from my day.']

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    analyzer='word',
    max_df=0.8,
    max_features=50000,
    lowercase=True,
)

X = vectorizer.fit_transform(corpus)

from sklearn.cluster import KMeans

num_clusters = 11
kmean = KMeans(n_clusters=num_clusters, random_state=1021)
clusters = kmean.fit_predict(X)

--

If you explore the clusters variable, you will notice the overlaps I am talking about.
For instance, the keyword baloon appears in both cluster 10 and cluster 0.

There are 12 overlaps, which on a dataset of 33 unique keywords is about a third, so not a result I can be happy with.

Any advice is appreciated.
Thanks

hssay · Answer

Let us assume that your corpus has n distinct keywords. For the k-means algorithm, each keyword is an axis in an n-dimensional space, and each document is a point in that space.

The k-means algorithm will allocate each point (a document) to a single cluster. When you say a keyword appears in two clusters, it probably means that this particular dimension/keyword is important for both clusters.
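
To make this concrete, here is a minimal sketch. It does not use the data from the question: the six sentences and k = 3 are assumptions chosen only for illustration. It clusters the documents and then lists, for each keyword, the clusters whose documents contain it.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not the one from the question.
docs = [
    "The balloon is in the air.",
    "The bird is flying in the air.",
    "The balloon is in the water.",
    "The fish swims in the water.",
    "The car is driven on the road.",
    "The truck is driven on the road.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # one row per document, one column per keyword

k = 3  # assumed number of clusters, for illustration only
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# labels holds exactly one cluster id per document.
print(labels.shape)

# A keyword "shows up" in every cluster that holds a document containing it.
dense = X.toarray()
keyword_clusters = defaultdict(set)
for term, col in vectorizer.vocabulary_.items():
    for row in dense[:, col].nonzero()[0]:
        keyword_clusters[term].add(int(labels[row]))

# Keywords that appear in more than one cluster.
shared = {t: sorted(c) for t, c in keyword_clusters.items() if len(c) > 1}
print(shared)
```

A word such as "the", which occurs in every sentence, is bound to show up in more than one cluster even though no document is assigned twice.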

Let us take a hypothetical example: suppose you have patients' blood pressure, cholesterol levels and a bunch of other medical parameters, and you discretize blood pressure into 2 or 3 levels. If you run k-means on this data, each patient will be assigned to a single cluster. But it is quite possible that two (or even more) clusters all contain patients with systolic blood pressure > 120.
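
A rough sketch of that situation, with entirely made-up numbers (the patient values, the two features and k = 4 are all assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up patients: [systolic blood pressure, cholesterol].
# Illustrative values only, not real medical data.
patients = np.array([
    [150, 160], [152, 158], [148, 163],  # high pressure, low cholesterol
    [151, 262], [149, 258], [153, 260],  # high pressure, high cholesterol
    [110, 161], [112, 159], [108, 162],  # low pressure, low cholesterol
    [111, 259], [109, 261], [113, 257],  # low pressure, high cholesterol
], dtype=float)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(patients)

# One label per patient.
print(labels)

# The property "systolic > 120" is nevertheless spread over several clusters.
high_bp = patients[:, 0] > 120
clusters_with_high_bp = sorted(set(labels[high_bp].tolist()))
print(clusters_with_high_bp)
```

Each patient sits in one cluster, but "high blood pressure" is a property of more than one cluster, in the same way a keyword can be.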

You probably need to read the results of the k-means more carefully.

Romain Reboulleau · Answer

I think you may be mixing things up. In the example you provided, there are 42 sentences; each is transformed by TfidfVectorizer, which gives a sparse matrix of shape (42, 174). Then the vector representation of each sentence is clustered with k-means, so each sentence is assigned to one cluster.

Single words are not processed, only whole sentences. If the "baloon" keyword appears in two sentences, it does not necessarily mean that both sentences will fall into the same cluster. However, I am surprised by what you state, because the two sentences containing "baloon" do fall into the same cluster (#7). This makes me think that you misinterpreted the results.

>>> import numpy as np
>>> np.argwhere(["baloon" in sentence for sentence in corpus])
array([[ 5],
       [16]], dtype=int64)
>>> clusters[5]
7
>>> clusters[16]
7
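
The same check can be done for all sentences at once by grouping them by label. Below is a small sketch; `group_by_cluster` is a hypothetical helper of mine, and the labels are made up so that the snippet runs on its own. With the code from the question one would call `group_by_cluster(corpus, clusters)` instead.

```python
from collections import defaultdict


def group_by_cluster(sentences, labels):
    """Return {cluster_label: [sentences...]} (hypothetical helper)."""
    if len(sentences) != len(labels):
        raise ValueError("need exactly one label per sentence")
    groups = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        groups[int(label)].append(sentence)
    return dict(groups)


# Made-up labels for illustration only.
sentences = [
    "The baloon is conducted in the air.",
    "The baloon is in the water.",
    "Christmas is coming.",
]
labels = [7, 7, 2]

groups = group_by_cluster(sentences, labels)
for label, members in sorted(groups.items()):
    print(label, members)
```

Because the grouping goes through one label per sentence, a sentence can never be listed under two clusters.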

That said, sentences containing "baloon" could fall into different clusters. This depends on the other words in the sentence, the number of clusters, the rest of the dataset and the clustering method. For instance, it could happen if the sentences containing "baloon" were otherwise not very similar.
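
One way to see how much the other words matter is to compare TF-IDF cosine similarities. The sketch below uses three sentences picked from the question's corpus, but fits the vectorizer on those three only, so the numbers are not the ones from the full 42-sentence run. TfidfVectorizer L2-normalises rows by default, and for unit-length vectors a higher cosine similarity means a smaller Euclidean distance, which is what k-means works with.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three sentences taken from the question; fitted in isolation here,
# which is an assumption made to keep the example small.
docs = [
    "The baloon is conducted in the air.",   # 0: contains "baloon"
    "The baloon is in the water.",           # 1: contains "baloon"
    "The flight is conducted in the air.",   # 2: no "baloon"
]

X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)  # dense (3, 3) matrix of pairwise similarities

print("baloon vs baloon:", sim[0, 1])
print("baloon vs flight:", sim[0, 2])
```

In this small setting sentence 0 is closer to the "flight" sentence than to the other "baloon" sentence, because it shares more of its remaining words with it.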
