Human readable format for clusters of word vectors

Question

Let's say I have a pretrained word2vec model and apply it to a dataset of article titles from "The Guardian". It seems fairly obvious that titles from the "Science" section would form one cluster in the latent space and titles from the "Fashion" section would form another. However, my dataset doesn't have a category label for each title. How can I come up with a human-readable interpretation of the cluster centers (probably coming from K-means)?

Erwan · Answer

The usual way is to present the top N (e.g. top 10) words for the cluster:

With distance-based clustering like K-means, the top words can be picked as the closest ones to the centroid.
With probabilistic methods such as LDA, the top words are the ones with the highest probability for the topic.
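
For the K-means case, a minimal sketch of the "closest words to the centroid" idea might look like the following. It assumes the centroids live in the same space as the word vectors (for example, because each title was embedded as the average of its word vectors). The names `top_words`, `word_vectors` and `centroids`, and the toy 2-d vectors, are hypothetical and only for illustration; in practice the vectors would come from the word2vec model and the centroids from the fitted clustering.

```python
import math

def top_words(centroid, word_vectors, n=10):
    """Return the n words whose vectors are closest to the centroid."""
    # Rank the vocabulary by Euclidean distance to the centroid, nearest first.
    ranked = sorted(word_vectors, key=lambda w: math.dist(centroid, word_vectors[w]))
    return ranked[:n]

# Toy 2-d vectors standing in for real word2vec embeddings (assumption).
word_vectors = {
    "physics": [0.9, 0.1],
    "genome": [0.8, 0.2],
    "runway": [0.1, 0.9],
    "couture": [0.2, 0.8],
}

# Toy centroids standing in for the K-means cluster centers (assumption).
centroids = [[1.0, 0.0], [0.0, 1.0]]

# One list of descriptive words per cluster.
labels = [top_words(c, word_vectors, n=2) for c in centroids]
for i, words in enumerate(labels):
    print(f"cluster {i}: {', '.join(words)}")
```

Euclidean distance is used here because it matches the K-means objective; cosine similarity is a common alternative for word vectors and may give more intuitive words when vector norms vary a lot.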
