Human readable format for clusters of word vectors

Data Science Asked by Arek Żyłkowski on June 8, 2021

Let’s say I have a pretrained word2vec model and apply it to a dataset of article titles from "The Guardian". It seems fairly obvious that titles from the "Science" section would form one cluster in the latent space and titles from the "Fashion" section would form another. The problem is that my dataset doesn’t have a category label for each title. How can I come up with a human-readable interpretation of the cluster centers (probably coming from K-means)?

One Answer

The usual way is to present the top N (e.g. top 10) words for the cluster:

  • With distance-based clustering like K-means, the top words are the ones whose vectors lie closest to the cluster centroid.
  • With probabilistic methods such as LDA, the top words are the ones with the highest probability for the topic.
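For the K-means case, the idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `top_words_per_centroid`, the toy 2-D vectors, and the hand-picked centroids are all made up for the example; in practice the vectors would come from the word2vec model and the centroids from whatever K-means implementation is used.

```python
import math

def top_words_per_centroid(word_vectors, centroids, top_n=10):
    """Return, for each centroid, the top_n words whose vectors are closest to it.

    word_vectors: dict mapping word -> sequence of floats (hypothetical input;
                  in practice taken from the word2vec vocabulary)
    centroids:    list of sequences of floats, e.g. K-means cluster centers
    """
    labels = []
    for centroid in centroids:
        # Rank the whole vocabulary by Euclidean distance to this centroid.
        ranked = sorted(
            word_vectors,
            key=lambda word: math.dist(word_vectors[word], centroid),
        )
        labels.append(ranked[:top_n])
    return labels

# Toy 2-D "embeddings", invented for illustration only.
toy_vectors = {
    "physics": (0.9, 0.1),
    "telescope": (1.0, 0.05),
    "genome": (0.8, 0.2),
    "dress": (0.1, 0.9),
    "runway": (0.0, 1.0),
    "designer": (0.3, 0.8),
}
# Assumed centroids, standing in for the output of K-means.
toy_centroids = [(0.9, 0.1), (0.1, 0.9)]

labels = top_words_per_centroid(toy_vectors, toy_centroids, top_n=3)
for i, words in enumerate(labels):
    print(f"cluster {i}: {', '.join(words)}")
```

With a gensim model, a similar result could probably be obtained with `model.wv.similar_by_vector(centroid, topn=10)`, with the caveat that it ranks by cosine similarity rather than the Euclidean distance K-means uses, so the ordering may differ slightly.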

Answered by Erwan on June 8, 2021
