
Distance to the Center of Vocabulary of Word Embeddings

Data Science Asked by user2969402 on January 1, 2021

Suppose I:

  1. Generate a set of word embeddings (using word2vec or similar) based on a large but specific corpus
  2. Compute the centroid of all the words in the set
  3. Find the word(s) with the smallest (say Euclidean for instance) distance to that centroid

Would I be correct to assume that this word (or these words) represents the core concept, or something like the main theme, of the corpus?
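The three steps above can be sketched with NumPy. The vocabulary and vectors below are toy stand-ins; with a real model you would use the embedding matrix and vocabulary produced by word2vec (e.g. gensim's `model.wv`):

```python
import numpy as np

# Toy embedding matrix: one row per word. These vectors are illustrative
# stand-ins for embeddings from a trained word2vec model.
words = ["cat", "dog", "fish", "car"]
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [-1.0, 0.5],
])

# Step 2: centroid of all word vectors.
centroid = vectors.mean(axis=0)

# Step 3: word with the smallest Euclidean distance to the centroid.
distances = np.linalg.norm(vectors - centroid, axis=1)
central_word = words[int(np.argmin(distances))]
print(central_word)
```

Note that for embeddings trained with a dot-product objective, cosine distance to the (normalized) centroid is often preferred over Euclidean distance, since vector norms correlate with word frequency.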

Now suppose I were to do the same thing diachronically, training a separate set of word vectors on a separate corpus for each of the last five decades (one set of embeddings for the 2010s, one for the 2000s, etc.). Could I capture something like a shift in zeitgeist over time?
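The diachronic variant just repeats the centroid computation per decade. A minimal sketch, assuming each decade's model has been reduced to a vocabulary list and an embedding matrix (the decade labels, words, and vectors here are hypothetical placeholders):

```python
import numpy as np

def central_word(vocab, matrix):
    """Return the word nearest (Euclidean) to the centroid of the matrix."""
    centroid = matrix.mean(axis=0)
    distances = np.linalg.norm(matrix - centroid, axis=1)
    return vocab[int(np.argmin(distances))]

# Hypothetical per-decade embeddings: decade -> (vocab, embedding matrix).
# In practice each matrix comes from a model trained on that decade's corpus.
decade_embeddings = {
    "2000s": (["war", "internet", "email"],
              np.array([[0.2, 0.9], [0.8, 0.4], [0.7, 0.5]])),
    "2010s": (["social", "mobile", "war"],
              np.array([[0.9, 0.3], [0.8, 0.2], [0.1, 0.9]])),
}

shift = {decade: central_word(vocab, mat)
         for decade, (vocab, mat) in decade_embeddings.items()}
print(shift)
```

One caveat with this setup: independently trained embedding spaces are not directly comparable (word2vec is invariant under rotation of the space), but the nearest-to-centroid word is computed within each space, so no cross-decade alignment (e.g. Procrustes) is needed for this particular question.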

I am aware of some previous research on semantic drift using embeddings, such as HistWords [https://nlp.stanford.edu/projects/histwords/]. However, that work tracks the position of a single, predetermined word over time. I would be more interested in the "discovery" of central concepts at certain points in time in a specific discursive corpus.
