
Distance to the Center of Vocabulary of Word Embeddings

Data Science Asked by user2969402 on January 1, 2021

Suppose I:

  1. Generate a set of word embeddings (using word2vec or similar) based on a large but specific corpus
  2. Compute the centroid of all the words in the set
  3. Find the word(s) with the smallest (say Euclidean for instance) distance to that centroid

Would I be correct to assume that this word (or these words) represents the core concept, or something like the main theme, of the corpus?
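The three steps above can be sketched with NumPy. The vocabulary and vectors below are toy stand-ins; with a real model you would use the embedding matrix and vocabulary produced by word2vec (e.g. gensim's `model.wv`):

```python
import numpy as np

# Toy embedding matrix: one row per word. These vectors are illustrative
# stand-ins for embeddings from a trained word2vec model.
words = ["cat", "dog", "fish", "car"]
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [-1.0, 0.5],
])

# Step 2: centroid of all word vectors.
centroid = vectors.mean(axis=0)

# Step 3: word with the smallest Euclidean distance to the centroid.
distances = np.linalg.norm(vectors - centroid, axis=1)
central_word = words[int(np.argmin(distances))]
print(central_word)
```

Note that for embeddings trained with a dot-product objective, cosine distance to the (normalized) centroid is often preferred over Euclidean distance, since vector norms correlate with word frequency.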

Now suppose I were to do the same thing diachronically, training a separate set of word vectors on a separate corpus for each of the last five decades (one set of embeddings for the 2010s, one for the 2000s, etc.). Could I capture something like a shift in zeitgeist over time?
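The diachronic variant just repeats the centroid computation per decade. A minimal sketch, assuming each decade's model has been reduced to a vocabulary list and an embedding matrix (the decade labels, words, and vectors here are hypothetical placeholders):

```python
import numpy as np

def central_word(vocab, matrix):
    """Return the word nearest (Euclidean) to the centroid of the matrix."""
    centroid = matrix.mean(axis=0)
    distances = np.linalg.norm(matrix - centroid, axis=1)
    return vocab[int(np.argmin(distances))]

# Hypothetical per-decade embeddings: decade -> (vocab, embedding matrix).
# In practice each matrix comes from a model trained on that decade's corpus.
decade_embeddings = {
    "2000s": (["war", "internet", "email"],
              np.array([[0.2, 0.9], [0.8, 0.4], [0.7, 0.5]])),
    "2010s": (["social", "mobile", "war"],
              np.array([[0.9, 0.3], [0.8, 0.2], [0.1, 0.9]])),
}

shift = {decade: central_word(vocab, mat)
         for decade, (vocab, mat) in decade_embeddings.items()}
print(shift)
```

One caveat with this setup: independently trained embedding spaces are not directly comparable (word2vec is invariant under rotation of the space), but the nearest-to-centroid word is computed within each space, so no cross-decade alignment (e.g. Procrustes) is needed for this particular question.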

I am aware of some previous research on semantic drift using embeddings, such as HistWords [https://nlp.stanford.edu/projects/histwords/]. However, that work tracks the position of a single, predetermined word over time. I would be more interested in the "discovery" of central concepts at certain points in time in a specific discursive corpus.
