
How can I get a measure of the semantic similarity of words?

Data Science Asked on May 24, 2021

What is the best way to figure out the semantic similarity of words? Word2Vec is okay, but not ideal:

# Using the 840B word Common Crawl GloVe vectors with gensim:

# 'hot' is closer to 'cold' than 'warm'
In [7]: model.similarity('hot', 'cold')
Out[7]: 0.59720456121072973

In [8]: model.similarity('hot', 'warm')
Out[8]: 0.56784095376659627

# Cold is much closer to 'hot' than 'popular'
In [9]: model.similarity('hot', 'popular')
Out[9]: 0.33708479049537632

NLTK’s Wordnet methods appear to just give up:

In [24]: from nltk.corpus import wordnet as wn

In [25]: print(wn.synset('hot.a.01').path_similarity(wn.synset('warm.a.01')))
None

What are other options?

5 Answers

In Text Analytic Tools for Semantic Similarity, the authors developed an algorithm to find the similarity between two sentences. But if you read closely, they first compute a matrix of word-to-word similarities and then sum over it to score the sentence pair, so the underlying word-level measure might be worth trying on its own.

Also, SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation explains the difference between association and similarity, which is probably the reason for your observation as well. For example, coffee and cup: they are not similar, but they are strongly associated, so a measure that mixes the two notions gives different results. The authors suggest various models for estimating each.

Answered by Hima Varsha on May 24, 2021

Word2vec does not separate synonyms from antonyms: it gives a higher similarity whenever two words appear in similar contexts. E.g. "The weather in California was ___": the blank can be filled by both hot and cold, so their similarity ends up high. This is known as a paradigmatic relation.
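You can see this with the same gensim model as in the question (a minimal sketch, assuming model is the already-loaded vectors): listing the nearest neighbours of 'hot' will typically rank context-sharing words such as 'cold' near the top.

# Assumes `model` is the gensim vector model loaded in the question.
# Antonyms share contexts ("The weather in California was ___"),
# so they tend to appear among the nearest neighbours.
for word, score in model.most_similar('hot', topn=10):
    print(word, round(score, 3))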

If you are interested in capturing relations such as hypernymy, hyponymy, synonymy and antonymy, you need a WordNet-based similarity measure. There are many of these (path, Leacock-Chodorow, Wu-Palmer, and so on), most of them available directly through NLTK.
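As a minimal sketch with NLTK (note that path_similarity is only defined over the noun/verb hypernym hierarchy, which is why the adjective pair in the question returned None):

from nltk.corpus import wordnet as wn

# Explicit lexical relations rather than usage statistics:
hot = wn.synset('hot.a.01')
print(hot.lemmas()[0].antonyms())              # antonym relation -> 'cold'
print([s.name() for s in hot.similar_tos()])   # adjectives WordNet groups with 'hot'

# Hypernym-based similarity works for nouns and verbs:
dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
print(dog.wup_similarity(cat))                 # Wu-Palmer similarity
print(dog.path_similarity(cat))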

Answered by Trideep Rath on May 24, 2021

Word2vec is a good starting point for most scenarios. It captures semantics by way of prediction (the CBOW or skip-gram objective), and it supports analogies via vector offsets, the most repeated example being V(king) - V(queen) ≈ V(man) - V(woman), and so on.
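With the gensim model from the question this arithmetic is a one-liner (a sketch, assuming model is the loaded vector set):

# king - man + woman: 'queen' is typically among the top results
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=5))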

So what is the problem? The issue lies in word sense ambiguity. Whenever a word has two different meanings in two different contexts, its single vector is pulled between them and ends up far from either: Python ~ boa (both snakes) versus Python ~ Java (both programming languages).
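You can check this with the same model (a sketch; it assumes these lowercase words are in the vocabulary): both senses compete inside the single vector for 'python'.

print(model.similarity('python', 'boa'))      # snake sense
print(model.similarity('python', 'java'))     # programming-language sense
print(model.most_similar('python', topn=10))  # neighbours mix both senses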

Any alternative?

For the very specific purpose of "synonyms", WordNet would be the ideal place. It captures explicit relationships between two words rather than implicit relations based on usage and co-occurrence.

WordNet is essentially hand-crafted like a dictionary, whereas word2vec is mined from usage.

Answered by Dipan Mehta on May 24, 2021

Without any surrounding context, I think it is nearly impossible to determine the closeness of words. What you can do is use lexicon vectors: if two words have close values across the lexicons, then the words should be considered close.

Answered by Josh on May 24, 2021

GloVe Will "Most Likely" Work For Your Purposes

I found myself with a question similar to yours about a month ago. I met with some fellow data scientists who had more experience with NLP word vectorization than me. After reviewing many options, I felt that Global Vectors (GloVe) would work best for me. It is doing well for my purposes: by training on my own specialized corpora (plural of corpus, i.e. a collection of documents), I was able to get good utility for my synonym-searching needs.

The process was introduced HERE, and clarified for me HERE, but I found the most help as a Python user HERE, which gives guidance on how to use trained models. Using Python, I found that I could only make use of

pip install glove==1.0.0 per THIS StackOverflow Answer

Follow THIS for an idea of how to train on your own corpus. If you need to train your GloVe model from your own corpus, 80%+ of your work will be deciding how to collect and condition that corpus to build your vocabulary and co-occurrence matrix; you want to do this part very well. Justification for training on a specialized corpus in another domain is reported HERE, which was encouraging to find before doing the training work on a specialized corpus. I encourage anyone to evaluate whether or not this is necessary for your application.
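For the "use the trained model" step, here is a rough sketch of loading pretrained GloVe vectors into gensim and running a synonym-style search (the file name and the no_header argument are assumptions; older gensim versions need the glove2word2vec conversion script instead):

from gensim.models import KeyedVectors

# GloVe text files have no header line, so tell gensim not to expect one
# (supported in gensim >= 4.0).
vectors = KeyedVectors.load_word2vec_format(
    'glove.840B.300d.txt', binary=False, no_header=True)

# Synonym-style search: nearest neighbours in the embedding space.
print(vectors.most_similar('hot', topn=10))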

I'm in the process of having domain experts blindly evaluate pretrained models against the models trained on our corpora. I'll try to remember to update this post once I have those results.

Answered by Thom Ives on May 24, 2021
