Data Science Asked by nassimhddd on September 15, 2020
I recently saw a cool feature that was once available in Google Sheets: you start by writing a few related keywords in consecutive cells, say: “blue”, “green”, “yellow”, and it automatically generates similar keywords (in this case, other colors). See more examples in this YouTube video.
I would like to reproduce this in my own program. I’m thinking of using Freebase, and it would work like this intuitively:
As I’m not familiar with this area, my questions are:
The word2vec algorithm may be a good way to retrieve more elements for a list of similar words. It is an unsupervised "deep learning" algorithm that has previously been demonstrated with Wikipedia-based training data (helper scripts are provided on the Google code page).
There are currently C and Python implementations. This tutorial by Radim Řehůřek, the author of the Gensim topic modelling library, is an excellent place to start.
The "single topic" demonstration on the tutorial is a good example of retreiving similar words to a single term (try searching on 'red' or 'yellow'). It should be possible to extend this technique to find the words that have the greatest overall similarity to a set of input words.
Correct answer by joews on September 15, 2020
Have you considered a frequency-based approach exploiting simple word co-occurrence in corpora? At least, that's what I've seen most folks use for this. I think it's covered briefly in Manning and Schütze's book, and I seem to remember something like this as a homework assignment back in grad school...
More background here.
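As a toy illustration of the idea (my sketch, not from the book or the assignment), window-based co-occurrence counting takes only a few lines:

```python
# Count how often each word appears within a +/-window of a target word
# in a tokenized corpus. Illustration only; real corpora need tokenization,
# lowercasing, and stop-word handling.
from collections import Counter

def cooccurrences(tokens, target, window=2):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

tokens = "the sky is blue and the grass is green and the sun is yellow".split()
print(cooccurrences(tokens, "blue").most_common(3))
```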
For this step:
Rank other concepts based on their "distance" to the original keywords;
There are several semantic similarity metrics you could look into. Here's a link to some slides I put together for a class project using a few of these similarity metrics in WordNet.
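For example (my own illustration, not taken from the slides), NLTK exposes several WordNet similarity metrics directly:

```python
# Requires: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

blue = wn.synset("blue.n.01")
green = wn.synset("green.n.01")
dog = wn.synset("dog.n.01")

# Path similarity: inverse shortest-path length in the hypernym graph.
print(blue.path_similarity(green), blue.path_similarity(dog))

# Wu-Palmer similarity: based on the depth of the lowest common ancestor.
print(blue.wup_similarity(green), blue.wup_similarity(dog))
```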
Answered by Charlie Greenbacker on September 15, 2020
This is one of those nice problems whose scope can range from a homework assignment to a Google-sized project.
Indeed, you can start with co-occurrence of the words (e.g., conditional probability). You will quickly discover that you get stop words as the terms most related to almost every word, simply because stop words are so frequent. Using lift instead of raw conditional probability takes care of the stop words, but makes the relation prone to error when counts are small (which covers most of your cases). You might try the Jaccard coefficient, but since it is symmetric there will be many relations it won't find.
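To make those three measures concrete, here is a small sketch under my own notation (n_x = documents containing x, n_xy = documents containing both x and y, N = total documents):

```python
# Toy implementations of the measures discussed above.
def cond_prob(n_xy, n_x):
    return n_xy / n_x                   # P(y | x)

def lift(n_xy, n_x, n_y, N):
    return (n_xy * N) / (n_x * n_y)     # P(x, y) / (P(x) * P(y))

def jaccard(n_xy, n_x, n_y):
    return n_xy / (n_x + n_y - n_xy)    # symmetric, as noted above

# Made-up counts: "blue" in 100 docs, "green" in 80, both in 40, N = 1000.
print(cond_prob(40, 100), lift(40, 100, 80, 1000), jaccard(40, 100, 80))
```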
Then you might consider only relations that appear within a short distance of the base word. You can (and should) compute relations based both on a general corpus (e.g., Wikipedia) and on user-specific data (e.g., the user's emails).
Very quickly you will have plenty of relatedness measures, all of them reasonable and each with some advantage over the others.
In order to combine such measures, I like to reduce the problem to a classification problem.
You should build a dataset of word pairs and label them as "related" or not. In order to build a large labeled dataset you can:
Then use all the measures you have as features of the pairs. Now you are in the domain of a supervised classification problem: build a classifier on the dataset, evaluate it according to your needs, and you will get a similarity measure that fits them.
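A minimal sketch of that reduction with scikit-learn (the feature values and labels below are made-up placeholders):

```python
# Each word pair becomes a feature vector of relatedness measures; the
# classifier's predicted probability serves as the combined similarity.
from sklearn.linear_model import LogisticRegression

# Features per pair: [cond_prob, lift, jaccard, word2vec_cosine]
X = [
    [0.40, 5.0, 0.25, 0.71],   # ("blue", "green")   -> related
    [0.02, 0.8, 0.01, 0.12],   # ("blue", "carpet")  -> unrelated
    [0.35, 4.2, 0.22, 0.65],   # ("green", "yellow") -> related
    [0.01, 0.5, 0.01, 0.08],   # ("green", "tax")    -> unrelated
]
y = [1, 0, 1, 0]               # "is related" labels

clf = LogisticRegression().fit(X, y)
# Probability of relatedness for a new pair acts as the learned measure.
print(clf.predict_proba([[0.30, 3.5, 0.20, 0.60]])[0, 1])
```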
Answered by DaL on September 15, 2020