How to grow a list of related words based on initial keywords?

Question

I recently saw a cool feature that was once available in Google Sheets: you start by writing a few related keywords in consecutive cells, say: "blue", "green", "yellow", and it automatically generates similar keywords (in this case, other colors). See more examples in this YouTube video.

I would like to reproduce this in my own program. I'm thinking of using Freebase, and it would work like this intuitively:

Retrieve the list of given words in Freebase;
Find their "common denominator(s)" and construct a distance metric based on this;
Rank other concepts based on their "distance" to the original keywords;
Display the next closest concepts.

As I'm not familiar with this area, my questions are:

Is there a better way to do this?
What tools are available for each step?

joews · Accepted Answer

The word2vec algorithm may be a good way to retrieve more elements for a list of similar words. It is an unsupervised "deep learning" algorithm that has previously been demonstrated with Wikipedia-based training data (helper scripts are provided on the Google code page).

There are currently C and Python implementations. This tutorial by Radim Řehůřek, the author of the Gensim topic modelling library, is an excellent place to start.

The "single topic" demonstration in the tutorial is a good example of retrieving words similar to a single term (try searching on 'red' or 'yellow'). It should be possible to extend this technique to find the words that have the greatest overall similarity to a set of input words.
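To make the "set of input words" idea concrete, here is a minimal sketch that ranks candidate words by their mean cosine similarity to the seed words. The 3-dimensional vectors and the names TOY_VECTORS and expand are made up for illustration; real word2vec vectors are learned from a corpus and are much larger. If you use Gensim, I believe KeyedVectors.most_similar accepts a list of positive words and may do the job directly.

```python
import math

# Toy vectors invented for illustration only; they are NOT real word2vec output.
TOY_VECTORS = {
    "blue":   [0.90, 0.10, 0.00],
    "green":  [0.80, 0.20, 0.10],
    "yellow": [0.85, 0.15, 0.05],
    "red":    [0.88, 0.12, 0.02],
    "purple": [0.82, 0.10, 0.12],
    "dog":    [0.10, 0.90, 0.20],
    "car":    [0.05, 0.20, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expand(seeds, vectors, top_n=3):
    """Rank non-seed words by mean cosine similarity to all seed words."""
    scores = {}
    for word, vec in vectors.items():
        if word in seeds:
            continue
        scores[word] = sum(cosine(vec, vectors[s]) for s in seeds) / len(seeds)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# With these toy vectors the two colour words outrank "dog" and "car".
print(expand(["blue", "green", "yellow"], TOY_VECTORS, top_n=2))
```

Averaging the similarities is only one possible choice; taking the minimum over the seeds would instead favour words that are close to every seed.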

Charlie Greenbacker · Answer

Have you considered a frequency-based approach exploiting simple word co-occurrence in corpora? At least, that's what I've seen most folks use for this. I think it might be covered briefly in Manning and Schütze's book, and I seem to remember something like this as a homework assignment back in grad school...

More background here.

For this step:

Rank other concepts based on their "distance" to the original keywords;

There are several semantic similarity metrics you could look into. Here's a link to some slides I put together for a class project using a few of these similarity metrics in WordNet.
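As a rough illustration of one such metric, here is a sketch of path similarity, computed as 1 / (1 + length of the shortest is-a path between two concepts). The tiny taxonomy below is hypothetical and invented for the example; WordNet's real hierarchy is far larger and a word can have several senses, which this sketch ignores.

```python
# Hypothetical mini-taxonomy (child -> parent), invented for illustration.
PARENT = {
    "blue": "color",
    "green": "color",
    "color": "attribute",
    "attribute": "entity",
    "dog": "animal",
    "animal": "entity",
}

def ancestors(node):
    """Return the chain from node up to the root, starting with node itself."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest is-a path between a and b)."""
    up_a = ancestors(a)
    depth_b = {node: i for i, node in enumerate(ancestors(b))}
    for i, node in enumerate(up_a):
        if node in depth_b:
            # First shared ancestor on the way up gives the shortest path in a tree.
            return 1.0 / (1 + i + depth_b[node])
    return 0.0  # no common ancestor

print(path_similarity("blue", "green"))  # two edges apart (via "color")
print(path_similarity("blue", "dog"))    # five edges apart (via "entity")
```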

DaL · Answer

This is one of those nice problems whose scope can range from a homework assignment to a Google-sized project.

Indeed, you can start with co-occurrence of the words (e.g., conditional probability).
You will quickly discover that stop words come out as the most related to almost every word, simply because they are very common.
Using the lift of the conditional probability takes care of the stop words, but makes the relation error-prone when counts are small (which will be most of your cases).
You might try Jaccard, but since it is symmetric there will be many relations it won't find.
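A small sketch of the three measures, computed on a made-up corpus where each document is a set of words. DOCS, cooccurrence_stats and measures are names invented for this example, and the sketch assumes both words occur at least once (otherwise it divides by zero).

```python
from collections import Counter
from itertools import combinations

# Tiny invented corpus; "the" plays the role of a stop word.
DOCS = [
    {"the", "blue", "sky"},
    {"the", "blue", "green", "paint"},
    {"the", "green", "grass"},
    {"the", "dog", "barks"},
    {"the", "blue", "green", "yellow"},
    {"the", "car"},
]

def cooccurrence_stats(docs):
    """Count in how many documents each word and each word pair appears."""
    word_count = Counter()
    pair_count = Counter()
    for doc in docs:
        word_count.update(doc)
        pair_count.update(frozenset(p) for p in combinations(sorted(doc), 2))
    return word_count, pair_count

def measures(a, b, docs):
    """Relatedness of b to a: P(b | a), lift, and Jaccard."""
    n = len(docs)
    word_count, pair_count = cooccurrence_stats(docs)
    both = pair_count[frozenset((a, b))]
    cond = both / word_count[a]               # P(b | a), asymmetric
    lift = cond / (word_count[b] / n)         # P(b | a) / P(b)
    jaccard = both / (word_count[a] + word_count[b] - both)  # symmetric
    return {"conditional": cond, "lift": lift, "jaccard": jaccard}

print(measures("blue", "the", DOCS))    # conditional 1.0, lift 1.0
print(measures("blue", "green", DOCS))  # lower conditional, but lift above 1
```

On this toy data the conditional probability ranks "the" above "green" for "blue", while lift reverses the order, which is the stop-word effect described above.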

Then you might consider relations that appear only within a short distance of the base word. You can (and should) consider relations based on general corpora (e.g., Wikipedia) and on user-specific ones (e.g., the user's emails).
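Restricting co-occurrence to a short distance could look like the sketch below, which counts only the words within a fixed number of tokens of the base word. The function name, the window size of 2 and the sample sentence are assumptions made for the example.

```python
from collections import Counter

def window_cooccurrences(tokens, base, window=2):
    """Count words appearing within `window` tokens of each occurrence of `base`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != base:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

# Invented sample text, split on whitespace for simplicity.
TOKENS = "the blue sky over the green field looked blue and calm".split()

# "green" is in the same sentence but too far from either "blue" to be counted.
print(window_cooccurrences(TOKENS, "blue", window=2))
```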

Very soon you will have plenty of relatedness measures, all of them good and each with some advantage over the others.

In order to combine such measures, I like to reduce the problem to a classification problem.

You should build a data set of pairs of words and label each pair with an "is related" flag.
In order to build a large labeled dataset you can:

Use sources of known related words (e.g., good old Wikipedia categories) for positives
Treat pairs not known to be related as negatives, since most such pairs are indeed unrelated.

Then use all the measures you have as features of the pairs.
Now you are in the domain of supervised classification.
Build a classifier on the data set, evaluate it according to your needs, and you get a similarity measure that fits them.
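As a minimal sketch of this last step, the code below trains a plain logistic regression with stochastic gradient descent on a handful of word pairs, each described by two relatedness measures. The feature values, labels, learning rate and epoch count are all invented for illustration; in practice you would likely use a library classifier and far more data.

```python
import math

# Hypothetical training set: ([lift, jaccard], is_related) per word pair.
# All numbers are made up for the example.
TRAIN = [
    ([2.5, 0.60], 1),
    ([1.8, 0.45], 1),
    ([2.1, 0.50], 1),
    ([0.9, 0.05], 0),
    ([1.0, 0.10], 0),
    ([0.7, 0.02], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relatedness(x, weights, bias):
    """Predicted probability that a pair with features x is related."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic(data, lr=0.1, epochs=3000):
    """Fit logistic regression by per-example gradient descent from zero weights."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            err = relatedness(x, weights, bias) - y  # gradient of log loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

weights, bias = train_logistic(TRAIN)
# The output probability is the combined similarity measure for a new pair.
print(relatedness([2.2, 0.55], weights, bias))
print(relatedness([0.8, 0.03], weights, bias))
```

The predicted probability can then be used directly as the "distance" for ranking candidate words.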
