Clustering together words that appear together while down weighting words that appear too often

Question

I was wondering if I could get some help finding a good model for the problem I have. I have a data set where each observation is a set of words that go together. So for example, it could be:
Observation 1: {car, pizza, exhaust, engine}
Observation 2: {car, pizza, engine}
Observation 3: {food, pizza, chips}
.
.
.
Observation x: {balloons, air, pizza}
Observation x + 1: {car, exhaust}
I am trying to find a model that, given a word (for example, "car"), returns the words most commonly used with it. One way to do this is cosine similarity, but there is an additional constraint I am trying to handle. In the example above, "pizza" shows up in a large share of the observations. Because it co-occurs with so many observations and topics, I don't want it to be one of the related words that get shown, or at the very least I want to decrease the probability that it gets selected via some sort of weighting method.
Essentially, I am looking for words that go together often with each other but don't go together universally with many other words (if that makes sense).
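To make the problem concrete, here is a minimal sketch (Python, standard library only) of the naive approach: just count how often each word shares an observation with the query word. The list only contains the five observations written out above (the elided ones are left out), and the function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy data: only the five observations spelled out in the question.
observations = [
    {"car", "pizza", "exhaust", "engine"},
    {"car", "pizza", "engine"},
    {"food", "pizza", "chips"},
    {"balloons", "air", "pizza"},
    {"car", "exhaust"},
]

def cooccurrence_counts(observations):
    """Count, for every ordered pair of words, how many observations contain both."""
    pair_counts = Counter()
    for obs in observations:
        for a, b in combinations(sorted(obs), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return pair_counts

def raw_neighbors(word, observations):
    """Rank the words that co-occur with `word` by raw co-occurrence count."""
    counts = cooccurrence_counts(observations)
    neighbors = Counter({b: n for (a, b), n in counts.items() if a == word})
    return neighbors.most_common()

print(raw_neighbors("car", observations))
```

On this toy data "pizza" gets the same raw count with "car" as "engine" and "exhaust" do, which is exactly the behavior I want to avoid.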
Any models you guys have in mind? I would really appreciate the help! I have thought of doing something like a Zipf's-law-style weighting, where the most common words are down-weighted in inverse proportion to how frequently they show up, but I am wondering whether there are machine learning methods that are built for this already!
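As a rough sketch of the kind of frequency-based down-weighting I have in mind, the snippet below scores each neighbor with pointwise mutual information (PMI), which divides the co-occurrence probability by how common each word is on its own. PMI is my assumption about what such a weighting could look like, not necessarily the right model; the data is again only the five observations written out above, and `pmi_neighbors` is a made-up name.

```python
import math
from collections import Counter

# Toy data: only the five observations spelled out in the question.
observations = [
    {"car", "pizza", "exhaust", "engine"},
    {"car", "pizza", "engine"},
    {"food", "pizza", "chips"},
    {"balloons", "air", "pizza"},
    {"car", "exhaust"},
]

def pmi_neighbors(word, observations):
    """Rank co-occurring words by PMI(word, w) = log(P(word, w) / (P(word) * P(w))).

    Probabilities are estimated as fractions of observations, so a word that
    appears almost everywhere (large P(w)) is penalized.
    """
    n = len(observations)
    # Number of observations containing each word.
    doc_freq = Counter(w for obs in observations for w in obs)
    # Number of observations containing both `word` and w.
    co_freq = Counter(
        w for obs in observations if word in obs for w in obs if w != word
    )
    scores = {
        w: math.log((c / n) / ((doc_freq[word] / n) * (doc_freq[w] / n)))
        for w, c in co_freq.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for neighbor, score in pmi_neighbors("car", observations):
    print(f"{neighbor}: {score:.3f}")
```

With this scoring, "engine" and "exhaust" come out with a positive score for "car", while "pizza" gets a negative one and drops to the bottom, even though all three have the same raw co-occurrence count.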
Thank you for any responses!
