
Word analysis in Python

Data Science, asked on December 12, 2020

I have a list of documents which look like this:

["Display is flickering"]
["Battery charger is broken"]
["Hard disk is making noises"]

These documents are just free text.
I have processed them with tokenization, lemmatization, and stop-word removal, and now I want to assign tags based on a list of words. Example:

{"#display":["display","screen","lcd","led"]}
{"#battery":["battery","power cord","charger","drains"]}
{"#hard disk":["hard disk","performance","slow"]}

After text normalization I have:

["Display is flickering"] -> ["display","flicker"]

What technique is recommended to compare the document ["display","flicker"] with my dictionary of words and return the value that matches best?
In this case I would like:

["display","flicker"] = "#display":"display"
["battery","charger","broke"] = "#battery":"charger"

Basically, it compares the tokens of document A with a list B of other documents and returns the document in list B with the most matches in common.

I’m using TF (term frequency), but I want to know whether there are other techniques or code samples I could use.
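
For concreteness, here is a minimal sketch of the overlap matching I mean, in plain Python (the names are illustrative, and multi-word entries such as "hard disk" would need extra phrase handling):

    # Illustrative tag dictionary and overlap scoring (not a full solution).
    tag_words = {
        "#display": ["display", "screen", "lcd", "led"],
        "#battery": ["battery", "power cord", "charger", "drains"],
        "#hard disk": ["hard disk", "performance", "slow"],
    }

    def best_tag(tokens):
        # Find which tag's word list shares the most words with the tokens.
        matches = {
            tag: [w for w in words if w in tokens]
            for tag, words in tag_words.items()
        }
        tag, matched = max(matches.items(), key=lambda kv: len(kv[1]))
        return tag, matched

    print(best_tag(["display", "flicker"]))           # ('#display', ['display'])
    print(best_tag(["battery", "charger", "broke"]))  # ('#battery', ['battery', 'charger'])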

2 Answers

You can use word embeddings to compare whole phrases. I am aware of two such models: Google's word2vec and Stanford's GloVe. Now, word embeddings work best with, well, words. However, you could concatenate the words in each phrase into single tokens and re-train the models. Afterwards, you could calculate their similarity (say, with cosine similarity) and see how semantically similar your whole phrases are.
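
As a toy sketch of this idea, assuming gensim 4.x is installed and that multi-word phrases are pre-joined into single tokens (e.g. "hard disk" becomes "hard_disk"):

    # Illustrative corpus of already-normalized documents.
    from gensim.models import Word2Vec

    corpus = [
        ["display", "flicker"],
        ["screen", "flicker", "display"],
        ["battery", "charger", "broke"],
        ["hard_disk", "make", "noise"],
    ]

    # Train a small word2vec model (vector_size/epochs are illustrative).
    model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=50)

    # Cosine similarity between two tokens' embeddings.
    print(model.wv.similarity("display", "screen"))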

Hope this helps.

Correct answer by Belphegor on December 12, 2020

What you are trying to do is called multiclass and multilabel text classification. Check the tutorials here.
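
For instance, a minimal multilabel sketch with scikit-learn, using TF-IDF features plus one binary classifier per tag (the texts and tag sets below are toy examples):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    texts = [
        "display is flickering",
        "battery charger is broken",
        "hard disk is making noises",
        "screen went black and battery drains fast",
    ]
    tags = [["#display"], ["#battery"], ["#hard disk"], ["#display", "#battery"]]

    # Encode the tag sets as a binary indicator matrix.
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(tags)

    # Fit one logistic regression per tag on TF-IDF features.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

    # Predict tags for a new document.
    pred = clf.predict(vectorizer.transform(["the display keeps flickering"]))
    print(mlb.inverse_transform(pred))  # e.g. [('#display',)]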

Answered by Diego on December 12, 2020
