
Word analysis in Python

Data Science, asked on December 12, 2020

I have a list of documents which look like this:

["Display is flickering"]
["Battery charger is broken"]
["Hard disk is making noises"]

These documents are just free text.
I have processed them with tokenization, lemmatization, and stop-word removal, and now I want to assign tags based on a list of words. Example:

{"#display":["display","screen","lcd","led"]}
{"#battery":["battery","power cord","charger","drains"]}
{"#hard disk":["hard disk","performance","slow"]}

After text normalization I have:

["Display is flickering"] -> ["display","flicker"]

What technique is recommended to compare the document ["display","flicker"] with my dictionary of words and return the value that matches best?
In this case I would like:

["display","flicker"] = "#display":"display"
["battery","charger","broke"] = "#battery":"charger"

Basically, it compares the tokens of document A with a list B of other documents and returns the document in list B with the most matches in common.

I’m using TF (term frequency), but I want to know whether there are other techniques or code samples I could use.
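
For concreteness, here is a minimal sketch of the overlap matching I mean, in plain Python (the names are illustrative, and multi-word entries such as "hard disk" would need extra phrase handling):

    # Illustrative tag dictionary and overlap scoring (not a full solution).
    tag_words = {
        "#display": ["display", "screen", "lcd", "led"],
        "#battery": ["battery", "power cord", "charger", "drains"],
        "#hard disk": ["hard disk", "performance", "slow"],
    }

    def best_tag(tokens):
        # Find which tag's word list shares the most words with the tokens.
        matches = {
            tag: [w for w in words if w in tokens]
            for tag, words in tag_words.items()
        }
        tag, matched = max(matches.items(), key=lambda kv: len(kv[1]))
        return tag, matched

    print(best_tag(["display", "flicker"]))           # ('#display', ['display'])
    print(best_tag(["battery", "charger", "broke"]))  # ('#battery', ['battery', 'charger'])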

2 Answers

You can use word embeddings to compare whole phrases. I am aware of two such models: Google's word2vec and Stanford's GloVe. Now, word embeddings work best with, well, words. However, you could concatenate the words in each phrase into single tokens and re-train the models. Afterwards, you could calculate their similarity (say, with cosine similarity) and see how semantically similar your whole phrases are.
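
As a toy sketch of this idea, assuming gensim 4.x is installed and that multi-word phrases are pre-joined into single tokens (e.g. "hard disk" becomes "hard_disk"):

    # Illustrative corpus of already-normalized documents.
    from gensim.models import Word2Vec

    corpus = [
        ["display", "flicker"],
        ["screen", "flicker", "display"],
        ["battery", "charger", "broke"],
        ["hard_disk", "make", "noise"],
    ]

    # Train a small word2vec model (vector_size/epochs are illustrative).
    model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=50)

    # Cosine similarity between two tokens' embeddings.
    print(model.wv.similarity("display", "screen"))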

Hope this helps.

Correct answer by Belphegor on December 12, 2020

What you are trying to do is called multiclass and multilabel text classification. Check the tutorials here.
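
For instance, a minimal multilabel sketch with scikit-learn, using TF-IDF features plus one binary classifier per tag (the texts and tag sets below are toy examples):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    texts = [
        "display is flickering",
        "battery charger is broken",
        "hard disk is making noises",
        "screen went black and battery drains fast",
    ]
    tags = [["#display"], ["#battery"], ["#hard disk"], ["#display", "#battery"]]

    # Encode the tag sets as a binary indicator matrix.
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(tags)

    # Fit one logistic regression per tag on TF-IDF features.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

    # Predict tags for a new document.
    pred = clf.predict(vectorizer.transform(["the display keeps flickering"]))
    print(mlb.inverse_transform(pred))  # e.g. [('#display',)]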

Answered by Diego on December 12, 2020
