TransWikia.com

How to deal with combined forms of words in natural language processing?

Data Science Asked by Syed Mohsin Karim on July 21, 2021

This problem concerns natural language processing in general, independent of any specific language.

Some combinations of words carry a meaning of their own as a unit, like North America, South Asia, Albert Einstein, etc.
Some problems I faced:

  • Tokenization would split these words apart too, so the combined meaning is lost.
  • N-grams produce garbage and meaningless combinations as well, and the bag-of-words vocabulary grows.
  • TF-IDF: the first name and last name get different scores, and the number of feature words increases. For example, South and Africa each get their own score, and with n-grams, scores are generated for irrelevant word combinations too. Ignoring the low-scoring terms helps somewhat, but questions remain: how many n-grams should be generated, given that they increase processing time and corpus size, and how should first, middle, and last names be handled? In general, many irrelevant combinations still end up high in the score list.
  • Named entity recognition mostly relies on corpora of full names, which means I need a large corpus with proper labels to recognize these names. New names appear every day, so those remain unhandled.
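
To make the first two points concrete, here is a minimal sketch in Python. The two English sentences and the naive whitespace tokenizer are only illustrative assumptions, not my real data or pipeline. It shows a name being split into separate tokens, and how adding bigrams nearly doubles the feature vocabulary while most of the new entries are not meaningful units.

```python
from collections import Counter

PUNCT = ".,;:!?"

def tokenize(text):
    # Naive whitespace tokenizer (an assumption for illustration):
    # lowercases each word and strips surrounding punctuation.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def ngrams(tokens, n):
    # All contiguous windows of n tokens, joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy documents, invented for this example.
docs = [
    "Albert Einstein visited North America.",
    "South Asia and North America signed a trade agreement.",
]

tokenized = [tokenize(d) for d in docs]

# Vocabulary of single words versus vocabulary of bigrams.
unigram_vocab = Counter(t for doc in tokenized for t in doc)
bigram_vocab = Counter(b for doc in tokenized for b in ngrams(doc, 2))

# "north america" is split into two unrelated tokens by the tokenizer.
print(tokenized[0])

# Feature count with unigrams only, then with unigrams plus bigrams.
print(len(unigram_vocab), len(unigram_vocab) + len(bigram_vocab))

# Useful and useless bigrams sit side by side in the same vocabulary.
print(sorted(bigram_vocab))
```

In the printed bigram list, entries such as "north america" and "albert einstein" appear next to ones like "einstein visited" and "asia and", which is exactly the noise I want to avoid.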

I want an optimized solution for handling this type of issue.

Note: I am working on a language that doesn't have a rich corpus like English does.
