Data Science Asked by adihere on August 8, 2020
I have categorized 800,000 documents into 500 categories using Mahout's topic modelling.
Instead of representing each topic by its top 5 or 10 words, I want to infer a generic name for the group using an existing algorithm.
For the time being, I have used the following algorithm to arrive at the name for the topic:
For each topic
Please suggest an approach to arrive at more relevant names for the topics.
If you don't want to dig deep into NLP for this task, I suggest generating the set of most frequent n-grams (of length 2 to 5) from your documents and finding the most distinctive n-grams for each category. Use the TF*IDF metric as a measure of the importance of a particular n-gram (normalizing the measure by word count), and select those n-grams that are used in a particular category and are not (or only rarely) used in the others.
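A minimal sketch of that idea in plain Python, for illustration only. It treats each category as one large document, which is an assumption on my part; the function names (`ngrams`, `label_categories`), the whitespace tokenizer, and the choice to normalize by the category's total n-gram count are all hypothetical, not something taken from Mahout.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=2, n_max=5):
    """Yield every word n-gram of length n_min..n_max as a space-joined string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def label_categories(docs_by_category, n_min=2, n_max=5, top_k=3):
    """Return the top_k n-grams per category, ranked by TF*IDF.

    Assumption: each category is scored as a single document. TF is the
    n-gram count divided by the category's total n-gram count; IDF is
    log(number of categories / number of categories containing the n-gram).
    """
    # Count n-grams per category (naive lowercase + whitespace tokenizer).
    counts = {}
    for cat, docs in docs_by_category.items():
        c = Counter()
        for doc in docs:
            c.update(ngrams(doc.lower().split(), n_min, n_max))
        counts[cat] = c

    # In how many categories does each n-gram occur?
    n_cats = len(counts)
    cat_freq = Counter()
    for c in counts.values():
        cat_freq.update(c.keys())

    labels = {}
    for cat, c in counts.items():
        total = sum(c.values()) or 1
        scores = {
            g: (cnt / total) * math.log(n_cats / cat_freq[g])
            for g, cnt in c.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda g: (-scores[g], g))
        labels[cat] = ranked[:top_k]
    return labels


if __name__ == "__main__":
    toy = {
        "sports": ["the football match ended", "the football match started"],
        "finance": ["the stock market fell", "the stock market rose"],
    }
    print(label_categories(toy, n_min=2, n_max=2, top_k=2))
```

An n-gram that appears in every category gets an IDF of zero and drops to the bottom of the ranking, which is the "not (or rarely) used in others" part. With 800,000 documents you would probably want to prune rare n-grams before scoring.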
Answered by chewpakabra on August 8, 2020
I can suggest several papers on this topic:
You can find more by looking at their citations.
Answered by Emre on August 8, 2020
You might try averaging the word vectors of the top N words in a topic and then using cosine similarity to find the closest word in the corpus?
Just a quick and dirty idea...
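Roughly, in code, it could look like the sketch below. The embeddings here are a made-up two-dimensional toy table; in practice you would load pretrained vectors (word2vec, GloVe, or similar). The function names (`mean_vector`, `cosine`, `name_topic`) and the `exclude_top_words` option are hypothetical, just for illustration.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def name_topic(top_words, embeddings, exclude_top_words=False):
    """Return the vocabulary word whose vector is closest to the topic centroid.

    top_words: the top N words of one topic.
    embeddings: dict mapping word -> vector (assumed to cover the corpus vocabulary).
    exclude_top_words: if True, never return one of the topic's own top words.
    """
    vecs = [embeddings[w] for w in top_words if w in embeddings]
    if not vecs:
        return None  # none of the topic words has a vector
    centroid = mean_vector(vecs)
    skip = set(top_words) if exclude_top_words else set()
    candidates = (w for w in embeddings if w not in skip)
    return max(candidates, key=lambda w: cosine(embeddings[w], centroid), default=None)


if __name__ == "__main__":
    # Toy vectors, invented for this example only.
    toy = {
        "football": [0.9, 0.1],
        "tennis": [0.8, 0.3],
        "sport": [0.85, 0.2],
        "bank": [0.1, 0.9],
        "stock": [0.2, 0.8],
        "finance": [0.15, 0.85],
    }
    print(name_topic(["football", "tennis"], toy))
    print(name_topic(["bank", "stock"], toy))
```

Scanning the whole vocabulary like this is linear in its size per topic, which is fine for 500 topics; for a very large vocabulary an approximate nearest-neighbour index would be the usual next step.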
Answered by CpILL on August 8, 2020
A few ideas you'll often see..
Answered by Learning stats by example on August 8, 2020