TransWikia.com

Discovering important topics in corpus of text using metadata and text content

Data Science Asked on June 14, 2021

I am working on a system to classify documents as important or non-important. I have a large sample set of 200,000 pre-labelled documents, and using Naive Bayes I have achieved 95% accuracy on an 80:20 training:test split. This result is very encouraging, and I am confident I can tune the system to do even better.
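The setup described above can be sketched as follows. This is a minimal illustration assuming scikit-learn (the question does not name a library), and the toy corpus and labels below are invented stand-ins for the 200,000-document sample:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real system would load 200,000 documents.
docs = [
    "quarterly earnings report attached",
    "urgent contract renewal deadline",
    "board meeting minutes for review",
    "compliance audit findings summary",
    "office party this friday",
    "cafeteria menu update",
    "parking lot repaving notice",
    "newsletter signup reminder",
]
labels = ["important"] * 4 + ["non-important"] * 4

# 80:20 training:test split, as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=0
)

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Any comparable vectorizer/classifier pairing would fit the same pattern; the pipeline object keeps the vectorizer and model in sync between training and prediction.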

However, the system must remain usable for an indefinite period. This raises the issue that the content of the documents being classified, and what constitutes important vs. non-important, will change over time (i.e. concept drift), so the model must evolve. Thus I need a way to retrain the model without using pre-labelled data. I have only been able to come up with one potential method for this, and it didn't work.

The method I tried was using LDA topic modelling to build a set of topics from the corpus. I then attempted to find one subset of topics that fit the important documents and another subset that fit the non-important documents. Unfortunately this did not work: the content of the important and non-important documents is too similar to produce clearly distinct topics.
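The LDA attempt above can be sketched like this, assuming scikit-learn's `LatentDirichletAllocation` (the question does not say which implementation was used); the toy corpus is for illustration only:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; in practice this would be the full corpus.
docs = [
    "contract renewal deadline approaching",
    "audit findings and compliance report",
    "office party and cafeteria menu",
    "parking notice and newsletter signup",
]

# LDA operates on raw term counts, not TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)

# Fit a two-topic model; each document gets a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_docs, n_topics)
```

The idea would then be to inspect `doc_topics` and check whether important documents concentrate on one subset of topics, which is exactly where the approach broke down when the two classes share vocabulary.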

Does anyone have a suggestion for how I could achieve this or any resources that could point me in the right direction?

One Answer

There is no way to keep a supervised model updated as the data distribution changes over time without labeling new instances.

Answered by Brian Spiering on June 14, 2021
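Building on that point: if even a small stream of newly labeled instances can be obtained (e.g. by periodically hand-labeling a sample), the model can be folded forward incrementally rather than retrained from scratch. A hedged sketch, assuming scikit-learn, where `MultinomialNB.partial_fit` accepts new batches and `HashingVectorizer` is stateless so every batch maps into the same feature space:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: no fitted vocabulary, so later batches are
# guaranteed to share the feature space of earlier ones.
# alternate_sign=False keeps counts non-negative, as MultinomialNB requires.
vec = HashingVectorizer(n_features=2**12, alternate_sign=False)
clf = MultinomialNB()

# Initial labeled batch (hypothetical examples).
batch1_docs = ["quarterly earnings report", "office party invitation"]
batch1_labels = ["important", "non-important"]
clf.partial_fit(
    vec.transform(batch1_docs),
    batch1_labels,
    classes=["important", "non-important"],  # required on the first call
)

# Later: a small hand-labeled batch reflecting drifted content.
batch2_docs = ["urgent security audit findings", "cafeteria menu update"]
batch2_labels = ["important", "non-important"]
clf.partial_fit(vec.transform(batch2_docs), batch2_labels)

pred = clf.predict(vec.transform(["security audit report"]))
```

This does not remove the need for labels, as the answer notes; it only reduces how many are needed per update cycle.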
