I have a need to classify documents in a set of documents, which grows over time from a small tagged training set. The classification is a binary classification. Training on the tagged set produces good results, but as the set increases in size as does the size of the vocabulary the model degrades. The model is a Boosted NaiveBayes applied to a tfidf representation of the text. Each document is a reasonably sized news report.
What is the state of the art solution for such a problem? I was thinking a semi-supervised approach might be a way forward, tagging new data using the previous model to create a new model but it doesn’t seem successful.
Get help from others!