Data Science Asked by Uttakarsh Tikku on April 30, 2021
I have a corpus of 23,000 documents that need to be classified into 5 different categories. I do not have any labelled data available, just freeform text documents and the label names (yes, one-word labels, not topics).
So I followed a 2-step approach:
I have attempted the following approaches for step 2:
But I haven’t been successful in getting good results for my document classifier. Are there any other methods that can be used to classify the documents?
All help is greatly appreciated.
You could simply encode the documents using BERT and cluster them based on their content, provided the categories are sufficiently different in the kind of content they contain.
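As a rough illustration of that first suggestion, here is a minimal sketch that embeds each document with a pre-trained sentence encoder and clusters the embeddings with k-means. The package choice (sentence-transformers), the model name, and k = 5 (one cluster per expected category) are my assumptions, not something fixed by the approach:

```python
# Minimal sketch: BERT-style document embeddings + k-means clustering.
# Assumes the sentence-transformers and scikit-learn packages are installed.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = ["first document text ...", "second document text ..."]  # your 23,000 docs

# Any BERT-family sentence encoder works; this model name is illustrative.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(documents, show_progress_bar=True)

# k = 5 mirrors the five target categories.
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
cluster_ids = kmeans.fit_predict(embeddings)

for doc, cid in zip(documents, cluster_ids):
    print(cid, doc[:60])
```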
Another approach would be to train a document segmentation model that segments documents based on their semantic structure and then classifies them based on their masked skeletons. This, however, would require a large dataset to train on. Fortunately, you can find one online called PubLayNet. Augment it with a few representative samples of your own documents for better generalization over the test set.
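If you want to experiment with the segmentation route without training from scratch, pre-trained PubLayNet models are available, for example via the layoutparser package. The sketch below is an assumption on my part (layoutparser is not mentioned in the original answer) and requires detectron2 to be installed:

```python
# Hedged sketch: run a PubLayNet-pretrained layout model over a page image
# with layoutparser (assumed package; requires detectron2).
import cv2
import layoutparser as lp

# Model path and label map follow layoutparser's published PubLayNet configs.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

image = cv2.imread("page_1.png")[..., ::-1]  # BGR -> RGB; file path is illustrative
layout = model.detect(image)

# The detected regions form the structural "skeleton" that a downstream
# classifier could consume instead of the raw text.
for block in layout:
    print(block.type, block.coordinates)
```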
I've read about the second approach being used to classify patents, legal documents, research papers, etc., with good results. However, it would take a long time to train.
I'd recommend simply clustering the documents based on their text embeddings (the first approach) and then naming the clusters. If that doesn't work satisfactorily, try the deep learning method for document semantic masking.
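One possible way to do the "naming the clusters" step: inspect the highest-weight TF-IDF terms in each cluster and match them by hand to the five known labels. This continues from the clustering sketch above; documents and cluster_ids are the assumed variables from that snippet:

```python
# Inspect top TF-IDF terms per cluster to help assign the five label names.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(documents)
terms = np.array(vectorizer.get_feature_names_out())

for cid in range(5):
    mask = cluster_ids == cid
    if mask.sum() == 0:
        continue  # skip empty clusters
    # Mean TF-IDF weight of each term across the cluster's documents.
    mean_weights = np.asarray(tfidf[mask].mean(axis=0)).ravel()
    top_terms = terms[mean_weights.argsort()[::-1][:10]]
    print(f"cluster {cid}: {', '.join(top_terms)}")
```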
Answered by tehem on April 30, 2021