Data Science Asked by LGDGODV on September 4, 2021
I am doing a supervised binary text classification task.
I want to classify the texts from site A, site B, and site C.
The in-domain performance looks OK for texts from each site (92%-94% accuracy).
However, if I apply a model trained on texts from one site directly to texts from another site (without fine-tuning), performance drops a lot (a 7%-16% drop in accuracy).
Approaches I have already tried:
Doc2vec embeddings (trained on texts from one site) + logistic regression.
BERT embeddings + logistic regression (using bert-as-a-service to generate the embeddings from Google's pre-trained BERT models).
TF-IDF + logistic regression.
Pre-trained Word2vec embeddings (averaged word embeddings per text) + logistic regression.
None of these approaches transfers well across sites (a minimal sketch of the TF-IDF baseline is shown below).
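For reference, a minimal version of the TF-IDF + logistic regression baseline might look like the sketch below; the toy texts, labels, and the "train on site A, test on site B" split are placeholders for illustration, not the asker's actual data or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Toy placeholders: replace with real texts/labels from site A and site B.
texts_a = ["great product review", "terrible support experience", "love this site", "awful layout"]
labels_a = [1, 0, 1, 0]
texts_b = ["fantastic article here", "horrible comment section"]
labels_b = [1, 0]

# TF-IDF features + logistic regression, trained only on site A.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts_a, labels_a)

# Cross-domain evaluation: apply the site-A model to site-B texts.
print("A -> B accuracy:", accuracy_score(labels_b, model.predict(texts_b)))
```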
I know that some performance downgrade is unavoidable, but I would like to limit it to maybe a 3%-5% drop.
Generally, the task of recognizing one type of text against "anything else" is quite a difficult problem, since there is so much diversity in text that there cannot be any good representative sample of "anything else".
Typically this problem is treated as a one-class classification problem: the idea is for the learning algorithm to capture what represents the positive class only, treating anything else as negative. To my knowledge this is used mostly for author identification and related stylometry tasks. The PAN workshop series offers a good overview of state-of-the-art methods and datasets for these tasks.
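A minimal one-class sketch, assuming scikit-learn's OneClassSVM over TF-IDF features of positive-class texts only (the toy texts are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Train only on positive-class texts (toy placeholders).
positive_texts = [
    "quarterly earnings beat expectations",
    "the company reported strong revenue growth",
    "shares rose after the earnings call",
]
vectorizer = TfidfVectorizer()
X_pos = vectorizer.fit_transform(positive_texts)

# The one-class SVM learns the boundary of the positive class only.
oc_svm = OneClassSVM(kernel="linear", nu=0.1)
oc_svm.fit(X_pos)

# +1 = looks like the positive class, -1 = anything else.
new_texts = ["revenue growth was strong", "my cat chased a laser pointer"]
print(oc_svm.predict(vectorizer.transform(new_texts)))
```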
It is also possible to frame the problem as binary classification, but then one must be very creative with the negative instances in the training set. Probably the main problem with your current approach is this: your negative instances are only "randomly selected among all other topics of the site". This means that the classifier knows only texts from the site on which it is trained, so it has no idea what to do with any new text that doesn't look like anything seen in the training data. One method that has been used to increase the diversity of the negative instances is to automatically generate Google queries from a few random words appearing in one of the positive instances, then download whatever text Google retrieves as negative instances.
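A rough sketch of the query-building step only; the actual retrieval (sending the query to a search API and downloading result pages) depends on whichever service you use and is left as a comment. The texts and the helper function are hypothetical:

```python
import random

# Positive-class training texts (placeholders).
positive_texts = [
    "the new model improves cross-domain accuracy on benchmark datasets",
    "we fine-tuned the classifier on texts from several websites",
]

def random_word_query(text, n_words=3, seed=None):
    """Build a search query from a few random words of one positive instance."""
    rng = random.Random(seed)
    words = [w for w in text.split() if len(w) > 3]  # skip very short words
    return " ".join(rng.sample(words, min(n_words, len(words))))

queries = [random_word_query(t, seed=i) for i, t in enumerate(positive_texts)]
print(queries)
# Each query would then be sent to a search API, and the retrieved pages
# used as additional (more diverse) negative instances.
```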
Another issue with binary classification is the distribution of positive/negative instances: if you train a model with a 50/50 positive/negative split, the model expects by default a 50% chance for each class. This can cause a large bias when the model is applied to a test set that contains mostly negative instances, especially if those don't look like the negative instances seen during training.
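One common correction for this kind of prior mismatch (a sketch assuming a logistic-regression model, not something prescribed in the answer) is to shift the decision scores by the log-odds difference between the training prior and the expected deployment prior:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features/labels with a 50/50 class balance at training time.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.array([0, 1] * 100)

model = LogisticRegression().fit(X, y)

# Suppose the deployment data is expected to be ~90% negative.
train_prior, deploy_prior = 0.5, 0.1
shift = np.log(deploy_prior / (1 - deploy_prior)) - np.log(train_prior / (1 - train_prior))

# Apply the shift to the decision function instead of using predict() directly.
scores = model.decision_function(X) + shift
adjusted_pred = (scores > 0).astype(int)
```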
Finally, be careful about the distinction between semantic topic and writing style, because the features for these two are usually very different: in the former case the stop words are usually removed and the content words (nouns, verbs, adjectives) are important (hence one uses things like TF-IDF). In the latter it's the opposite: stop words and punctuation should be kept (because they are good indicators of writing style), whereas content words are removed because they tend to bias the model toward the topic instead of the style. In stylometry, features based on character n-grams have been shown to perform well... even though it's not very clear why they work!
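As a sketch of the contrast, assuming scikit-learn's TfidfVectorizer: word tokens with stop words removed for topic-oriented features, character n-grams (stop words and punctuation kept) for style-oriented features. The example texts are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "He said, quite firmly, that the results were fine.",
    "The results, he said, were quite fine.",
]

# Topic-oriented features: word tokens, English stop words removed.
topic_vec = TfidfVectorizer(analyzer="word", stop_words="english")

# Style-oriented features: character 3-5 grams, stop words and punctuation kept.
style_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), lowercase=False)

print(topic_vec.fit_transform(texts).shape)
print(style_vec.fit_transform(texts).shape)
```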
Correct answer by Erwan on September 4, 2021