TransWikia.com

What should be the ratio of True vs False cases in a binary classifier dataset?

Data Science Asked by Shreyans Jasoriya on April 12, 2021

I am using a CNN for sentiment analysis of news articles. It is a binary classification with outputs: Interesting & Uninteresting.
In my dataset, there are around 50,000 Uninteresting articles and only about 200 Interesting articles. I know the ratio is badly skewed.

  1. My question is what should be the ratio in such a scenario.
  2. One approach that I want to try is to cluster the Uninteresting news
    articles and take a sample from each cluster for training. Is there
    a better approach?

One Answer

  1. There is no ideal true-to-false ratio. The training data should reflect reality as closely as it can, although you can always drop some negatives to speed up training if the ratio is too skewed. Let me explain with an example. Ad click-through rate (CTR) prediction is as old as the internet, and it is skewed to less than 1% positives versus more than 99% negatives. Yet data scientists often prefer to train on the entire dataset, because many negatives carry information that the model could not find otherwise. A negative may not be as informative as a positive, but it is still somewhat important. There are also approaches where the CTR ratio is artificially rebalanced by sampling when faster training is wanted, and they still work. In your case, positives are about 0.4% of the data (200 out of roughly 50,200), which resembles ad CTR, so you can gather more data to increase the number of positives and better understand what makes an article interesting. If that is not possible, try ensembles, which often improve prediction performance.

  2. Clustering is an unsupervised approach, so you would be losing information (the training labels) by doing so. Besides, sentence embeddings (representations) of one big cluster of negatives and a tiny cluster of positives do not convey information as well as word embeddings that have already been trained on billions of documents.

In addition, running k-means on categorical variables will yield anomalous clusters, because it is meant to be used with continuous variables.

Therefore, you should cluster meanings together using high-dimensional embeddings or representations. This has been explored for word meanings, but for sentences or articles a vector representation is more complicated to implement. One possible approach is the Word Mover's Distance, and there are many others worth searching for. In addition, a non-linear method will probably yield better results than k-means with the embeddings approach; note that t-SNE, strictly speaking, is a non-linear dimensionality-reduction technique rather than a clustering algorithm, so it would be used to project the embeddings before clustering them.
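To make the rebalancing by sampling from point 1 concrete, here is a minimal sketch of random undersampling of the negatives using only the Python standard library. The function name `undersample_negatives` and the `neg_per_pos` parameter are hypothetical, chosen for illustration, and the toy labels simply mirror the counts given in the question.

```python
import random

def undersample_negatives(labels, neg_per_pos=10, seed=0):
    """Return sorted indices keeping every positive and a random subset of negatives.

    labels: sequence of 0/1 values (1 = Interesting, 0 = Uninteresting).
    neg_per_pos: hypothetical target number of negatives kept per positive.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than actually exist.
    n_keep = min(len(neg_idx), neg_per_pos * len(pos_idx))
    kept_neg = rng.sample(neg_idx, n_keep)
    return sorted(pos_idx + kept_neg)

# Toy labels matching the counts in the question: 200 positives, 50,000 negatives.
labels = [1] * 200 + [0] * 50_000
positive_rate = sum(labels) / len(labels)      # fraction of Interesting articles
train_idx = undersample_negatives(labels, neg_per_pos=10)
```

If you go this way, it is usually safer to undersample only the training split and to evaluate on data with the original ratio, since a model trained on rebalanced data tends to overestimate the probability of the positive class.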

A better approach is:

  1. Use multiple models and compare their performance on this dataset. I have the impression that certain keywords make articles interesting, so a bag of words will still be helpful, even as a starter model.

  2. Use feature engineering. Your model might be overlooking important features, such as article length, reading time, number of paragraphs, ratio of complex words (measured by length), etc. Feature engineering is always important in case you haven't used it yet.

  3. Use pretrained embeddings. CNN and RNN models can use pretrained embeddings such as GloVe, Word2Vec or FastText, so you start from better representations and add other complex layers later in the architecture. This often matters a great deal for predictive performance.

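  4. Use metrics to measure improvement, and rank the predictions to check the articles predicted as most interesting. With about 0.4% positives, accuracy alone is misleading (a model that labels everything Uninteresting is about 99.6% accurate), so prefer precision, recall, or the area under the precision-recall curve.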
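As a rough illustration of the last point, the sketch below computes precision, recall, and precision at k in plain Python, and shows why accuracy says little on data this skewed. The helper names `precision_recall` and `precision_at_k` are hypothetical, and the labels are toy data mirroring the counts in the question, not real results.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scored items."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(y_true[i] for i in ranked[:k]) / k

# Toy labels: 200 Interesting (1) and 50,000 Uninteresting (0) articles.
y_true = [1] * 200 + [0] * 50_000

# A degenerate model that predicts Uninteresting for every article.
all_negative = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
precision, recall = precision_recall(y_true, all_negative)
# accuracy is above 0.99 here, yet recall is 0.0: no Interesting article is found.
```

Precision at k fits the ranking idea directly: sort articles by the model's predicted score, then check how many of the top k are actually Interesting.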

Correct answer by wacax on April 12, 2021
