What should be the ratio of True vs False cases in a binary classifier dataset?

Question

I am using a CNN for sentiment analysis of news articles. It is a binary classification with outputs: Interesting & Uninteresting.
In my dataset, there are around 50,000 Uninteresting articles and only about 200 Interesting articles. I know the ratio is badly skewed.

My question is: what should the ratio be in such a scenario?
One approach that I want to try is to cluster the Uninteresting news
articles and take a sample from each cluster for training. Is there
a better approach?

wacax · Accepted Answer

There is no ideal true-to-false ratio; the dataset should reflect reality as closely as it can. If the ratio is too skewed, you can always remove negatives to speed up training. Let me explain with an example. Ad click-through rate (CTR) prediction is as old as the internet, and it is skewed to less than 1% positives versus more than 99% negatives. Yet data scientists prefer to train on the entire dataset, because many negatives carry information that the model could not find otherwise. A negative may not provide as much information as a positive, but it is still somewhat important. There are also approaches that artificially rebalance CTR data by sampling when you want faster training, and they still work.

In your case, positives are about 0.4% of the data, which resembles ad CTR, so you can gather more data to increase the number of positives and better understand what makes an article interesting. If that is not possible, try ensembles, which often improve predictive performance.
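
As a rough illustration of rebalancing by sampling, here is a minimal sketch that keeps every positive and a random subset of negatives. The function name `downsample_negatives` and the `neg_per_pos` parameter are made up for this example, and the 10:1 ratio is an arbitrary choice, not a recommendation.

```python
import random

def downsample_negatives(labels, neg_per_pos, seed=0):
    """Return sorted indices that keep every positive (label 1) and at
    most neg_per_pos randomly chosen negatives (label 0) per positive."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), neg_per_pos * len(pos))
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)

# Toy labels mirroring the question: 200 interesting, 50,000 uninteresting.
labels = [1] * 200 + [0] * 50_000
positive_rate = sum(labels) / len(labels)  # roughly 0.004, i.e. about 0.4%

# Assumed ratio for illustration only: 10 negatives per positive.
idx = downsample_negatives(labels, neg_per_pos=10)
sampled = [labels[i] for i in idx]
```

Keep in mind that a model trained on the downsampled set sees a different base rate than the real data, so its predicted probabilities are no longer calibrated to the original distribution. Evaluate on data with the original ratio.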

Clustering is an unsupervised approach, so you would be losing information (the training labels) by doing so. Besides, sentence embeddings (representations) learned from one big cluster of negatives and a tiny cluster of positives do not convey information as well as word embeddings that have already been trained on billions of documents.

In addition, running k-means on categorical variables will yield anomalous clusters because it's meant to be used with continuous variables. You can find more information about the topic on the following links:

Kmeans: Whether to standardise? Can you use categorical variables? Is Cluster 3.0 suitable?

My data set contains a number of numeric attributes and one categorical

Kaggle

Why does K-means clustering perform poorly on categorical data? The weakness of the K-means method is that it is applicable only when the mean is defined, one needs to specify K in advance, and it is unable to handle noisy data and outliers.

Therefore, you should use high-dimensional embeddings or representations to cluster meanings together. This has been explored for word meanings, but for sentences or articles a vector representation is more complicated to implement. One possible approach is the Word Mover's Distance, but there are many more; you should google them. In addition, a non-linear method such as t-SNE (strictly a dimensionality-reduction technique, usually applied before clustering or for visualization) combined with the embeddings approach will probably yield better results than running k-means alone.
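
To make the idea of an article-level vector concrete, here is a minimal sketch that averages word vectors and compares articles with cosine similarity. Averaging is a simple baseline swapped in for illustration; it is not the Word Mover's Distance. The 3-dimensional vectors in `TOY_VECTORS` are invented; in practice they would come from pretrained embeddings.

```python
import math

# Hypothetical 3-dimensional word vectors, made up for illustration only.
TOY_VECTORS = {
    "market": [0.9, 0.1, 0.0],
    "stocks": [0.8, 0.2, 0.1],
    "rally": [0.7, 0.3, 0.0],
    "match": [0.1, 0.9, 0.2],
    "goal": [0.0, 0.8, 0.3],
}

def article_vector(text, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

finance_a = article_vector("Stocks rally", TOY_VECTORS)
finance_b = article_vector("Market rally", TOY_VECTORS)
sports = article_vector("Match goal", TOY_VECTORS)
```

With these toy vectors, the two finance snippets end up closer to each other than either is to the sports snippet, which is the property a clustering step would rely on.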

A better approach is:

Use multiple models and compare their performance on this dataset. I have the impression that certain keywords make articles interesting, so a bag of words will still be helpful, even as a starter model.

Use feature engineering. Your model might be overlooking important features, such as article length, reading time, number of paragraphs, ratio of complex words (measured by length), etc. Feature engineering is always important, in case you haven't used it yet.

Use pretrained embeddings. CNN and RNN models can use pretrained embeddings such as GloVe, Word2Vec or FastText, so you start from better representations and can add other complex layers later in the architecture. This is extremely important for improving accuracy.

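Use metrics to measure improvement, preferably ones suited to imbalanced data such as precision and recall, and rank articles by predicted score to check which ones the model considers most interesting.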
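
For the last point, note that with 200 positives out of 50,200 articles, a model that always predicts Uninteresting would reach about 99.6% accuracy, so plain accuracy says little here. Below is a minimal sketch of two ranking-oriented metrics, precision at k and average precision. The function names and the toy scores are assumptions for illustration; established libraries offer tested versions of these metrics.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored items that are positive (label 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k

def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Toy model scores and true labels (1 = Interesting), invented for the example.
scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]

p_at_2 = precision_at_k(scores, labels, k=2)
ap = average_precision(scores, labels)
```

Sorting by score and reading the top of the list is also a quick manual check: if the highest-ranked articles do not look interesting to a human, the metric gains are probably not meaningful.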
