TransWikia.com

Should I go for a 'balanced' dataset or a 'representative' dataset?

Data Science Asked by pnp on February 27, 2021

My ‘machine learning’ task is separating benign Internet traffic from malicious traffic. In the real-world scenario, most (say 90% or more) of Internet traffic is benign. Thus I felt that I should choose a similar data setup for training my models as well. But I came across a research paper or two (in my area of work) which used a “class balancing” approach to training the models, implying an equal number of instances of benign and malicious traffic.

In general, if I am building machine learning models, should I go for a dataset which is representative of the real world problem, or is a balanced dataset better suited for building the models (since certain classifiers do not behave well with class imbalance, or due to other reasons not known to me)?

Can someone shed more light on the pros and cons of both choices, and on how to decide which one to choose?

6 Answers

I would say the answer depends on your use case. Based on my experience:

  • If you're trying to build a representative model -- one that describes the data rather than necessarily predicts -- then I would suggest using a representative sample of your data.
  • If you want to build a predictive model, particularly one that performs well by measure of AUC or rank-order, and you plan to use a basic ML framework (e.g. decision trees, SVM, naive Bayes), then I would suggest you feed the framework a balanced dataset. Much of the literature on class imbalance finds that random undersampling (downsampling the majority class to the size of the minority class) can drive performance gains.
  • If you're building a predictive model but are using a more advanced framework (e.g. something that determines sampling parameters via a wrapper, or a modification of a bagging framework that samples to class equivalence), then I would suggest again feeding it the representative sample and letting the algorithm take care of balancing the data for training.
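The random undersampling mentioned above can be sketched in a few lines of NumPy (function name and labels are illustrative; `1` is taken to be the minority/malicious class):

```python
import numpy as np

def random_undersample(X, y, minority_label=1, seed=0):
    """Downsample the majority class to the size of the minority class."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Keep all minority examples, plus an equally sized random
    # subset of the majority examples.
    keep = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    idx = np.concatenate([minority_idx, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Applied to a 90/10 dataset, this returns a 50/50 training set at the cost of discarding most majority examples.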

Correct answer by DSea on February 27, 2021

There always is the solution to try both approaches and keep the one that maximizes the expected performances.

In your case, I would assume you prefer minimizing false negatives at the cost of some false positives, so you want to bias your classifier against the strong negative prior and address the imbalance by reducing the number of negative examples in your training set.

Then compute the precision/recall, or sensitivity/specificity, or whatever criterion suits you, on the full imbalanced dataset to make sure you haven't ignored a significant pattern present in the real data while building the model on the reduced data.
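Computing precision and recall on the full, representative set needs nothing beyond counting true/false positives and negatives; a minimal self-contained sketch (the positive label is assumed to be the malicious class):

```python
import numpy as np

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (e.g. malicious) class,
    evaluated on the full, imbalanced dataset."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Run this with predictions from the model trained on the reduced set, but with `y_true` from the full dataset, so the scores reflect the operational class ratio.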

Answered by damienfrancois on February 27, 2021

I think it always depends on the scenario. Using a representative dataset is not always the solution. Assume that your training set has 1000 negative examples and 20 positive examples. Without any modification of the classifier, your algorithm will tend to classify all new examples as negative. In some scenarios this is O.K. But in many cases the cost of missing positive examples is high, so you have to find a solution for it.

In such cases you can use a cost-sensitive machine learning algorithm, as is common, for example, in medical diagnosis data analysis.

In summary: Classification errors do not have the same cost!
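One simple way to make a probabilistic classifier cost-sensitive is to move the decision threshold: the Bayes decision rule predicts positive when the expected cost of a false negative exceeds that of a false positive, i.e. when p(positive|x) > C_FP / (C_FP + C_FN). A minimal sketch, with illustrative costs:

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Bayes-optimal threshold on p(positive|x): predict positive when
    the expected cost of saying 'negative' (cost_fn * p) exceeds the
    expected cost of saying 'positive' (cost_fp * (1 - p))."""
    return cost_fp / (cost_fp + cost_fn)

def classify(probs, cost_fp=1.0, cost_fn=10.0):
    """Label each probability with the cost-sensitive decision rule."""
    t = cost_sensitive_threshold(cost_fp, cost_fn)
    return [1 if p > t else 0 for p in probs]
```

With equal costs the threshold is the familiar 0.5; making a missed positive ten times as costly drops it to 1/11, so even fairly unlikely positives are flagged.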

Answered by Pasmod Turing on February 27, 2021

I think there are two separate issues to consider: Training time, and prediction accuracy.

Take a simple example: consider you have two classes that follow a multivariate normal distribution. Basically, you need to estimate the respective class means and class covariances. Now the first thing you care about is your estimate of the difference in the class means, but your performance is limited by the accuracy of the worst-estimated mean: it's no good estimating one mean to the 100th decimal place if the other mean is only estimated to 1 decimal place. So it's a waste of computing resources to use all the data -- you can instead undersample the more common class AND reweight the classes appropriately. (Those computing resources can then be used exploring different input variables, etc.)

Now the second issue is predictive accuracy: different algorithms use different error metrics, which may or may not agree with your own objectives. For example, logistic regression penalizes overall probability error, so if most of your data is from one class, it will tend to improve the probability estimates of that one class (e.g. 90% vs. 95% probability) rather than trying to identify the rare class. In that case, you would definitely want to reweight to emphasize the rare class (and subsequently adjust the estimates [by adjusting the bias term] to realign the probabilities).
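The bias-term adjustment mentioned above can be done after the fact: shift the predicted log-odds by the difference between the operational prior's log-odds and the training prior's log-odds. A minimal sketch (assuming a binary classifier trained on a resampled set with positive fraction `pi_train`):

```python
import math

def adjust_probability(p_balanced, pi_true, pi_train=0.5):
    """Re-align a probability estimated on a resampled training set to
    the true operational prior, by shifting the log-odds (equivalent to
    adjusting the model's bias/intercept term)."""
    logit = math.log(p_balanced / (1 - p_balanced))
    shift = (math.log(pi_true / (1 - pi_true))
             - math.log(pi_train / (1 - pi_train)))
    z = logit + shift
    return 1 / (1 + math.exp(-z))
```

For instance, a score of 0.5 from a model trained on a 50/50 set maps back to 0.1 when the true positive rate is 10%, which is exactly the base rate an uninformative prediction should recover.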

Answered by seanv507 on February 27, 2021

Separate the operational and the training scenarios.

The operational scenario is the one your classifier will be measured on. This is where you should perform well. You should have a dataset that is representative of this scenario.

The training scenario is whatever you are doing in order to build a classifier that will perform well on the operational scenario.

Many times the datasets in both scenarios are of the same nature, so there is no need to distinguish them. For example, if you run an online store, you train on past usage in order to perform well on future usage. However, while training you can use a dataset different from the one that represents the operational scenario. Actually, if you sleep, dream of a classifier, and validate it on your operational scenario (this step should be done after waking up), you are just as good as after going down the usual machine learning path.

The distinction between operational and training scenarios becomes important when the dataset is imbalanced. Most algorithms won't perform well on such a dataset.

So, don't hesitate to use two datasets -- you can use a balanced dataset for training. Once you are done, validate your classifier on the operational dataset.

Answered by DaL on February 27, 2021

I would suggest going with the representative dataset because it serves as a snapshot of the real data. For sampling, you can try stratified sampling, which can help with class-imbalance issues. Common classifiers (decision trees, Bayesian, rule-based) should work with this dataset.
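Stratified sampling draws from each class separately so the sample preserves the original class proportions; a small NumPy sketch (function name is illustrative):

```python
import numpy as np

def stratified_sample(y, frac=0.2, seed=0):
    """Return indices of a sample that preserves each class's proportion."""
    rng = np.random.default_rng(seed)
    idx = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        # Sample the same fraction from every class (at least one example).
        n = max(1, int(round(frac * members.size)))
        idx.append(rng.choice(members, size=n, replace=False))
    return np.concatenate(idx)
```

A 20% stratified sample of a 90/10 dataset of 100 examples yields 18 negatives and 2 positives, keeping the 90/10 ratio intact.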

Answered by akunyer on February 27, 2021
