Data Science Asked by Alex P on November 11, 2020
I have thousands of data sources generating data from similar types of hardware. The different sources produce different dynamics in the datasets, though.
Even though the features are the same, the datasets have very diverse characteristics.
I am working on a multiclass classification problem, trying to see how well specific models can tackle that domain.
The number of classes differs between data sources, so different models need to be built. That means that in the end I have a large number of different models to evaluate. The input is similar, but the number of classes to be predicted at the output is different.
Since this is a multiclass classification problem, I use things like confusion matrices and multiple ROC curves.
Now I am trying to see in more detail what might be causing poor performance in the worst-performing models. Typically the reasons are:
1. not enough measurements
2. heavily imbalanced datasets
3. a combination of 1 and 2
The problem is that I do not have a definition of what an imbalanced dataset is in a multiclass problem. Ideally, if I could use a specific “rule” to label my datasets, I would be able to look at things like the correlation between imbalance and precision.
For an imbalanced dataset with multiple classes, a single threshold is not enough, since it is the distribution of the available measurements across the classes that matters. I have no idea how to handle that.
How would you handle this case?
Thanks a lot for reading this and contributing to this community.
Regards
Alex
The problem with an imbalanced dataset is the problem of generative classifiers that use the prior probability to calculate the predicted label. Because rare labels have a lower prior, they get a lower predicted probability.
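A minimal sketch of that effect, assuming a generative classifier that scores each class as likelihood times prior (Bayes' rule). The function name `posterior` and all numbers are illustrative and not taken from the question.

```python
import numpy as np

def posterior(likelihoods, priors):
    """Bayes' rule: P(y | x) is proportional to P(x | y) * P(y)."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum()

# Hypothetical 3-class case: the sample x is equally likely under every class,
# so the evidence itself does not favour any label.
likelihoods = [0.2, 0.2, 0.2]

# Priors estimated from an imbalanced training set (90% / 8% / 2%).
priors = [0.90, 0.08, 0.02]

post = posterior(likelihoods, priors)
print(post)
```

With uninformative evidence the posterior simply reproduces the priors, so the majority class wins and the rare classes are never predicted unless the likelihood strongly favours them.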
There are several ways to cope with imbalanced datasets:
Answered by Tolik on November 11, 2020
By definition, a balanced dataset will have an equal number of data points in all the classes. All other datasets are deemed imbalanced.
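Since almost no real dataset is perfectly balanced, it may be more useful to score how far a dataset is from that ideal. The sketch below is one possible "rule", not a standard definition from this thread: it assumes the normalized Shannon entropy of the class distribution (1.0 means perfectly balanced, values near 0 mean one class dominates) together with the ratio of the largest to the smallest class. The function name `class_balance` is hypothetical.

```python
import math
from collections import Counter

def class_balance(labels):
    """Return (normalized_entropy, max_min_ratio) for a sequence of labels.

    Only classes that actually occur in `labels` are counted, so a class
    with zero samples is invisible to this measure.
    """
    counts = Counter(labels)
    k = len(counts)
    if k < 2:
        raise ValueError("need at least two classes to measure balance")
    n = sum(counts.values())
    # Shannon entropy of the empirical class distribution.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Dividing by log(k) maps the result to [0, 1] regardless of class count,
    # which makes datasets with different numbers of classes comparable.
    normalized = entropy / math.log(k)
    ratio = max(counts.values()) / min(counts.values())
    return normalized, ratio

balanced = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
skewed = ["a"] * 98 + ["b"] * 1 + ["c"] * 1

print(class_balance(balanced))
print(class_balance(skewed))
```

Because the entropy is normalized by the number of classes, the score can be computed per data source and then correlated with per-model metrics such as precision, even when the sources have different numbers of classes.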
You can very well use an imbalanced dataset to train your ML model as long as the predictions are accurate. If they are not, go for undersampling or oversampling, depending on your use case. This blog post covers it: https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28
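As a rough illustration of oversampling, here is a minimal sketch of naive random oversampling with NumPy: every class is resampled with replacement until it matches the size of the largest class. The function name `random_oversample` and the toy data are assumptions for illustration; this is not the method from the linked blog post.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from this class's own rows.
        picks = rng.integers(0, len(cls_idx), size=target - count)
        parts.append(np.concatenate([cls_idx, cls_idx[picks]]))
    idx = np.concatenate(parts)
    rng.shuffle(idx)  # avoid leaving the rows grouped by class
    return X[idx], y[idx]

# Toy data: 6 rows, class 0 has four samples, classes 1 and 2 have one each.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 2])

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))
```

Resampling of this kind should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.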
Answered by Faiz Kidwai on November 11, 2020