TransWikia.com

Noise Elimination with majority vote filtering

Data Science Asked by Jeuszt on October 5, 2021

I have a dataset with label noise which I want to clean with majority/consensus vote filtering. This means I will divide the data into K folds and train an ensemble of models. Then, using the predictions on the data, I will remove the rows which are misclassified by most of the models (majority voting) or by all of them (consensus voting).
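For concreteness, here is a minimal sketch of the filtering procedure described above, written in plain Python so that it runs without dependencies. Everything in it is an assumption made for illustration: the helper names (`vote_filter`, `nearest_centroid`, `k_nearest`), the toy learners, the synthetic two-cluster data with three deliberately flipped labels, and the choice to score each row only with models that did not see it during training. In practice the toy learners would be replaced by real library estimators.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_centroid(X, y):
    """Toy learner: predict the class whose mean feature vector is closest."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    return lambda x: min(centroids, key=lambda c: sq_dist(x, centroids[c]))


def k_nearest(k):
    """Toy learner factory: plain k-nearest-neighbour vote."""
    def fit(X, y):
        def predict(x):
            nearest = sorted(range(len(X)), key=lambda i: sq_dist(x, X[i]))[:k]
            return Counter(y[i] for i in nearest).most_common(1)[0][0]
        return predict
    return fit


def vote_filter(X, y, learners, n_folds=5, scheme="majority", seed=0):
    """Return the indices of the rows kept after K-fold vote filtering."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    wrong_votes = [0] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for fit in learners:
            predict = fit(X_tr, y_tr)
            # Each model only votes on rows it was not trained on.
            for i in held_out:
                if predict(X[i]) != y[i]:
                    wrong_votes[i] += 1
    # majority: more than half of the models disagree with the given label;
    # consensus: every model disagrees with it.
    threshold = len(learners) if scheme == "consensus" else len(learners) // 2 + 1
    return [i for i in range(len(X)) if wrong_votes[i] < threshold]


# Synthetic, balanced toy data: two well-separated clusters.
rng = random.Random(42)
X = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(40)]
X += [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(40)]
y = [0] * 40 + [1] * 40
flipped = [3, 17, 55]          # inject label noise on purpose
for i in flipped:
    y[i] = 1 - y[i]

learners = [nearest_centroid, k_nearest(3), k_nearest(7)]
kept_majority = vote_filter(X, y, learners, scheme="majority")
kept_consensus = vote_filter(X, y, learners, scheme="consensus")
removed_majority = sorted(set(range(len(X))) - set(kept_majority))
print("removed by majority vote:", removed_majority)
print("rows kept by consensus vote:", len(kept_consensus))
```

With the same folds and models, consensus voting can never remove more rows than majority voting, since a row misclassified by all models is also misclassified by most of them. The toy data here is balanced and trivially separable, so it says nothing about how the filter behaves on an imbalanced dataset.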

I have a few questions on which I can’t find the answers elsewhere:

  • How do I decide which models to use in the ensemble?

  • The dataset is very imbalanced. Do I need to upsample before doing the majority voting?

  • Should I tune the hyperparameters of the different models, or just use the standard settings?

One Answer

I have a few questions on which I can't find the answers elsewhere

It's probably because there is no simple answer to these three questions :)

I doubt there's any state-of-the-art approach. In such cases I simply try to determine the answers to these questions empirically. Basically, I create a list of hyper-parameters that includes the type of algorithm, the algorithm-specific hyper-parameters and any other potentially relevant option. The goal is to determine the optimal combination of values for this set of parameters. If practical, I run all the combinations and select the best one. If not, I use a simple genetic algorithm to find a good combination. Of course, this is suitable only if you have a large enough dataset and if the training/testing process is not too computationally intensive. You also need to be very careful about overfitting, by using cross-validation and re-sampling.
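To illustrate the idea, here is a rough sketch of both strategies in plain Python: exhaustive enumeration of every combination, and a very simple genetic algorithm for when enumeration is too expensive. The search space, the option names and the functions `all_configs`, `exhaustive_search` and `genetic_search` are all made up for the example. In particular, `toy_score` is only a stand-in objective (it counts how many settings match an arbitrary hidden target) so that the code runs; in a real experiment it would be replaced by a cross-validated score for the configuration.

```python
import itertools
import random

# Hypothetical search space: names and values are illustrative only.
SPACE = {
    "algorithm": ["tree", "knn", "logreg"],
    "resampling": ["none", "upsample_minority"],
    "scheme": ["majority", "consensus"],
    "n_folds": [3, 5, 10],
}

# Arbitrary target used by the toy objective below; it is not a recommendation.
TARGET = {"algorithm": "knn", "resampling": "none",
          "scheme": "consensus", "n_folds": 5}


def toy_score(config):
    """Stand-in objective: number of settings that match TARGET.

    Replace with a real evaluation, e.g. a cross-validated metric.
    """
    return sum(config[k] == TARGET[k] for k in config)


def all_configs(space):
    """Yield every combination of option values as a dict."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def exhaustive_search(space, evaluate):
    """Evaluate every combination and return the best one."""
    return max(all_configs(space), key=evaluate)


def genetic_search(space, evaluate, pop_size=8, generations=15,
                   mutation_rate=0.2, seed=0):
    """Very simple genetic algorithm: keep the fitter half, breed the rest."""
    rng = random.Random(seed)
    keys = list(space)
    population = [{k: rng.choice(space[k]) for k in keys}
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=evaluate, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each setting comes from one of the parents.
            child = {k: rng.choice([a[k], b[k]]) for k in keys}
            # Mutation: occasionally resample a setting from the space.
            for k in keys:
                if rng.random() < mutation_rate:
                    child[k] = rng.choice(space[k])
            children.append(child)
        population = parents + children
    return max(population, key=evaluate)


best_exhaustive = exhaustive_search(SPACE, toy_score)
best_genetic = genetic_search(SPACE, toy_score)
print("exhaustive:", best_exhaustive, toy_score(best_exhaustive))
print("genetic:   ", best_genetic, toy_score(best_genetic))
```

The exhaustive search is guaranteed to find the best combination in the space, while the genetic search only evaluates a sample of it and may settle for a slightly worse one. Whatever the real objective is, it should be computed on data that was not used for fitting, otherwise the selected combination is likely to be overfitted.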

Answered by Erwan on October 5, 2021
