Noise Elimination with majority vote filtering

Question

I have a dataset with label noise which I want to clean with majority/consensus vote filtering. This means I will divide the data into K folds and train an ensemble of models. Then, using the predictions on the data, I will remove the rows which are misclassified by most of the models (majority voting) or by all of them (consensus voting).
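
The filtering step described above can be sketched as follows. This is only a minimal illustration in plain Python, assuming each model's prediction for a row was made while that row was in the held-out fold. The function name `noisy_indices` and the toy labels and predictions are made up for the example; the predictions themselves could come from any library.

```python
from typing import Hashable, List, Sequence


def noisy_indices(
    labels: Sequence[Hashable],
    predictions: Sequence[Sequence[Hashable]],
    scheme: str = "majority",
) -> List[int]:
    """Return the indices of rows flagged as noisy.

    predictions[m][i] is the prediction of model m for row i
    (assumed to be made while row i was held out of training).
    """
    n_models = len(predictions)
    if n_models == 0:
        raise ValueError("at least one model's predictions are required")
    if scheme not in ("majority", "consensus"):
        raise ValueError("scheme must be 'majority' or 'consensus'")

    flagged = []
    for i, given_label in enumerate(labels):
        # Count how many models disagree with the given label of row i.
        errors = sum(1 for preds in predictions if preds[i] != given_label)
        if scheme == "majority":
            is_noisy = errors > n_models / 2   # more than half disagree
        else:
            is_noisy = errors == n_models      # every model disagrees
        if is_noisy:
            flagged.append(i)
    return flagged


# Toy data (made up): 4 rows, 3 models.
labels = [0, 1, 1, 0]
predictions = [
    [0, 0, 0, 1],  # model 1
    [0, 0, 1, 0],  # model 2
    [0, 0, 0, 0],  # model 3
]
print(noisy_indices(labels, predictions, "majority"))   # [1, 2]
print(noisy_indices(labels, predictions, "consensus"))  # [1]
```

Consensus voting flags a subset of what majority voting flags, so it removes fewer rows and is the more conservative of the two.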

I have a few questions on which I can't find the answers elsewhere:

How do I decide which models to use in the ensemble?
The dataset is very imbalanced. Do I need to do upsampling as part of the majority vote filtering?
Should I tune the hyperparameters of the different models, or just use the default settings?

Erwan · Answer

"I have a few questions on which I can't find the answers elsewhere"

It's probably because there is no simple answer to these three questions :)

I doubt there's any state-of-the-art approach. In such cases I simply try to determine the answer to these questions empirically. Basically I create a list of hyper-parameters including the type of algorithm, the algorithm-specific hyper-parameters and any other potentially relevant option. The goal is to determine the best combination of values for this set of parameters. If practical, I run all the combinations and select the best one. If not practical, I use a simple genetic algorithm to find a near-optimal combination. Of course this is suitable only if you have a large enough dataset and if the training/testing process is not too computationally intensive. You also need to be very careful about overfitting, which means using cross-validation and re-sampling.
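
As a rough sketch of the "run all the combinations" variant: the function below tries every combination in a small search space and keeps the best-scoring one. The names `exhaustive_search`, `space` and `toy_score` are hypothetical, and so are the option values in `space`. In particular, `toy_score` is a made-up stand-in for whatever cross-validated score you actually measure; its numbers are not real results.

```python
from itertools import product
from typing import Any, Callable, Dict, Sequence, Tuple


def exhaustive_search(
    space: Dict[str, Sequence[Any]],
    evaluate: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Evaluate every combination in `space`; return the best one and its score."""
    names = list(space)
    best_config, best_score = None, float("-inf")
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)  # should be a cross-validated score
        if score > best_score:    # ties keep the first combination seen
            best_config, best_score = config, score
    if best_config is None:
        raise ValueError("the search space contains no combinations")
    return best_config, best_score


# Hypothetical search space: algorithm type plus other relevant options.
space = {
    "algorithm": ["decision_tree", "random_forest"],
    "upsample": [False, True],
    "vote": ["majority", "consensus"],
}


def toy_score(config: Dict[str, Any]) -> float:
    # Made-up numbers, NOT real results: only here so the example runs.
    score = 0.70
    if config["algorithm"] == "random_forest":
        score += 0.05
    if config["upsample"]:
        score += 0.02
    if config["vote"] == "consensus":
        score += 0.01
    return score


best_config, best_score = exhaustive_search(space, toy_score)
print(best_config)
```

The number of combinations is the product of the list lengths (here 2 x 2 x 2 = 8), which is why the exhaustive version stops being practical quickly and a heuristic search such as a genetic algorithm becomes attractive.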
