TransWikia.com

When to consider examples as duplicates in a classification problem?

Data Science Asked by Salvatore on May 21, 2021

I’m training a model for a binary classification problem, and I’m trying to understand whether the model is biased by my data or whether my results are valid.

I have a pandas dataframe as the output of my preprocessing/cleaning/transforming phase.

This dataset is composed of the following columns:
[ID, Column_X, Column_1, Column_2, ..., Column_n, Target]

For a single ID there are multiple examples, and these examples differ only in Column_X.
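A minimal sketch of how that structure could be checked in pandas. The toy values, the two stand-in columns `Column_1` and `Column_2`, and the variable names are assumptions for illustration, not the real dataset.

```python
import pandas as pd

# Toy data with hypothetical values: within each ID, only Column_X changes.
df = pd.DataFrame({
    "ID":       [1, 1, 2, 2, 2],
    "Column_X": [0.1, 0.7, 0.3, 0.4, 0.9],
    "Column_1": [5, 5, 8, 8, 8],
    "Column_2": ["a", "a", "b", "b", "b"],
    "Target":   [0, 0, 1, 1, 1],
})

# Every column except the ID and the one column that is allowed to vary.
other_cols = [c for c in df.columns if c not in ("ID", "Column_X")]

# Count the distinct combinations of the remaining columns for each ID.
distinct_per_id = (
    df.drop_duplicates(subset=["ID"] + other_cols)
      .groupby("ID")
      .size()
)

# True when rows sharing an ID really differ only in Column_X.
ids_vary_only_in_x = bool((distinct_per_id == 1).all())

# Rows that repeat an earlier row once Column_X is ignored.
near_duplicate_rows = int(df.duplicated(subset=["ID"] + other_cols).sum())

print(ids_vary_only_in_x, near_duplicate_rows)
```

A large `near_duplicate_rows` count relative to `len(df)` suggests that a purely random row-level split would place many almost identical rows on both sides of the split.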

When I train the model (after splitting the data into train/validation/test sets), I use all of the columns above as features except ID and Target.

Should I treat examples with the same ID as duplicates? If so, are my results on the test set invalid because the model has probably already seen a very similar example (differing only in Column_X) during training?
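One common way to rule out this kind of leakage is to split by ID rather than by row, so that all examples of a given ID fall on the same side of the split. The sketch below assumes toy data and a hypothetical helper named `split_by_id`; it is an illustration under those assumptions, not a definitive recipe for this dataset.

```python
import numpy as np
import pandas as pd

def split_by_id(df, id_col="ID", test_frac=0.2, seed=0):
    """Hypothetical helper: hold out whole IDs instead of single rows."""
    rng = np.random.default_rng(seed)
    # Shuffle the unique IDs, then reserve a fraction of them for the test set.
    ids = rng.permutation(df[id_col].unique())
    n_test = max(1, int(round(len(ids) * test_frac)))
    in_test = df[id_col].isin(ids[:n_test])
    return df[~in_test], df[in_test]

# Toy data (assumed values): 10 IDs with 3 rows each, differing only in Column_X.
toy = pd.DataFrame({
    "ID":       np.repeat(np.arange(10), 3),
    "Column_X": np.tile([0.1, 0.5, 0.9], 10),
    "Column_1": np.repeat(np.arange(10) * 2, 3),
    "Target":   np.repeat(np.arange(10) % 2, 3),
})

train, test = split_by_id(toy, test_frac=0.2, seed=0)

# ID is used only for grouping; it is dropped, together with Target, from the features.
X_train, y_train = train.drop(columns=["ID", "Target"]), train["Target"]
X_test, y_test = test.drop(columns=["ID", "Target"]), test["Target"]

# No ID appears on both sides of the split.
shared_ids = set(train["ID"]) & set(test["ID"])
print(len(train), len(test), shared_ids)
```

Applying the same helper a second time to `train` would carve out a validation set with the same guarantee. scikit-learn offers `GroupShuffleSplit` and `GroupKFold` for the same purpose. Comparing the test score from a split like this with the score from a row-level split gives a rough indication of how much the near-duplicates were inflating the results.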

I am currently using CatBoost because it gave the best results among the algorithms I tried, which also included Logistic Regression, Random Forest and Decision Tree.
