When should examples be considered duplicates in a classification problem?

Data Science Asked by Salvatore on May 21, 2021

I’m training a model for a binary classification problem, and I’m trying to understand whether my results are valid or whether the model is biased by my data.

I have a pandas dataframe that is the output of my preprocessing/cleaning/transforming phase.

This dataset is composed of the following columns:
[ID, Column_X, Column_1, Column_2, …, Column_n, Target]

For a single ID there are multiple examples, and these examples differ only in Column_X.
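A minimal sketch of how this property could be checked, assuming a toy dataframe with made-up values (the data and the names `other_cols` and `variants_per_id` are illustrative, not from my real pipeline). It counts, for each ID, how many distinct rows remain once Column_X is ignored; a count of 1 means the examples for that ID differ only in Column_X.

```python
import pandas as pd

# Toy stand-in for the preprocessed dataframe (values are made up).
df = pd.DataFrame({
    "ID":       [1, 1, 2, 2, 3, 3],
    "Column_X": [0.1, 0.2, 0.5, 0.6, 0.7, 0.8],
    "Column_1": [10, 10, 20, 20, 30, 31],
    "Target":   [0, 0, 1, 1, 0, 0],
})

# Every column except ID and Column_X.
other_cols = [c for c in df.columns if c not in ("ID", "Column_X")]

# Distinct rows per ID when Column_X is ignored.
variants_per_id = (
    df.drop_duplicates(subset=["ID"] + other_cols)
      .groupby("ID")
      .size()
)

# IDs whose examples differ in something other than Column_X.
ids_with_other_differences = variants_per_id[variants_per_id > 1].index.tolist()
print(ids_with_other_differences)
```

In this toy data only ID 3 has rows that differ in a column other than Column_X.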

When I train the model (after splitting the data into train/validation/test sets), I use all of the above columns as features except ID and Target.

Should I consider the examples with the same ID as duplicates? If so, are my results on the test set invalid, because the model has probably already seen a very similar example (differing only in Column_X) during training?
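To make the concern concrete, here is a rough sketch of a split that keeps all examples of an ID on the same side, so the leakage described above cannot occur. The toy data and the helper `split_by_id` are hypothetical and only for illustration; scikit-learn's `GroupShuffleSplit` offers the same kind of group-aware split if that library is available.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the preprocessed dataframe (values are made up).
df = pd.DataFrame({
    "ID":       [1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
    "Column_X": [0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1],
    "Column_1": [10, 10, 10, 20, 20, 30, 30, 40, 40, 50],
    "Target":   [0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
})

def split_by_id(frame, test_fraction=0.4, seed=0):
    """Hypothetical helper: assign whole IDs to train or test."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(frame["ID"].unique())  # shuffled copy of the IDs
    n_test = max(1, int(round(len(ids) * test_fraction)))
    is_test = frame["ID"].isin(ids[:n_test])
    return frame[~is_test], frame[is_test]

train, test = split_by_id(df)

# With a group-aware split, no ID should appear on both sides.
shared_ids = set(train["ID"]) & set(test["ID"])
print("IDs present in both train and test:", len(shared_ids))

# ID and Target are dropped only after splitting, as in my setup.
X_train = train.drop(columns=["ID", "Target"])
X_test = test.drop(columns=["ID", "Target"])
```

Comparing test metrics from a split like this against those from a plain row-wise random split would show how much of the score depends on having seen the same ID during training.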

I am currently using CatBoost because it gave the best results among the algorithms I tried, which also included Logistic Regression, Random Forest, and Decision Tree.

Tags: catboost, classification, feature engineering, machine learning
