Data Science Asked by Salvatore on May 21, 2021
I’m training a model for a binary classification problem, and I want to understand whether my evaluation is biased by the structure of my data or whether my results are valid.
I have a pandas dataframe as output of my preprocessing/cleaning/transforming phase.
The dataset consists of the following columns:
[ID, Column_X, Column_1, Column_2, ..., Column_n, Target]
A single ID can have multiple examples, and these examples differ only in Column_X.
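To make the structure concrete, here is a minimal pandas sketch that counts the rows per ID and checks that rows sharing an ID really differ only in Column_X. The toy values and the names `rows_per_id` and `ids_with_other_differences` are made up for illustration; only the column names come from the description above.

```python
import pandas as pd

# Toy frame: column names follow the question, the values are invented.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3],
    "Column_X": [0.1, 0.7, 0.3, 0.4, 0.9, 0.5],
    "Column_1": [10, 10, 20, 20, 20, 30],
    "Column_2": ["a", "a", "b", "b", "b", "c"],
    "Target": [0, 0, 1, 1, 1, 0],
})

# How many examples does each ID have?
rows_per_id = df.groupby("ID").size()

# Every column except ID and Column_X should be constant within an ID.
other_cols = [c for c in df.columns if c not in ("ID", "Column_X")]
varies_within_id = df.groupby("ID")[other_cols].nunique().gt(1).any(axis=1)

# IDs whose rows differ in something other than Column_X (expected: none).
ids_with_other_differences = varies_within_id[varies_within_id].index.tolist()

print(rows_per_id)
print(ids_with_other_differences)
```

If the second list is empty, every group of rows with the same ID is identical apart from Column_X.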
When I train the model (after splitting the data into train/validation/test sets), I use all of the above columns as features except ID and Target.
Should I treat the examples with the same ID as duplicates? If so, are my results on the test set invalid because the model has probably already seen a very similar example (differing only in Column_X) during training?
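One common way to rule out this kind of leakage is to split by ID rather than by row, so that all examples of a given ID land in the same partition. The sketch below uses scikit-learn's `GroupShuffleSplit` for that. The toy data, the 40% test size, and the variable names are assumptions for illustration, not part of my actual pipeline.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: two rows per ID that differ only in Column_X (values invented).
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "Column_X": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "Column_1": [10, 10, 20, 20, 30, 30, 40, 40, 50, 50],
    "Target": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})

# Split on groups: test_size is the fraction of IDs (not rows) held out.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["ID"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Sanity check: no ID should appear on both sides of the split.
shared_ids = set(train_df["ID"]) & set(test_df["ID"])

# ID is used only for grouping; it is not passed to the model as a feature.
feature_cols = [c for c in df.columns if c not in ("ID", "Target")]
X_train, y_train = train_df[feature_cols], train_df["Target"]
X_test, y_test = test_df[feature_cols], test_df["Target"]

print(shared_ids)
```

The same idea can be applied a second time to the training part to carve out a validation set. Comparing the test score from a split like this with the score from a plain row-wise split gives a rough sense of how much the near-duplicates inflate the result.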
I am currently using CatBoost because it gave the best results among the algorithms I tried, which also included Logistic Regression, Random Forest and Decision Tree.