Is using samples from the same person in both the training set and the test set considered data leakage?

Question

Suppose a neural network is built for a binary classification problem, such as recognizing whether a face is smiling or not, using a dataset of 1000 people with ten face images per person.
If the dataset is randomly split into a training set and a test set at a ratio of 70:30, there is a high chance that images of the same person will end up in both sets. Is this considered data leakage (train-test contamination)?

Benji Albert · Accepted Answer

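Yes, this is a form of data leakage. The test data should not be linked to the training data in any way. Here the link is the person's identity: the network can score well on the test set partly because it has already seen those faces, not only because it has learned to recognize smiles.
Another way to think of it: if someone tried to replicate your results with their own test set, would your test set have given you an advantage, so that your results are generally better than theirs?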
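One way to avoid the overlap is to split by person rather than by image, so that all ten images of a given person land on the same side. Below is a minimal sketch of that idea in plain Python. The function name `split_by_person`, the `person_ids` list, and the seed are assumptions made for illustration, not something from the question.

```python
import random

def split_by_person(person_ids, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) so that no person appears in both.

    person_ids[i] is the identity of the person shown in image i
    (hypothetical input; the question does not specify a data layout).
    """
    unique_people = sorted(set(person_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_people)

    # Hold out a fraction of the *people*, not a fraction of the images.
    n_test = round(len(unique_people) * test_fraction)
    test_people = set(unique_people[:n_test])

    train_idx = [i for i, p in enumerate(person_ids) if p not in test_people]
    test_idx = [i for i, p in enumerate(person_ids) if p in test_people]
    return train_idx, test_idx

# Dataset shaped like the one in the question: 1000 people, 10 images each.
person_ids = [person for person in range(1000) for _ in range(10)]
train_idx, test_idx = split_by_person(person_ids)

train_people = {person_ids[i] for i in train_idx}
test_people = {person_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_people.isdisjoint(test_people))
```

If you already use scikit-learn, `GroupShuffleSplit` from `sklearn.model_selection` does the same kind of group-aware split when you pass the person IDs as `groups`.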
