Data Science Asked on September 5, 2021
Suppose a neural network is built for a binary classification problem, such as recognizing whether a face is a smiley face or not, using a dataset of 1000 persons where each person has ten images of their face.
If the dataset is randomly split into a train set and a test set at a ratio of 70:30, there is a high chance that face images of the same person will end up in both the train set and the test set. Is this considered data leakage (train-test contamination)?
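Yes, this is a form of data leakage. The testing data should not be linked to the training data in any way. When the same person appears on both sides of the split, the network can score well on the test set partly by recognizing faces it has already seen, rather than by recognizing smiles on new faces.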
Another way to think of it: if someone tried to replicate your results with their own test set, would your test set have given you an advantage, so that your results are generally better than theirs?
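One common way to avoid this kind of leakage is to split by person instead of by image, so that all ten images of a given person land on the same side. The following is only a minimal sketch using the Python standard library; the helper name `split_by_person`, the `person_ids` list, and the fixed seed are assumptions made for illustration and are not part of the original question.

```python
import random


def split_by_person(person_ids, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) so that no person appears in both.

    person_ids[i] is the (hypothetical) identity label of image i.
    The split ratio applies to people, not to individual images.
    """
    unique_people = sorted(set(person_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_people)

    # Hold out a fraction of the people; all of their images go to the test set.
    n_test = round(len(unique_people) * test_fraction)
    held_out = set(unique_people[:n_test])

    train_idx = [i for i, p in enumerate(person_ids) if p not in held_out]
    test_idx = [i for i, p in enumerate(person_ids) if p in held_out]
    return train_idx, test_idx


# Toy stand-in for the dataset in the question: 1000 people, 10 images each.
person_ids = [person for person in range(1000) for _ in range(10)]

train_idx, test_idx = split_by_person(person_ids, test_fraction=0.3)
train_people = {person_ids[i] for i in train_idx}
test_people = {person_ids[i] for i in test_idx}
```

Because every person here has the same number of images, a 70:30 split over people is also a 70:30 split over images. If scikit-learn is available, `GroupShuffleSplit` from `sklearn.model_selection` performs the same kind of group-aware split when the person identifiers are passed as `groups`.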
Correct answer by Benji Albert on September 5, 2021