Data Science Asked by Zannatul Ferdaus on August 26, 2021
I am analysing "Sherlock", a technique for detecting the semantic type of a column. In its training dataset, types with many samples are capped at 15K columns per class, and rare types with fewer than 1K columns per class are excluded. What is the reason behind this? What are the disadvantages of having too many or too few samples of a specific type in the input of a neural network?
Theoretically speaking, there aren't any inherent disadvantages to having too much or too little data per class; it will only be reflected in the overall performance of your model. Based on the Sherlock paper, this is simply a preprocessing choice the authors made. This is their explanation:
Certain types occur more frequently in the VizNet corpus than others. For example, description and city are more common than collection and continent. To address this heterogeneity, we limited the number of columns to at most 15K per class and excluded the 10% types containing less than 1K columns
They did this to reduce the overall class imbalance of their dataset: with heavily skewed classes, a neural network tends to favour the majority classes, while classes with very few examples are hard to learn reliably and add noise to the evaluation.
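For illustration, here is a minimal sketch (not the authors' actual code) of how such per-class capping and filtering could be done with pandas; the DataFrame layout and the "type" column name are assumptions made for the example.

import pandas as pd

MAX_PER_CLASS = 15_000  # cap frequent types at 15K columns per class
MIN_PER_CLASS = 1_000   # drop rare types with fewer than 1K columns

def balance_classes(df: pd.DataFrame, label_col: str = "type",
                    seed: int = 42) -> pd.DataFrame:
    counts = df[label_col].value_counts()

    # Exclude types with fewer than MIN_PER_CLASS samples.
    keep = counts[counts >= MIN_PER_CLASS].index
    df = df[df[label_col].isin(keep)]

    # Downsample frequent types to at most MAX_PER_CLASS samples each.
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=min(len(g), MAX_PER_CLASS),
                                    random_state=seed))
          .reset_index(drop=True)
    )

After this step, no class dominates the training set by more than 15K examples, and every remaining class has at least 1K examples, which keeps the imbalance within bounds the model can handle.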
Answered by Valentin Calomme on August 26, 2021