
How to determine if a dataset is a suitable representation of the context?

Data Science Asked on June 11, 2021

How can I determine whether the data I have collected represents the context well enough? For example, I am working on an object detection system and have been building an image dataset. How can I know whether my dataset represents the task? I know I need to account for instances where the object is close up and far away, but what about contexts or situations I have missed, or not even considered? Is there an art to building datasets?

2 Answers

Before collecting the data, domain knowledge is used to determine the plausible variations the task may present. It is not necessary to capture in collected data every variation the experts point out (some may be synthetically constructed, for example). But domain knowledge does dictate which variations a given task has to account for.

That being said, one then gathers the data, trying to match the plausible variations in the proportions that domain experts dictate. Again, some variations may be synthetically constructed after the fact. The result is a representative dataset, at least with respect to the current domain knowledge.

If the dataset proves inadequate, this means the domain knowledge needs to be updated with new information about the task.
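As a rough illustration of matching variations to expert-dictated proportions, the check can be sketched in Python. Everything here is an assumption made for the example rather than part of the answer: the function name `coverage_gaps`, the variation labels ("close", "far", "occluded"), the target shares, and the 5% tolerance are all hypothetical. It only shows one simple way to flag variations that are under-represented relative to a target.

```python
from collections import Counter


def coverage_gaps(labels, target_share, tolerance=0.05):
    """Return {variation: (observed_share, target_share)} for every
    variation whose observed share falls below target minus tolerance.

    labels       -- one variation tag per image (hypothetical annotation)
    target_share -- expert-chosen share per variation (hypothetical values)
    """
    counts = Counter(labels)
    total = len(labels)
    gaps = {}
    for variation, target in target_share.items():
        observed = counts.get(variation, 0) / total if total else 0.0
        if observed < target - tolerance:
            gaps[variation] = (observed, target)
    return gaps


# Hypothetical dataset: 70 close-up images, 25 far away, 5 occluded.
labels = ["close"] * 70 + ["far"] * 25 + ["occluded"] * 5
# Hypothetical shares a domain expert might ask for.
target_share = {"close": 0.4, "far": 0.4, "occluded": 0.2}

gaps = coverage_gaps(labels, target_share)
for variation, (observed, target) in gaps.items():
    print(f"{variation}: observed {observed:.2f}, target {target:.2f}")
```

Variations flagged this way are candidates for further collection or synthetic construction. A check like this only covers variations that someone has already named, so it does not reveal the situations nobody thought of; those still have to come from updated domain knowledge.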

Correct answer by Nikos M. on June 11, 2021

In general the point of reference is the state of the art: very often there have been people who have built similar datasets in the past, possibly in a different domain or with a different application in mind. Their work (typically published academic papers and/or code) can be used as a baseline: how did they proceed, which problems did they deal with and how, were there any flaws found later with the data, etc.

When the decisions made during the process are supported by the state of the art, you have more arguments to defend the quality of your dataset. Of course it's not a guarantee, but it's a kind of insurance: if it turns out that there's a flaw in your dataset, it can't be held against you unless there was a way to foresee the issue based on similar works.

Answered by Erwan on June 11, 2021
