TransWikia.com

Solutions for Labelling Training Data for Binary Classification Problems

Data Science Asked on February 11, 2021

I have a huge dataset (6 million rows) that I am trying to split 80-20 (holdout method) to train and test my model. The objective is to train, test and validate the model before using it on live data traffic for real-time predictions.

The expected result here is a statement such as "It's not corrupted" with 97% accuracy, produced as the output of a Jupyter notebook or similar.

My question is: are there any alternatives to manually labelling such a big dataset?

By manual labelling, I mean a human (or a group) going through all 6 million rows(!). Also, the input strings do not all have identical contents, so it is hard to just push them through a script or CSV and automate it. But I am trying to understand whether this is the ONLY way.

One Answer

Of course not. Here is one simple possible solution.

Use unsupervised learning, i.e. cluster the data. If the clustering works well, you will see only two groups in your data (matching the binary classification), and your silhouette score will be high. You can then label these groups/clusters automatically.
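A minimal sketch of this idea, assuming the rows are raw strings and that scikit-learn is available. The character n-gram TF-IDF features, the function name `cluster_and_label` and the toy rows are illustrative assumptions, not something taken from the question. The silhouette score is computed on a random sample, since computing it on every pair of 6 million rows would be impractical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def cluster_and_label(rows, random_state=0, sample_size=10_000):
    """Split raw strings into two clusters and return (labels, silhouette)."""
    # Assumption: character n-grams are a reasonable representation for
    # strings whose contents vary from row to row.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    features = vectorizer.fit_transform(rows)

    # Two clusters, because the target problem is binary.
    model = KMeans(n_clusters=2, n_init=10, random_state=random_state)
    labels = model.fit_predict(features)

    # Score a random subset only; the full pairwise computation does not scale.
    score = silhouette_score(
        features,
        labels,
        sample_size=min(sample_size, features.shape[0]),
        random_state=random_state,
    )
    return labels, score


# Hypothetical toy rows standing in for the real dataset.
clean_rows = [f"id={i};status=ok;value={i * 3}" for i in range(10)]
corrupt_rows = [f"#### ERR ???? {i} ####" for i in range(10)]

labels, score = cluster_and_label(clean_rows + corrupt_rows)
print("cluster ids:", labels.tolist())
print("silhouette score:", round(float(score), 3))
```

The cluster ids (0 and 1) are arbitrary, so you still need to inspect a handful of rows from each cluster to decide which one means "corrupted". A silhouette score close to 1 suggests two well-separated groups; a low score suggests the clusters should not be trusted as labels. For the full 6 million rows, `MiniBatchKMeans` from the same module offers the same `fit_predict` interface and is likely a better fit.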

Answered by Noah Weber on February 11, 2021
