Solutions for Labelling Training Data for Binary Classification Problems

Question

I have a huge dataset (6m rows) that I want to split 80-20 (holdout method) to train and test my model. The objective is to train, test, and validate the model before using live data traffic for real-time predictions.
The expected result is something like "It's not corrupted, with 97% accuracy"; how that is produced (a Jupyter notebook, etc.) is an implementation detail.
My question is: are there any alternatives to manually labelling such a big dataset?
By manually labelling, I mean a human (or a group) going through all 6m rows(!). Also, not all input strings have identical contents, so it's hard to just push them through some script/CSV and automate it. But I am trying to understand whether this is the ONLY way.

Noah Weber · Answer

Of course not. Here is one simple possible solution.
Use unsupervised learning. If you do it well and efficiently, you will see only these two groups in your data (binary classification), and your silhouette score will be high. You can then label these groups/clusters automatically.
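
A minimal sketch of that idea in Python with scikit-learn, under some loud assumptions: the rows are raw strings, TF-IDF is an acceptable way to vectorise them, and k-means is the clustering algorithm (the answer does not name one). The function name `cluster_into_two`, its parameters, and the toy strings are hypothetical, made up for illustration. The cluster ids 0 and 1 carry no meaning on their own, so someone still has to inspect a handful of rows from each cluster to decide which one is "corrupted".

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def cluster_into_two(texts, sample_size=10_000, seed=0):
    """Cluster raw strings into two groups and report the silhouette score."""
    # Turn each string into a TF-IDF vector (an assumed featurisation).
    features = TfidfVectorizer().fit_transform(texts)

    # Ask for exactly two clusters, matching the binary problem.
    model = KMeans(n_clusters=2, n_init=10, random_state=seed)
    cluster_ids = model.fit_predict(features)

    # Silhouette cost grows quadratically with row count, so score a sample.
    score = silhouette_score(
        features,
        cluster_ids,
        sample_size=min(sample_size, len(texts)),
        random_state=seed,
    )
    return cluster_ids, score


# Toy stand-in for the real rows: two obviously different kinds of string.
texts = (
    ["status ok record checksum valid",
     "record status ok checksum valid header"] * 10
    + ["corrupted payload truncated bytes garbage",
       "garbage bytes corrupted payload truncated overflow"] * 10
)

cluster_ids, score = cluster_into_two(texts)
print(cluster_ids[:3], cluster_ids[-3:], round(score, 2))
```

A silhouette score close to 1 suggests the two clusters are well separated; a low score suggests the data does not split cleanly into two groups with this featurisation, and the cluster ids should not be trusted as labels. For something on the order of 6m rows, `MiniBatchKMeans` from the same library may be a more practical choice than `KMeans`.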
