Large dataset - ANN

Data Science Asked by tempx on March 21, 2021

I am trying to classify around 400K samples with 13 attributes. I used Python sklearn's SVM package, but it didn't work, and then I learned that SVMs are not well suited to classifying large datasets. I then used the (sklearn) ANN via the following MLPClassifier:

MLPClassifier(solver='adam', alpha=1e-5, random_state=1, activation='relu', max_iter=500)

and trained the system on 200K samples, then tested the model on the remaining ones. The classification worked well. However, my concern is that the system is overtrained or overfit. Can you please guide me on the number of hidden layers and nodes to make sure there is no overfitting? (I have learned that the default implementation has 100 hidden neurons. Is it OK to use the default implementation as is?)
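
A minimal sketch of this setup, assuming the 400K-row feature matrix and labels are in X and y (hypothetical names):

    # Sketch of the described setup: train on 200K samples, test on the rest.
    # X and y are hypothetical placeholders for the actual data.
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=200_000, random_state=1)

    clf = MLPClassifier(solver='adam', alpha=1e-5, random_state=1,
                        activation='relu', max_iter=500)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on the held-out samples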

One Answer

To avoid overfitting while building the model with MLPClassifier:

  1. Use early_stopping = True. This stops training when there is no further improvement on the validation data.
  2. Using the default node size is fine for most cases, unless the data has a large number of features.
  3. Since you have plenty of data, split it into train, validation, and test sets and compare the scores (see the sketch after this list).
  4. Check several metrics (f1_score, precision, recall, etc.). This is especially useful with an imbalanced dataset.
  5. If you are highly concerned about overfitting, you can explore cross_val_predict. The standard deviation of the errors shows how well the model will work on unseen data (see the second sketch below).
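
A sketch of points 1-4 together, assuming the full dataset is in X and y (hypothetical names). Note that early_stopping carves its own validation fraction out of the training data, while the explicit test set stays untouched:

    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import classification_report

    # X and y are hypothetical placeholders for the full dataset.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=1)

    clf = MLPClassifier(solver='adam', alpha=1e-5, activation='relu',
                        max_iter=500, random_state=1,
                        early_stopping=True,       # stop when the validation score stalls
                        validation_fraction=0.1,   # share of training data used for that check
                        n_iter_no_change=10)       # patience before stopping
    clf.fit(X_train, y_train)

    # Point 4: look beyond accuracy - per-class precision, recall and f1.
    print(classification_report(y_test, clf.predict(X_test)))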

There will be more to explore, depending on the data you have.
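
For point 5, a sketch using the closely related cross_val_score helper, which returns one score per fold so the spread (standard deviation) is easy to inspect; X and y are again hypothetical names for the full dataset:

    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    clf = MLPClassifier(solver='adam', alpha=1e-5, activation='relu',
                        max_iter=500, random_state=1, early_stopping=True)

    scores = cross_val_score(clf, X, y, cv=5)  # one accuracy score per fold
    print(scores.mean(), scores.std())         # a small std suggests stable generalization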

Answered by Wickkiey on March 21, 2021
