Large dataset - ANN

Data Science Asked by tempx on March 21, 2021

I am trying to classify around 400K samples with 13 attributes. I used Python sklearn's SVM package, but it didn't work, and then I learned that SVMs are not well suited to classifying large datasets. I then used the (sklearn) ANN via the following MLPClassifier:

MLPClassifier(solver='adam', alpha=1e-5, random_state=1, activation='relu', max_iter=500)

and trained the system on 200K samples, then tested the model on the remaining ones. The classification worked well. However, my concern is that the system is overtrained or overfit. Can you please guide me on the number of hidden layers and nodes to make sure there is no overfitting? (I have learned that the default implementation has 100 hidden neurons. Is it OK to use the default implementation as is?)
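
A minimal sketch of this setup, assuming the 400K-row feature matrix and labels are in X and y (hypothetical names):

    # Sketch of the described setup: train on 200K samples, test on the rest.
    # X and y are hypothetical placeholders for the actual data.
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=200_000, random_state=1)

    clf = MLPClassifier(solver='adam', alpha=1e-5, random_state=1,
                        activation='relu', max_iter=500)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on the held-out samples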

One Answer

To avoid overfitting while building the model with MLPClassifier:

  1. Use early_stopping = True. This stops training when there is no further improvement on the validation data.
  2. Using the default node size is fine for most cases, unless the data has a large number of features.
  3. Since you have plenty of data, split it into train, validation, and test sets and compare the scores (see the sketch after this list).
  4. Check several metrics (f1_score, precision, recall, etc.). This is especially useful with an imbalanced dataset.
  5. If you are highly concerned about overfitting, you can explore cross_val_predict. The standard deviation of the errors shows how well the model will work on unseen data (see the second sketch below).
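
A sketch of points 1-4 together, assuming the full dataset is in X and y (hypothetical names). Note that early_stopping carves its own validation fraction out of the training data, while the explicit test set stays untouched:

    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import classification_report

    # X and y are hypothetical placeholders for the full dataset.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=1)

    clf = MLPClassifier(solver='adam', alpha=1e-5, activation='relu',
                        max_iter=500, random_state=1,
                        early_stopping=True,       # stop when the validation score stalls
                        validation_fraction=0.1,   # share of training data used for that check
                        n_iter_no_change=10)       # patience before stopping
    clf.fit(X_train, y_train)

    # Point 4: look beyond accuracy - per-class precision, recall and f1.
    print(classification_report(y_test, clf.predict(X_test)))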

There will be more to explore, depending on the data you have.
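
For point 5, a sketch using the closely related cross_val_score helper, which returns one score per fold so the spread (standard deviation) is easy to inspect; X and y are again hypothetical names for the full dataset:

    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    clf = MLPClassifier(solver='adam', alpha=1e-5, activation='relu',
                        max_iter=500, random_state=1, early_stopping=True)

    scores = cross_val_score(clf, X, y, cv=5)  # one accuracy score per fold
    print(scores.mean(), scores.std())         # a small std suggests stable generalization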

Answered by Wickkiey on March 21, 2021
