
hyperparameter tuning with validation set

Data Science, asked on December 31, 2020

From what I know, and correct me if I am wrong, using cross-validation for hyperparameter tuning is not advisable when I have a huge dataset. So, in this case it is better to split the data into training, validation and test sets, and then perform the hyperparameter tuning with the validation set.
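
For concreteness, this is the kind of split I mean, built with two calls to train_test_split (the proportions and variable names here are just an example):

from sklearn.model_selection import train_test_split

# first split off the test set, then split the rest into training and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2)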

For the case I am programming I would like to use scikit-learn with the yeast dataset available at http://archive.ics.uci.edu/ml/datasets/yeast, and, for example, tune the number of epochs.

First, I separated my training, validation and test sets by using train_test_split twice, following an answer I saw here. The loss plot I got for 1500 max iterations is the following:

[Figure: loss curve for max_iter=1500]

Then I wanted to use my validation set with a list of different values for the max_iter hyperparameter. The graph I obtained is the following (with some non-convergence warnings for max_iter values below 1500):

[Figure: loss curves for the different max_iter values]

So, here is my first question. It seems that for max_iter=3000 the accuracy is approximately 64%, so I should choose that value for the max_iter hyperparameter; is that correct? I can also see from the graph that the red line for 3000 reaches a lower loss than the other options.

My program so far is the following:

import numpy as np
import pandas as pd
from sklearn import model_selection, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

def readFile(file):
    head=["seq_n","mcg","gvh","alm","mit","erl","pox","vac","nuc","site"]
    # yeast.data is whitespace-separated and has no header row
    f=pd.read_csv(file,delimiter=r"\s+",header=None,names=head)
    return f

def NeuralClass(X,y):
    X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=0.2)
    print (len(X)," ",len(X_train))
    X_tr,X_val,y_tr,y_val=model_selection.train_test_split(X_train,y_train,test_size=0.2)
    mlp=MLPClassifier(activation="relu",max_iter=1500)
    mlp.fit(X_train,y_train)
    print (mlp.score(X_train,y_train))
    plt.plot(mlp.loss_curve_)

    max_iter_c=[500,1000,2000,3000]
    for item in max_iter_c:
        mlp=MLPClassifier(activation="relu",max_iter=item)
        mlp.fit(X_val,y_val)
        print (mlp.score(X_val,y_val))
        plt.plot(mlp.loss_curve_)
    plt.legend(max_iter_c)

def main():
    f=readFile("yeast.data")
    cols=["seq_n","site"]
    X=f.drop(columns=cols)
    y=f["site"]
    NeuralClass(X,y)

if __name__=="__main__":
    main()
    plt.show()

Second question: is my approach valid? I have seen a lot of information on the web and it all points to cross-validation for hyperparameter tuning, but I want to perform it with a validation set.

Any help?

PS: I have tried early stopping and the results are poor compared to the ones obtained with the method I programmed.
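
By early stopping I mean MLPClassifier's built-in option, set up roughly like this (the exact settings I used may differ):

from sklearn.neural_network import MLPClassifier

# holds out validation_fraction of the training data internally and stops
# when the validation score stops improving for n_iter_no_change epochs
mlp_es = MLPClassifier(activation="relu", max_iter=3000,
                       early_stopping=True, validation_fraction=0.1,
                       n_iter_no_change=10)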

Thanks

One Answer

It seems to me that you're manually iterating through the hyper-parameters.

scikit-learn has a number of helper functions that make it easy to iterate through all of the parameters using various strategies: https://scikit-learn.org/stable/modules/grid_search.html#grid-search

Secondly, if I were 'manually' tuning hyper-parameters, I'd split my data into 3: train, test and validation (the names aren't important).

I'd change my hyper-parameters, train the model using the training data, and test it using the test data. I'd repeat this process until I had the 'best' parameters, and then finally run it with the validation data as a sanity check (the scores should be similar).
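
A rough sketch of that manual loop, with placeholder split variables and candidate values:

from sklearn.neural_network import MLPClassifier

best_score, best_iter = -1, None
for n in [500, 1000, 2000, 3000]:
    model = MLPClassifier(activation="relu", max_iter=n)
    model.fit(X_train, y_train)          # always fit on the training data
    score = model.score(X_test, y_test)  # compare candidates on the test data
    if score > best_score:
        best_score, best_iter = score, n

# final sanity check on the validation data, which was never used above
final = MLPClassifier(activation="relu", max_iter=best_iter).fit(X_train, y_train)
print(final.score(X_val, y_val))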

With scikit-learn's helper functions, I'd just split the data into two parts: use GridSearchCV with one part, and then at the end run a sanity check with the second part using the best parameters (stored in the best_estimator_ attribute).

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(activation="relu")

# define parameter sweep
param_grid = { 'max_iter' : [100, 1000, 10000] }

# define grid search
clf = GridSearchCV(mlp, param_grid, cv=5)

# perform search
clf.fit(X, y)

# best estimator
clf.best_estimator_
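
To make the 'two parts' explicit, a self-contained sketch could look like this (the split fraction and estimator settings are arbitrary):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier

# keep a final check set aside, run the grid search on the rest
X_search, X_check, y_search, y_check = train_test_split(X, y, test_size=0.2)

clf = GridSearchCV(MLPClassifier(activation="relu"),
                   {'max_iter': [100, 1000, 10000]}, cv=5)
clf.fit(X_search, y_search)

print(clf.best_params_)
# best_estimator_ is refit on all of X_search, so this score should be
# close to clf.best_score_
print(clf.best_estimator_.score(X_check, y_check))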

For info, every candidate's scores are also available as attributes (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV), which means you can check the result of every CV run if you want to make sure you got the best parameters.
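
For example, the per-candidate scores can be inspected through the cv_results_ attribute:

import pandas as pd

# one row per parameter candidate, with mean/std test scores across the folds
results = pd.DataFrame(clf.cv_results_)
print(results[["param_max_iter", "mean_test_score", "std_test_score", "rank_test_score"]])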

As you can see, you can avoid thinking too hard about whether CV was the right approach by keeping a final validation set as a sanity check.

Answered by fswings on December 31, 2020
