
sklearn GridSearchCV reporting higher accuracy than Pipeline with the same parameters as the Pipeline estimators

Data Science Asked on December 11, 2020

I have pipeline estimators like this:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english')),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=random_state, max_iter=5, tol=None)),
])
text_clf.fit(dataset.data, dataset.target)

Then I evaluate the model accuracy on the test split like this:

predicted = text_clf.predict(twenty_test.data)
mean = np.mean(predicted == twenty_test.target)
print("mean %0.3f" % mean)

I get a score of 0.802.

Then I add GridSearchCV to find the best params, like this:

parameters = {
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

from sklearn.model_selection import GridSearchCV

gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)
gs_clf = gs_clf.fit(dataset.data, dataset.target)

(Note that I am fitting GridSearchCV on the TRAIN data, the same data as the pipeline; when I first fitted it on the TEST data by mistake, the result was the same, though.)

It reports this:

grid search best score 0.868
Best params: clf__alpha: 0.001
Best params: tfidf__use_idf: True

Note that these params are already set on the model, yet the pipeline's score is lower.

Is it because the other parameters I set in the Pipeline are not kept when using GridSearchCV?

Also, how does the grid search know the best parameters if I didn't provide any testing data?

Another problem: when I added more parameters to tune in the grid search and then applied the best ones to the Pipeline, the accuracy barely changed (from 0.802 to 0.805, while the grid search reported 0.867).
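For reference, this is roughly how I apply the best params back onto the pipeline (a minimal sketch; dataset and twenty_test are my train/test splits from above):

# Push the winning params back into the pipeline and re-score on the test split
text_clf.set_params(**gs_clf.best_params_)   # e.g. clf__alpha=0.001, tfidf__use_idf=True
text_clf.fit(dataset.data, dataset.target)
predicted = text_clf.predict(twenty_test.data)
print("test accuracy %0.3f" % np.mean(predicted == twenty_test.target))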

One Answer

The GridSearchCV object in sklearn performs cross-validation on the data you feed it during fit. In your case you specified cv=5: this means GridSearchCV splits your training data into train/validation folds 5 times and reports the mean performance over those 5 folds, which is the 0.868 you see.

You asked how GridSearchCV knows the best parameters without being fed testing data: it takes your training data and splits it into smaller train/validation splits (5 of them, to be exact), scoring each parameter combination on the held-out fold.
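You can verify this on the fitted search object itself; best_score_ is the mean of the per-fold validation scores stored in cv_results_. A quick sketch, using your gs_clf from above:

import numpy as np

# best_score_ is the mean validation accuracy over the 5 internal folds
print("best mean CV score: %0.3f" % gs_clf.best_score_)

# cv_results_ stores every fold's score for every parameter combination;
# best_index_ points at the row for the winning combination
i = gs_clf.best_index_
fold_scores = [gs_clf.cv_results_["split%d_test_score" % k][i] for k in range(5)]
print("per-fold scores:", fold_scores)
print("their mean:      %0.3f" % np.mean(fold_scores))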

You are then evaluating the best model on your test set and getting an accuracy of 0.805. This can happen when the single train/test split you performed puts some of the harder-to-predict samples in the test set. If you performed the same cross-validation that GridSearchCV performed (with cv=5), you might find the average performance to be a little higher and closer to the 0.868 it reported.
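If you want to check this, something along these lines (a sketch reusing the text_clf and dataset from your question) runs the same 5-fold scheme on the pipeline with its parameters fixed:

from sklearn.model_selection import cross_val_score

# 5-fold CV of the pipeline itself, mirroring what GridSearchCV did internally
scores = cross_val_score(text_clf, dataset.data, dataset.target, cv=5, n_jobs=-1)
print("mean CV accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))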

Answered by Oliver Foster on December 11, 2020
