Stacking doesn't improve accuracy

Data Science - Asked by fitting on February 24, 2021

I am trying to build a two-level stacking model to tackle a multiclass classification problem with 8 classes.
My base (level 1) models and their micro F1 scores on the test set are:

  1. Random Forest Classifier (0.51)
  2. XGBoost Classifier (0.54)
  3. LightGBM Classifier (0.54)
  4. Logistic Regression (0.44)
  5. Keras neural network (0.57)
  6. Keras neural network (0.56)

As the level 2 model I use an untuned XGBClassifier.
I use 7-fold cross-validation to produce the meta features for the level 2 model.
The code I use to produce the meta features for the simple classifiers is:

import numpy as np
from sklearn.model_selection import StratifiedKFold

ntrain = X_train.shape[0]
ntest = X_test.shape[0]
seed = 0
nfolds = 7
# shuffle must be enabled for random_state to take effect
kf = StratifiedKFold(n_splits=nfolds, shuffle=True, random_state=seed)

def get_meta(clf, X_train, y_train, X_test):
    # out-of-fold predictions on the training set become the meta feature;
    # a model refit on the full training set produces the test meta feature
    meta_train = np.zeros((ntrain,))
    meta_test = np.zeros((ntest,))

    for i, (train_index, test_index) in enumerate(kf.split(X_train, y_train)):
        X_tr = X_train.iloc[train_index]
        y_tr = y_train.iloc[train_index]
        X_te = X_train.iloc[test_index]

        clf.fit(X_tr, y_tr)
        meta_train[test_index] = clf.predict(X_te)

    clf.fit(X_train, y_train)
    meta_test = clf.predict(X_test)
    return meta_train.reshape(-1, 1), meta_test.reshape(-1, 1)

and the code for the Keras neural networks is:

from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

# class_weights is assumed to be defined elsewhere
def get_meta_keras(clf, X_train, y_train, X_test, epochs=200, batch_size=70, class_weight=class_weights):
    meta_train = np.zeros((ntrain,))
    meta_test = np.zeros((ntest,))

    # integer-encode the labels, then convert to one-hot vectors for Keras
    encoder = LabelEncoder()
    encoder.fit(y_train)
    encoded_Y = encoder.transform(y_train)
    dummy_y = np_utils.to_categorical(encoded_Y)

    for i, (train_index, test_index) in enumerate(kf.split(X_train, y_train)):
        X_tr = X_train.iloc[train_index]
        y_tr = dummy_y[train_index]
        X_te = X_train.iloc[test_index]

        clf.fit(X_tr, y_tr, epochs=epochs, batch_size=batch_size, class_weight=class_weight)
        meta_train[test_index] = clf.predict_classes(X_te)

    clf.fit(X_train, dummy_y, epochs=epochs, batch_size=batch_size, class_weight=class_weight)
    meta_test = clf.predict_classes(X_test)
    return meta_train.reshape(-1, 1), meta_test.reshape(-1, 1)
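
For context, here is a rough sketch of how the meta features are then stacked and fed to the level 2 XGBClassifier. The base-model variable names (rf_clf, xgb_clf, lgbm_clf, lr_clf, nn1, nn2) are placeholders, not the actual objects from my code:

from xgboost import XGBClassifier

# placeholder names for the six base models listed above
base_models = [rf_clf, xgb_clf, lgbm_clf, lr_clf]
keras_models = [nn1, nn2]

meta_train_parts, meta_test_parts = [], []
for model in base_models:
    m_tr, m_te = get_meta(model, X_train, y_train, X_test)
    meta_train_parts.append(m_tr)
    meta_test_parts.append(m_te)
for model in keras_models:
    m_tr, m_te = get_meta_keras(model, X_train, y_train, X_test)
    meta_train_parts.append(m_tr)
    meta_test_parts.append(m_te)

# each get_meta* call contributes one column of class predictions
meta_X_train = np.hstack(meta_train_parts)
meta_X_test = np.hstack(meta_test_parts)

# untuned level 2 model; assumes y_train is integer-encoded (0..7)
level2 = XGBClassifier(random_state=seed)
level2.fit(meta_X_train, y_train)
final_pred = level2.predict(meta_X_test)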

My final micro F1 score is 0.54, which is lower than the best of my base models' scores. My base models are not strongly correlated (corr < 0.55). I tried adding more simple models such as kNN and naive Bayes, but the score dropped even further.
Why doesn't my stacking approach improve the score?

2 Answers

I'm an old man who likes simple things :D

So I would try a few more basic options for the level 2 model:

  • majority voting (it hardly gets simpler than that!)
  • linear regression
  • single decision tree
  • SVM

Apart from the fact that I'm old, there are two reasons why these could be useful:

  • smaller risk of overfitting: if the level 2 model is too complex it tends to overfit, and in my experience overfitting at level 2 carries a high price in performance when stacking learners.
  • scrutiny: one can easily inspect how the predictions are combined.
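
For illustration, here is a minimal sketch of the simplest of these options, majority voting over the stacked class predictions. It assumes the meta features are integer class labels, as produced by the question's get_meta functions, and meta_X_test is an assumed name for the matrix holding one column of test predictions per base model:

import numpy as np

def majority_vote(meta_preds):
    # meta_preds: (n_samples, n_models) array of integer class labels
    # returns the most frequent class per row (ties go to the smaller label)
    return np.apply_along_axis(
        lambda row: np.bincount(row.astype(int)).argmax(), 1, meta_preds)

# final_pred = majority_vote(meta_X_test)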

Answered by Erwan on February 24, 2021

Although there is no data provided, I'll try to share my ideas.

Stacking

We stack models by making predictions on hold-out data sets and then collecting these predictions to form a new data set, on which we fit a new model.

Oftentimes the stacked model (also called the 2nd-level model) will outperform each of the individual models due to its smoothing nature and its ability to highlight each base model where it performs best and discount it where it performs poorly.

Model diversity

But one of the most important things in stacking is model diversity: how different the models are from one another. You should ask yourself what new information each model brings to the table.

The meta model learns where one of your models is good and where another one is bad or fairly weak, so you don't need to worry too much about making every model really strong. What you really need to focus on is what information a model brings, even if it is generally weak; such models contribute new information that the meta model can leverage.

Normally, you introduce diversity in two ways:

  1. By choosing a different algorithm. This makes sense because different algorithms capitalize on different relationships within the data: for example, a linear model will focus on linear relationships, while a non-linear model can better capture non-linear relationships, so their predictions will come out a bit different.

  2. By running the same algorithm on different transformations of the input data, either with fewer features or with a completely different transformation. For example, in one data set you may one-hot encode the categorical features; in another you may just use label encoding, which will probably produce a very different model (see the sketch below).
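
As a rough sketch of the second point: the same algorithm trained on two different encodings of the categorical features will usually produce noticeably different models. The column names and the choice of a random forest here are just illustrative assumptions:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier

cat_cols = ["city", "device"]   # placeholder categorical columns

# Variant A: one-hot encode the categorical features
X_onehot = pd.get_dummies(X_train, columns=cat_cols)

# Variant B: label (ordinal) encode the same features
X_label = X_train.copy()
X_label[cat_cols] = OrdinalEncoder().fit_transform(X_label[cat_cols])

# same algorithm, two representations of the same data: the two models
# will make different errors, which adds diversity for stacking
rf_a = RandomForestClassifier(random_state=0).fit(X_onehot, y_train)
rf_b = RandomForestClassifier(random_state=0).fit(X_label, y_train)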

So you need to test different models. As @Aditya said, the improvement isn't guaranteed either way unless you have strong baselines.

Answered by Daniel Chepenko on February 24, 2021
