Stacking doesn't improve accuracy

Data Science - Asked by fitting on February 24, 2021

I am trying to build a two-level stacking model to tackle a multiclass classification problem with 8 classes.
My base (level 1) models and their micro F1 scores on the test set are:

  1. Random Forest Classifier (0.51)
  2. XGBoost Classifier (0.54)
  3. LightGBM Classifier (0.54)
  4. Logistic Regression (0.44)
  5. Keras neural network (0.57)
  6. Keras neural network (0.56)

As the level 2 model I use an untuned XGBClassifier.
I use 7-fold cross-validation to produce the meta features for the level 2 model.
The code I use to produce the meta features for the simple classifiers is:

import numpy as np
from sklearn.model_selection import StratifiedKFold

ntrain = X_train.shape[0]
ntest = X_test.shape[0]
seed = 0
nfolds = 7
# shuffle must be enabled for random_state to take effect
kf = StratifiedKFold(n_splits=nfolds, shuffle=True, random_state=seed)

def get_meta(clf, X_train, y_train, X_test):
    # out-of-fold predictions on the training set become the meta feature;
    # a model refit on the full training set produces the test meta feature
    meta_train = np.zeros((ntrain,))
    meta_test = np.zeros((ntest,))

    for i, (train_index, test_index) in enumerate(kf.split(X_train, y_train)):
        X_tr = X_train.iloc[train_index]
        y_tr = y_train.iloc[train_index]
        X_te = X_train.iloc[test_index]

        clf.fit(X_tr, y_tr)
        meta_train[test_index] = clf.predict(X_te)

    clf.fit(X_train, y_train)
    meta_test = clf.predict(X_test)
    return meta_train.reshape(-1, 1), meta_test.reshape(-1, 1)

and the code for the Keras neural networks is:

from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

# class_weights is assumed to be defined elsewhere
def get_meta_keras(clf, X_train, y_train, X_test, epochs=200, batch_size=70, class_weight=class_weights):
    meta_train = np.zeros((ntrain,))
    meta_test = np.zeros((ntest,))

    # integer-encode the labels, then convert to one-hot vectors for Keras
    encoder = LabelEncoder()
    encoder.fit(y_train)
    encoded_Y = encoder.transform(y_train)
    dummy_y = np_utils.to_categorical(encoded_Y)

    for i, (train_index, test_index) in enumerate(kf.split(X_train, y_train)):
        X_tr = X_train.iloc[train_index]
        y_tr = dummy_y[train_index]
        X_te = X_train.iloc[test_index]

        clf.fit(X_tr, y_tr, epochs=epochs, batch_size=batch_size, class_weight=class_weight)
        meta_train[test_index] = clf.predict_classes(X_te)

    clf.fit(X_train, dummy_y, epochs=epochs, batch_size=batch_size, class_weight=class_weight)
    meta_test = clf.predict_classes(X_test)
    return meta_train.reshape(-1, 1), meta_test.reshape(-1, 1)
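
For context, here is a rough sketch of how the meta features are then stacked and fed to the level 2 XGBClassifier. The base-model variable names (rf_clf, xgb_clf, lgbm_clf, lr_clf, nn1, nn2) are placeholders, not the actual objects from my code:

from xgboost import XGBClassifier

# placeholder names for the six base models listed above
base_models = [rf_clf, xgb_clf, lgbm_clf, lr_clf]
keras_models = [nn1, nn2]

meta_train_parts, meta_test_parts = [], []
for model in base_models:
    m_tr, m_te = get_meta(model, X_train, y_train, X_test)
    meta_train_parts.append(m_tr)
    meta_test_parts.append(m_te)
for model in keras_models:
    m_tr, m_te = get_meta_keras(model, X_train, y_train, X_test)
    meta_train_parts.append(m_tr)
    meta_test_parts.append(m_te)

# each get_meta* call contributes one column of class predictions
meta_X_train = np.hstack(meta_train_parts)
meta_X_test = np.hstack(meta_test_parts)

# untuned level 2 model; assumes y_train is integer-encoded (0..7)
level2 = XGBClassifier(random_state=seed)
level2.fit(meta_X_train, y_train)
final_pred = level2.predict(meta_X_test)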

My final micro F1 score is 0.54, which is lower than the best of my base models' scores. My base models are not strongly correlated (corr < 0.55). I tried adding more simple models such as kNN and naive Bayes, but the score dropped even further.
Why doesn't my stacking approach improve the score?

2 Answers

I'm an old man who likes simple things :D

So I would try a few more basic options for the level 2 model:

  • majority voting (it hardly gets simpler than that!)
  • linear regression
  • single decision tree
  • SVM

Apart from the fact that I'm old, there are two reasons why these could be useful:

  • smaller risk of overfitting: if the level 2 model is too complex it tends to overfit, and in my experience overfitting at level 2 carries a high price in performance when stacking learners.
  • scrutiny: one can easily inspect how the predictions are combined.
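
For illustration, here is a minimal sketch of the simplest of these options, majority voting over the stacked class predictions. It assumes the meta features are integer class labels, as produced by the question's get_meta functions, and meta_X_test is an assumed name for the matrix holding one column of test predictions per base model:

import numpy as np

def majority_vote(meta_preds):
    # meta_preds: (n_samples, n_models) array of integer class labels
    # returns the most frequent class per row (ties go to the smaller label)
    return np.apply_along_axis(
        lambda row: np.bincount(row.astype(int)).argmax(), 1, meta_preds)

# final_pred = majority_vote(meta_X_test)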

Answered by Erwan on February 24, 2021

Although there is no data provided, I'll try to share my ideas.

Stacking

We stack models by making predictions on hold-out data sets and then collecting these predictions to form a new data set, on which we fit a new model.

Oftentimes the stacked model (also called the 2nd-level model) will outperform each of the individual models due to its smoothing nature and its ability to highlight each base model where it performs best and discount it where it performs poorly.

Model diversity

But one of the most important things in stacking is model diversity: how different the models are from one another. You should ask yourself what new information each model brings to the table.

The meta model learns where one of your models is good and where another one is bad or fairly weak, so you don't need to worry too much about making every model really strong. What you really need to focus on is what information a model brings, even if it is generally weak; such models contribute new information that the meta model can leverage.

Normally, you introduce diversity in two ways:

  1. By choosing a different algorithm. This makes sense because different algorithms capitalize on different relationships within the data: for example, a linear model will focus on linear relationships, while a non-linear model can better capture non-linear relationships, so their predictions will come out a bit different.

  2. By running the same algorithm on different transformations of the input data, either with fewer features or with a completely different transformation. For example, in one data set you may one-hot encode the categorical features; in another you may just use label encoding, which will probably produce a very different model (see the sketch below).
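
As a rough sketch of the second point: the same algorithm trained on two different encodings of the categorical features will usually produce noticeably different models. The column names and the choice of a random forest here are just illustrative assumptions:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier

cat_cols = ["city", "device"]   # placeholder categorical columns

# Variant A: one-hot encode the categorical features
X_onehot = pd.get_dummies(X_train, columns=cat_cols)

# Variant B: label (ordinal) encode the same features
X_label = X_train.copy()
X_label[cat_cols] = OrdinalEncoder().fit_transform(X_label[cat_cols])

# same algorithm, two representations of the same data: the two models
# will make different errors, which adds diversity for stacking
rf_a = RandomForestClassifier(random_state=0).fit(X_onehot, y_train)
rf_b = RandomForestClassifier(random_state=0).fit(X_label, y_train)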

So you need to test different models. As @Aditya said, the improvement isn't guaranteed either way unless you have strong baselines.

Answered by Daniel Chepenko on February 24, 2021
