TPOT machine learning

Data Science Asked on August 7, 2021

I trained a TPOT regression algorithm on Google Colab, where the output of the TPOT process is some boilerplate Python code, as shown below.

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -4.881434802676966
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=ExtraTreesRegressor(bootstrap=False, max_features=0.9000000000000001, min_samples_leaf=1, min_samples_split=20, n_estimators=100)),
    ExtraTreesRegressor(bootstrap=True, max_features=0.9000000000000001, min_samples_leaf=6, min_samples_split=13, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Would anyone know how this sklearn pipeline process works? When I fill in the boilerplate code and run it with my data set in IPython, I can see this output from the pipeline process. What is it all doing?

Pipeline(steps=[('stackingestimator-1',
                 StackingEstimator(estimator=ExtraTreesRegressor(max_features=0.6500000000000001,
                                                                 min_samples_leaf=19,
                                                                 min_samples_split=14,
                                                                 random_state=1))),
                ('maxabsscaler', MaxAbsScaler()),
                ('stackingestimator-2',
                 StackingEstimator(estimator=ExtraTreesRegressor(max_features=0.4,
                                                                 min_samples_leaf=3,
                                                                 min_samples_split=7,
                                                                 random_state=1))),
                ('adaboostregressor',
                 AdaBoostRegressor(learning_rate=0.001, loss='exponential',
                                   n_estimators=100, random_state=1))])

The results look good; I'm just curious about how the pipeline process works, so any tips or links to tutorials would be greatly appreciated. I thought this machinelearningmastery tutorial was also somewhat useful for anyone interested in learning more about TPOT.

One Answer

The documentation for StackingEstimator is surprisingly poor, but what it does is relatively simple: it fits the wrapped estimator on the data, then tacks that estimator's predictions onto the dataset as a new feature (see the source and this github issue).
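
To make that concrete, here is a minimal sketch of a StackingEstimator-style transformer. SimpleStackingEstimator is a made-up name for illustration, and this covers only the regressor case; TPOT's actual implementation also handles classifiers by appending predict_proba columns:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SimpleStackingEstimator(BaseEstimator, TransformerMixin):
    # Illustrative stand-in for tpot.builtins.StackingEstimator (regressor case)
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        # Fit the wrapped estimator on the training data
        self.estimator.fit(X, y, **fit_params)
        return self

    def transform(self, X):
        # Tack the wrapped estimator's predictions onto X as one extra column
        preds = np.reshape(self.estimator.predict(X), (-1, 1))
        return np.hstack((preds, X))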

So, your pipeline fits an ExtraTreesRegressor on the original inputs, and appends its predictions to the dataset going forward. The data (original + 1st predictions) get scaled, then another ExtraTreesRegressor is fit (on scaled original and 1st preds, and with different hyperparameters), its predictions also getting tacked onto the dataset. Finally, an AdaBoostRegressor is fit on scaled-original + scaled-1st-preds + 2nd-preds.
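
A quick way to see this end to end is to trace the shape of the feature matrix through the fitted pipeline. This sketch assumes exported_pipeline has already been fit on training_features, as in the exported script above:

import numpy as np

X = np.asarray(training_features, dtype=np.float64)
print('input:', X.shape)
# Every step but the last is a transformer; each StackingEstimator widens X by one column
for name, step in exported_pipeline.steps[:-1]:
    X = step.transform(X)
    print('after', name + ':', X.shape)
# The final estimator predicts from the widened matrix
final_name, final_step = exported_pipeline.steps[-1]
results = final_step.predict(X)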

Correct answer by Ben Reiniger on August 7, 2021
