How to construct a pipeline with alternative transformations for different kinds of features in scikit-learn?

Question

I am trying to construct a pipeline in sklearn that applies different (in some cases multiple) transformations to different kinds of features (numeric, ordinal, binary nominal, and non-binary non-ordinal nominal). An additional twist is that I want to try out alternatives for specific transformations in the pipeline, including skipping a transformation entirely (None).
So far I have tried the following:
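from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler, StandardScaler

preprocess = make_column_transformer(
    (numerical_columns, make_pipeline(RobustScaler(), PolynomialFeatures())),
    (categorical_columns, make_pipeline(OneHotEncoder())),
    (ordinal_columns, "passthrough"),
    (binary_columns, "passthrough"),
)

search_pipeline = Pipeline([("preprocessing", preprocess),
                            ("dimred", PCA()),
                            ("classifier", RandomForestClassifier())])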

search_parameters = [
    {"preprocessing__pipeline__robustscaler": [None]},
    {"preprocessing__pipeline__robustscaler": [RobustScaler()]},
    {"preprocessing__pipeline__robustscaler": [StandardScaler()]},

    {"preprocessing__pipeline__polynomialfeatures": [None]},
    {"preprocessing__pipeline__polynomialfeatures": [PolynomialFeatures(degree=2)],
     "preprocessing__pipeline__polynomialfeatures__interaction_only": [False, True]},

    {"dimred": [None]},
    {"dimred": [PCA()], "dimred__n_components": [.95, .75]},
    {"dimred": [LinearDiscriminantAnalysis()], "dimred__n_components": [.95, .75]},

    {"classifier": [KNeighborsClassifier(weights="distance")],
     "classifier__n_neighbors": [3, 7, 11]},
    {"classifier": [RandomForestClassifier(n_estimators=100, class_weight="balanced")],
     "classifier__max_depth": [5, 10, None]},
]

As you can see, I tried, for example, to apply different scalers to the numerical features:
None
RobustScaler
StandardScaler
However, after running GridSearchCV:
CV = GridSearchCV(search_pipeline,
                  search_parameters, cv=5,
                  scoring="f1_weighted",
                  refit=True,
                  n_jobs=-1)

CV.fit(train_X, train_y)

I get this error message:
ValueError: Invalid parameter robustscaler for estimator ColumnTransformer(transformers=[('list-1',
                                 ['income', 'reside', 'address', 'wireten',
                                  'tollten', 'equipten', 'cardten', 'longten',
                                  'age', 'employ', 'tenure'],
                                 Pipeline(steps=[('robustscaler',
                                                  RobustScaler()),
                                                 ('polynomialfeatures',
                                                  PolynomialFeatures())])),
                                ('list-2', ['region', 'custcat'],
                                 Pipeline(steps=[('onehotencoder',
                                                  OneHotEncoder())])),
                                ('list-3', ['ed'], 'passthrough'),
                                ('list-4',
                                 ['retire', 'callid', 'gender', 'marital',
                                  'tollfree', 'equip', 'callcard', 'wireless',
                                  'multline', 'voice', 'pager', 'internet',
                                  'callwait', 'forward', 'confer', 'ebill'],
                                 'passthrough')]). Check the list of available parameters with `estimator.get_params().keys()`.
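
The hint at the end of the error can be followed literally. The sketch below is a minimal illustration, not a confirmed diagnosis: it assumes a scikit-learn release in which make_column_transformer takes (transformer, columns) tuples, which the auto-generated names 'list-1' to 'list-4' in the error seem to suggest, and it uses shortened, hypothetical column lists. It prints the keys under which the scaler step can be addressed in a parameter grid.

```python
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler

# Hypothetical column lists, shortened from the ones shown in the error message.
numerical_columns = ["income", "age"]
categorical_columns = ["region", "custcat"]

# Assumption: each tuple is (transformer, columns), not (columns, transformer).
preprocess = make_column_transformer(
    (make_pipeline(RobustScaler(), PolynomialFeatures()), numerical_columns),
    (make_pipeline(OneHotEncoder()), categorical_columns),
)

search_pipeline = Pipeline([
    ("preprocessing", preprocess),
    ("dimred", PCA()),
    ("classifier", RandomForestClassifier()),
])

# Any key returned by get_params() is a valid key for a parameter grid.
scaler_keys = sorted(
    key for key in search_pipeline.get_params() if key.endswith("robustscaler")
)
print(scaler_keys)
```

Because two transformers here are Pipeline objects, the generated names carry a numeric suffix, so a key without that suffix would not match any step.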

I suspect that the syntax I use in search_parameters to address the parameters of specific transformers is incorrect, but what is the correct syntax?
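
One possible way to make the keys predictable is to name every step explicitly with ColumnTransformer and Pipeline rather than rely on generated names. The following is a sketch under stated assumptions: the step names "num", "cat", "scaler" and "poly" are hypothetical, the data is synthetic, and the dimensionality-reduction step is left out to keep the example small. Steps are skipped with "passthrough".

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    OneHotEncoder, PolynomialFeatures, RobustScaler, StandardScaler,
)

# Synthetic stand-in data; the column names are borrowed from the error message.
rng = np.random.default_rng(0)
train_X = pd.DataFrame({
    "income": rng.normal(size=60),
    "age": rng.normal(size=60),
    "region": rng.integers(0, 3, size=60),
})
train_y = rng.integers(0, 2, size=60)

# Explicit (hypothetical) step names give stable parameter keys.
numeric = Pipeline([("scaler", RobustScaler()), ("poly", PolynomialFeatures())])
preprocess = ColumnTransformer([
    ("num", numeric, ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
search_pipeline = Pipeline([
    ("preprocessing", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Keys follow <pipeline step>__<transformer name>__<inner step>.
search_parameters = {
    "preprocessing__num__scaler": ["passthrough", RobustScaler(), StandardScaler()],
    "preprocessing__num__poly": ["passthrough", PolynomialFeatures(degree=2)],
}

CV = GridSearchCV(search_pipeline, search_parameters, cv=3, scoring="f1_weighted")
CV.fit(train_X, train_y)
print(sorted(CV.best_params_))
```

A single dictionary like this searches the full cross product of scalers and polynomial options. A list of dictionaries, as in the question, searches each dictionary separately and leaves every unlisted step at its default.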
