
Logic Check: Building a SKLearn Pipeline

Data Science Asked on May 5, 2021

I am new to the concept of building a pipeline in SKLearn and would appreciate some sense-checking to ensure that I am not leaking information from my test set into my training process.

Background:

I have a sparse, high-dimensional dataset (370×1000) with a continuous variable as the target. So far I have been running a random forest regression on all the features with a 90/10 split, tuning parameters via grid search on the training set, and then running 5-fold cross-validation on the optimized model (using the entire dataset).

Problems with this approach:

As I understand the situation, there are a number of things I am doing that might be harming the model and introducing undesired bias. Specifically, my concerns are:

  1. As I am tuning parameters only on the initial train/test split, I am not accounting for the other split combinations that arise during the K-Fold CV. Might the optimal settings for fold 1 differ from those for fold 2? Intuitively, I would assume so.

  2. I am not doing any feature selection that might remove redundant features and shrink my feature space (I know RF generally copes well with high-dimensional data, but I would still like to try). One suggestion I have read is to remove features with very low variance. But I find myself in the same conundrum as above: if I remove low-variance features based only on the original training set, I am not accounting for the other combinations that arise during K-fold CV; conversely, if I remove all low-variance features before splitting the data, I am surely leaking information between the train and test sets.

  3. An alternative approach I have seen is recursive feature elimination with cross-validation (RFECV in SKLearn). This looks promising, as I understand it will partition the dataset into folds, conduct RFE on each, and give me an averaged score for the best number of features to keep.
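For reference, the RFECV usage I have in mind is roughly the following (assuming X and y hold my 370×1000 feature matrix and continuous target):

from sklearn.ensemble          import RandomForestRegressor
from sklearn.feature_selection import RFECV

selector = RFECV(estimator=RandomForestRegressor(n_estimators=100, random_state=0),
                 step=50,        # drop 50 features per elimination round
                 cv=5,           # score each feature count with 5-fold CV
                 scoring='r2')
selector.fit(X, y)
print(selector.n_features_)      # CV-chosen number of features to keep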

My possible solution:

I have been doing some reading around Pipelines in SKLearn and think that might be the way to go. My understanding is that an advantage of a pipeline is that I can chain transforms together so that each transform is fitted only on the training portion of each fold, which would let me address the problems detailed above. What I am considering, and what I would appreciate sense-checking on from anyone with more experience, is the following:

  1. As the dataset is small, I would not make a single conventional train/test split, but would instead use K-Fold across the whole dataset.

  2. Run an RF (using default params) with K-Fold to get a baseline level of performance.

  3. Create a pipeline whereby I (3.1) create folds, (3.2) within each fold find the optimal number of features to keep, (3.3) tune hyper-parameters for that fold, and finally (3.4) predict the values in the held-out fold (a rough sketch follows below).
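To make step 3 concrete, the sort of nested set-up I am picturing looks like this (assuming X and y are defined as above, and using a simple variance threshold as a stand-in for whatever selector I settle on):

from sklearn.ensemble          import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection   import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline          import Pipeline

pipe = Pipeline([('select', VarianceThreshold()),
                 ('rf', RandomForestRegressor(random_state=0))])

param_grid = {'select__threshold': [0.0, 0.01, 0.05],   # 3.2: how aggressively to select
              'rf__max_depth':     [5, 10, None]}       # 3.3: hyper-parameters per fold

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # tuning folds
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # 3.1/3.4: evaluation folds

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='r2')
scores = cross_val_score(search, X, y, cv=outer_cv, scoring='r2')  # selection and tuning see only training folds
print(scores.mean())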

As you might be able to tell, I am struggling to get to grips with what order things should go in, and whether step 3 is actually what a Pipeline does. If someone can provide pointers/recommendations/corrections, it would be appreciated.

One Answer

You are on the right path. It appears you might have analysis paralysis. You should start building, then see what works and what does not work.

Here is code to get you started:

from sklearn.ensemble             import RandomForestRegressor 
from sklearn.feature_selection    import VarianceThreshold
from sklearn.model_selection      import GridSearchCV
from sklearn.pipeline             import Pipeline

# Feature selection and the model live in one pipeline, so the
# VarianceThreshold is refit on each training fold and never sees held-out data
regressor = Pipeline([('vt', VarianceThreshold(threshold=0.0)),
                      ('rf', RandomForestRegressor())])

# Tune the selection threshold and the forest together (5-fold CV by default)
gsvc = GridSearchCV(regressor, {'vt__threshold':    [0, .1, .5, .7, 1],
                                'rf__n_estimators': [10, 100, 1_000],
                                'rf__max_depth':    [2, 10, 100]})
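To actually run it (with X and y standing in for your feature matrix and target):

gsvc.fit(X, y)                               # X, y: your 370x1000 features and continuous target
print(gsvc.best_params_, gsvc.best_score_)

Because the threshold is a pipeline step, feature selection is redone on every training split inside the grid search, so nothing leaks from the held-out folds. If you also want an unbiased performance estimate on top of the tuning, wrap gsvc in cross_val_score (nested cross-validation).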

Correct answer by Brian Spiering on May 5, 2021
