TransWikia.com

How to impute using simple imputer (custom function)

Data Science Asked on June 30, 2021

I am imputing my data using SimpleImputer from sklearn, and I want to test several different ways of transforming the data. For logistic regression, I would like to:

  • replace NaNs with the mode
  • replace +infs with the max and -infs with the min
  • use StandardScaler.

Then, for XGBoost, I would like to:

  • simply replace +infs/-infs with very large positive/negative numbers.

I have been playing with sklearn's Pipeline, and I would like to know how I can pass custom imputers through the pipeline. For example:

logistic_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                    ('std_scaler', StandardScaler()),
                                    ('model', LogisticRegression())])

But how do I incorporate the following function into it? It replaces +infs in the training dataset (df) with the max of that column (and -infs with the min), then uses the same training values to fill in the test set. How can I do this using a pipeline?

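import numpy as np

def replace_pos_inf(df, dftest, numeric_features):
    # Replace +inf with the largest non-inf training value of each column.
    # Assigning back avoids chained in-place replace, which newer pandas rejects.
    for col in df[numeric_features].columns:
        m = df.loc[df[col] != np.inf, col].max()
        df[col] = df[col].replace(np.inf, m)
        dftest[col] = dftest[col].replace(np.inf, m)

    # Replace -inf with the smallest non-inf training value of each column.
    for col in df[numeric_features].columns:
        mini = df.loc[df[col] != -np.inf, col].min()
        df[col] = df[col].replace(-np.inf, mini)
        dftest[col] = dftest[col].replace(-np.inf, mini)

    return df, dftest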

One Answer

Since you want to save the training min/max and use those to replace infs in the test set, you need a custom transformer. To build a robust transformer, you should use some of sklearn's validation functions. It's also best to work in NumPy, since, as you point out, an earlier transformer in a pipeline will already have converted an input DataFrame to a NumPy array. (You could stick with DataFrames and either convert inside your transformer, losing some efficiency, or make sure your transformer always comes first in a pipeline, which is fine if you're the only one using your code.)
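Before wrapping this in a class, the fit-on-train, apply-to-test idea can be sketched in plain NumPy. This is only an illustration: the toy arrays and the names col_min and col_max are made up for the example.

```python
import numpy as np

# Hypothetical toy data: a single column containing +inf, -inf and nan.
X_train = np.array([[1.0], [np.inf], [-np.inf], [np.nan], [7.0]])
X_test = np.array([[np.inf], [-np.inf], [np.nan], [3.0], [100.0]])

# "Fit": learn per-column bounds from the finite training values only.
finite = np.isfinite(X_train)
col_min = np.amin(X_train, axis=0, where=finite, initial=np.inf)
col_max = np.amax(X_train, axis=0, where=finite, initial=-np.inf)

# "Transform": reuse the stored training bounds on any data set.
X_train_clean = np.clip(X_train, col_min, col_max)
X_test_clean = np.clip(X_test, col_min, col_max)
```

Note one difference from a plain replace: clipping also caps finite test values that fall outside the training range (the 100.0 above becomes 7.0), while NaNs pass through unchanged and are left for a later imputer.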

Here's a simple version. I've opted not to take numeric_features as an input; you're probably better off using a ColumnTransformer to do that kind of selection for you.

import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted
from sklearn.base import TransformerMixin, BaseEstimator

class ReplaceInf(TransformerMixin, BaseEstimator):
    '''Replace +-np.inf with the max/min finite values in each column.

    Attributes
    ----------
    mins_ : np.ndarray
        Per-column minimum finite values.
    maxs_ : np.ndarray
        Per-column maximum finite values.
    '''

    def fit(self, X, y=None):
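        # validate and convert if possible, letting inf and nan through
        # (scikit-learn 1.6 renamed `force_all_finite` to `ensure_all_finite`):
        X = check_array(X, force_all_finite=False)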
        # using `where=np.isfinite(X)`, nan's won't affect the min/max calculation,
        #  and using `clip` to transform will preserve nan's as well.
        self.mins_ = np.amin(X, axis=0, where=np.isfinite(X), initial=np.inf)
        self.maxs_ = np.amax(X, axis=0, where=np.isfinite(X), initial=-np.inf)
        return self

    def transform(self, X):
        # fail with a clear error if fit() has not been called yet
        check_is_fitted(self)
        X = check_array(X, force_all_finite=False)
        return np.clip(X, self.mins_, self.maxs_)
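
As a rough sketch of how such a transformer might slot into the pipeline from the question, the example below puts it first (SimpleImputer rejects infinite values, so they have to be handled before it) and lets a ColumnTransformer pick the numeric columns. Everything here is illustrative: ReplaceInfSketch is a stripped-down stand-in without input validation, the data is made up, and the column indices in numeric_features are assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ReplaceInfSketch(TransformerMixin, BaseEstimator):
    """Compact stand-in for the transformer above (no input validation)."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        finite = np.isfinite(X)
        self.mins_ = np.amin(X, axis=0, where=finite, initial=np.inf)
        self.maxs_ = np.amax(X, axis=0, where=finite, initial=-np.inf)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.mins_, self.maxs_)


# Hypothetical toy data: two numeric columns containing inf and nan.
X_train = np.array([
    [1.0, 10.0],
    [np.inf, 20.0],
    [3.0, -np.inf],
    [np.nan, 40.0],
    [5.0, 50.0],
    [6.0, 60.0],
])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[np.inf, -np.inf], [2.0, np.nan]])

numeric_features = [0, 1]  # assumed column indices
numeric_steps = Pipeline(steps=[
    ("replace_inf", ReplaceInfSketch()),  # must precede the imputer
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("std_scaler", StandardScaler()),
])
preprocess = ColumnTransformer([("num", numeric_steps, numeric_features)])
logistic_pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

# Bounds are learned from X_train only and reused when predicting on X_test.
logistic_pipeline.fit(X_train, y_train)
predictions = logistic_pipeline.predict(X_test)
```

For a tree-based model, a separate pipeline with different preprocessing steps and a different final estimator can be built the same way.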

Answered by Ben Reiniger on June 30, 2021
