Data Science Asked on June 30, 2021
I am imputing my data using SimpleImputer from sklearn. I want to test many different ways of applying transformations to the data, i.e. for logistic regression I would like to
then for using XGBoost I would like to:
I have been playing with sklearn's Pipeline and would like to know how I can pass custom imputers through the pipeline, e.g.:
logistic_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # the strategy name uses an underscore
    ('std_scaler', StandardScaler()),
    ('model', LinearRegression()),
])
But how do I incorporate the following function into it? It replaces infs in the training dataset (df) with the max of that column, then uses the same max to fill the infs in the test set (and likewise -inf with the min). How can I do this using a pipeline?
def replace_pos_inf(df, dftest, numeric_features):
    for col in df[numeric_features].columns:
        m = df.loc[df[col] != np.inf, col].max()
        df[col].replace(np.inf,m,inplace=True)
        dftest[col].replace(np.inf,m,inplace=True)
    for col in df[numeric_features].columns:
        mini = df.loc[df[col] != -np.inf, col].min()
        df[col].replace(-np.inf,mini,inplace=True)
        dftest[col].replace(-np.inf,mini,inplace=True)
    return df,dftest
Since you want to save the training min/max and use those to replace infs in the test set, you need a custom transformer. To build a robust transformer, you should use some of sklearn's validation functions. It's also best to work in NumPy since, as you point out, an earlier transformer in a pipeline will already have converted an input dataframe to a NumPy array. (You could stick with dataframes and either convert inside your transformer, losing some efficiency, or make sure your transformer always comes first in the pipeline, which is fine if you're the only one using your code.)
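To make the idea concrete outside of any transformer, here is a minimal NumPy sketch. The arrays `train` and `test` are made-up illustration data, not from the question. The finite per-column min/max are learned from the training array only and then used to clip both arrays, so NaNs pass through untouched.

```python
import numpy as np

# Hypothetical toy data: two columns containing +/-inf and NaN.
train = np.array([[1.0, 10.0],
                  [np.inf, 20.0],
                  [3.0, -np.inf],
                  [np.nan, 30.0]])
test = np.array([[np.inf, -np.inf],
                 [2.0, np.nan]])

# Only finite entries take part in the reductions; `initial` is required
# whenever `where` is passed to amin/amax.
finite = np.isfinite(train)
mins = np.amin(train, axis=0, where=finite, initial=np.inf)
maxs = np.amax(train, axis=0, where=finite, initial=-np.inf)

# Clipping maps +inf to the training max and -inf to the training min.
train_clipped = np.clip(train, mins, maxs)
test_clipped = np.clip(test, mins, maxs)
```

Note that clipping also caps finite test values that fall outside the training range, which the original `replace`-based function does not do; whether that is acceptable depends on the use case.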
Here's a simple version. I'm opting not to take numeric_features as an input; you're probably better off using a ColumnTransformer to do that kind of selection for you.
import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted
from sklearn.base import TransformerMixin, BaseEstimator


class ReplaceInf(TransformerMixin, BaseEstimator):
    '''Replace +-np.inf with the max/min finite values in each column.

    Attributes
    ----------
    mins_ : np.ndarray
        Per-column minimum finite values.
    maxs_ : np.ndarray
        Per-column maximum finite values.
    '''
    def fit(self, X, y=None):
        # validate and convert if possible
        # (newer scikit-learn releases rename this argument to `ensure_all_finite`):
        X = check_array(X, force_all_finite=False)
        # using `where=np.isfinite(X)`, nan's won't affect the min/max calculation,
        # and using `clip` to transform will preserve nan's as well.
        self.mins_ = np.amin(X, axis=0, where=np.isfinite(X), initial=np.inf)
        self.maxs_ = np.amax(X, axis=0, where=np.isfinite(X), initial=-np.inf)
        return self

    def transform(self, X):
        # raise NotFittedError if fit() has not been called yet
        check_is_fitted(self)
        X = check_array(X, force_all_finite=False)
        return np.clip(X, self.mins_, self.maxs_)
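As a rough sketch of how such a transformer could slot into the pipeline from the question, the block below puts it ahead of the imputer and lets a ColumnTransformer pick the numeric columns. Everything here is illustrative: `ClipInf` is a compact stand-in for the class above (it uses `np.asarray` instead of sklearn's validation so the block runs on its own), the column names and data are invented, and `LogisticRegression` is my assumption for the final step since the question mentions logistic regression.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ClipInf(TransformerMixin, BaseEstimator):
    """Compact stand-in for ReplaceInf so this block is self-contained."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        finite = np.isfinite(X)
        self.mins_ = np.amin(X, axis=0, where=finite, initial=np.inf)
        self.maxs_ = np.amax(X, axis=0, where=finite, initial=-np.inf)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.mins_, self.maxs_)


# Hypothetical data with infs and NaNs in both train and test.
numeric_features = ["a", "b"]
X_train = pd.DataFrame({"a": [1.0, np.inf, 3.0, np.nan, 2.0, 0.5],
                        "b": [10.0, 20.0, -np.inf, 30.0, 15.0, 25.0]})
y_train = np.array([0, 1, 0, 1, 0, 1])
X_test = pd.DataFrame({"a": [np.inf, 2.0], "b": [-np.inf, np.nan]})

# Infs are replaced first, then NaNs are imputed, then features are scaled.
numeric_pipeline = Pipeline(steps=[
    ("replace_inf", ClipInf()),
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("std_scaler", StandardScaler()),
])
preprocess = ColumnTransformer([("num", numeric_pipeline, numeric_features)])
logistic_pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

# fit() learns the min/max on the training data only;
# predict() reuses those stored values on the test data.
logistic_pipeline.fit(X_train, y_train)
predictions = logistic_pipeline.predict(X_test)
```

Because the min/max live on the fitted transformer, the same pipeline object can be passed to cross-validation or grid search and the statistics will be re-learned on each training fold.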
Answered by Ben Reiniger on June 30, 2021