Data Science Asked on January 15, 2021
Let’s say I have a categorical feature (cat):
import random
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
random.seed(1234)
y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame.from_dict({"y": y, "cat": cat})
and I want to use target encoding with regularisation using CV like below:
X_train, X_test, y_train, y_test = train_test_split(df[["cat"]], df["y"], train_size=0.8, random_state=42)
df_train = pd.concat([X_train, y_train], axis=1).sort_index()
df_train["kfold"] = -1
idx = df_train.index
df_train = df_train.sample(frac=1)
skf = StratifiedKFold(n_splits=5)
for fold_id, (train_id, val_id) in enumerate(skf.split(X=df_train.drop("y", axis=1), y=df_train["y"])):
    df_train.iloc[val_id, df_train.columns.get_loc("kfold")] = fold_id
df_train = df_train.loc[idx]
encoded_dfs = []
for fold in df_train["kfold"].unique():
    df_train_cv = df_train[df_train["kfold"] != fold].copy()
    df_val_cv = df_train[df_train["kfold"] == fold].copy()
    means = df_train_cv.groupby('cat')['y'].mean()
    df_val_cv['cat'] = df_val_cv['cat'].map(means)
    encoded_dfs.append(df_val_cv)
encoded_dfs = pd.concat(encoded_dfs, axis=0).sort_index()
encoded_dfs.drop('kfold', axis=1, inplace=True)
However, I have some doubts about how I should then encode the test set. As there is no single mapping deduced from the train set, I think we should use the whole train set to fit the encodings and then apply them to the test set:
means = df_train.groupby('cat')['y'].mean()
X_test['cat'] = X_test['cat'].map(means)
It seems to be the natural way to do it since, in fact, this is exactly what the CV step mimics. But the results of the model I got were off, which made me wonder whether I am missing something. Please note that, for the sake of simplicity, I omitted the additional smoothing I did as well. Therefore, my question is: is this the correct way to encode the test set?
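The question does not say which smoothing was used. As an illustration only, one common choice is an m-estimate, which shrinks each category mean towards the global target mean. The function name `smoothed_means`, the parameter `m` and the toy data below are assumptions, not part of the original post.

```python
import pandas as pd


def smoothed_means(df, cat_col, target_col, m=10.0):
    """Hypothetical helper: m-estimate of the per-category target mean.

    Each category mean is pulled towards the global mean (the prior);
    larger m means stronger shrinkage, especially for rare categories.
    """
    prior = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["sum", "count"])
    return (stats["sum"] + m * prior) / (stats["count"] + m)


# Toy training data, made up for illustration.
train = pd.DataFrame(
    {"cat": ["A", "A", "B", "B", "B", "C"], "y": [1, 0, 0, 0, 1, 1]}
)
enc = smoothed_means(train, "cat", "y", m=2.0)
print(enc)
```

Here category C has a single positive row, so its raw mean is 1.0, while the smoothed value sits between that and the global mean of 0.5.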
"I have some doubts about the way how I should then encode test set. As there is no single mapping deduced from train set I think we should use the whole train set to fit the encodings and then use it on test set"
Yep, that seems fine, although the way you do it there is a bit more complicated than using a pipeline. The point of splitting into train and test is to mimic how the model will behave on unseen data in production. Computing the target encoding on the test set is data leakage and gives a misrepresentation of how the model will behave in production. So you compute the target means on train and then apply them to test.
If you do this and the test set contains a category that was not seen in training, map will return NaN for it, and most models will then throw an error. The target encoder in the category_encoders library lets you deal with this:
handle_missing: str options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
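You can handle it in different ways; the best one depends on your problem. The default is to return the target mean. The companion parameter handle_unknown, which governs categories not seen during fit, accepts the same options and also defaults to ‘value’.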
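For the manual pandas approach from the question, the same fallback can be sketched as follows. This is a minimal illustration on made-up data, assuming that filling unseen categories with the training-set target mean is acceptable for the problem at hand.

```python
import pandas as pd

# Toy data, made up for illustration.
train = pd.DataFrame({"cat": ["A", "A", "B", "B", "B"], "y": [1, 0, 0, 0, 1]})
test = pd.DataFrame({"cat": ["A", "B", "Z"]})  # "Z" never appears in train

# Fit the mapping on the training set only.
means = train.groupby("cat")["y"].mean()
global_mean = train["y"].mean()

# map() leaves unseen categories as NaN; fill them with the training prior.
test_encoded = test["cat"].map(means).fillna(global_mean)
print(test_encoded.tolist())
```

Without the `fillna` call the third value would be NaN, which most scikit-learn estimators reject.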
The best practice is to create a pipeline where the target encoding is a step (a transformer). This lets you do CV, evaluate your model on the test set, and use many other functionalities. (Here is a tutorial on how to do it.)
A code snippet:
import random
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from category_encoders.target_encoder import TargetEncoder
from category_encoders.m_estimate import MEstimateEncoder
from sklearn.linear_model import ElasticNet,LogisticRegression
random.seed(1234)
y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame.from_dict({"y": y, "cat": cat})
X_train, X_test, y_train, y_test = train_test_split(df[["cat"]], df["y"], train_size=0.8, random_state=42)
skf = StratifiedKFold(n_splits=5)
clf = LogisticRegression()
te = TargetEncoder()
pipe = Pipeline(
    [
        ("te", te),
        ("clf", clf),
    ]
)
# Grid to search for the hyperparameters
pipe_grid = {
    "te__smoothing": [0.0001],
}
# Instantiate the grid search
pipe_cv = GridSearchCV(
    pipe,
    param_grid=pipe_grid,
    n_jobs=-1,
    cv=skf,
)
pipe_cv.fit(X_train, y_train)
# Overwrite every test category with a value that was never seen in training.
X_test = X_test.copy()
X_test['cat'] = 'UUUUU'
pipe_cv.predict(X_test)
Note that the code is not optimal, but it should show you how to do target encoding on train and test with a pipeline while coping with unseen data :)
Note that the categories were assigned at random, so they carry no information about the target, and the model learns that the best it can do is predict the most frequent class. If you swap in ElasticNet (a regressor), you will get the mean instead.
If you remove the line that overwrites the test categories with an unseen value, you will still get the same results.
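For readers who prefer not to depend on category_encoders, the same idea can be sketched with a hand-written transformer. This is only an illustration: `MeanTargetEncoder` is a hypothetical class, it applies no smoothing, and it assumes unseen categories should fall back to the training-set target mean.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MeanTargetEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical minimal encoder: category -> training-set target mean."""

    def __init__(self, col="cat"):
        self.col = col

    def fit(self, X, y):
        # Align y with X so the groupby below matches rows correctly.
        y = pd.Series(list(y), index=X.index)
        self.prior_ = y.mean()
        self.means_ = y.groupby(X[self.col]).mean()
        return self

    def transform(self, X):
        X = X.copy()
        # Unseen categories map to NaN and then fall back to the prior.
        X[self.col] = X[self.col].map(self.means_).fillna(self.prior_)
        return X


# Toy data, made up for illustration.
X_train = pd.DataFrame({"cat": ["A", "A", "B", "B", "B", "A", "B", "A"]})
y_train = pd.Series([1, 0, 0, 0, 1, 1, 0, 1])
X_test = pd.DataFrame({"cat": ["A", "UUUUU"]})

pipe = Pipeline([("te", MeanTargetEncoder(col="cat")), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

pred = pipe.predict(X_test)
encoded_test = pipe.named_steps["te"].transform(X_test)
print(encoded_test["cat"].tolist(), pred)
```

Because the encoder is fitted inside the pipeline, any cross-validation wrapped around `pipe` refits the means on each training fold, so the held-out fold never contributes to its own encoding.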
Answered by Carlos Mougan on January 15, 2021