
I do feature engineering on the full dataset, is this wrong?

Data Science – Asked by doomdaam on December 22, 2020

I am aiming to predict the number of days it takes to sell a given property; let's call this variable "DaysForSale", or DfS for short.

Using DfS, I created a variable called "median_dfs_grouped_street_name", which holds the median number of days it takes to sell a property on each street in the dataset. (The street names are all categorically encoded.)

After this, I do my train/test split and train my Random Forest model.

Using the feature_importances_ attribute, I see that the new feature is the second most important, which makes me wonder whether this is the correct approach.
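
Roughly, this is what my pipeline looks like. The sketch below uses a toy DataFrame and simplified, hypothetical column names ("street_name" already numerically encoded, "size_sqm" just as a second feature) – the real data has more columns:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the real dataset
    df = pd.DataFrame({
        "street_name": [0, 0, 1, 1, 2, 2, 0, 1],   # street names already encoded as categories
        "size_sqm":    [50, 70, 60, 80, 55, 65, 90, 45],
        "DaysForSale": [30, 45, 10, 20, 60, 55, 40, 15],
    })

    # Feature engineered on the FULL dataset (the step I am unsure about)
    df["median_dfs_grouped_street_name"] = (
        df.groupby("street_name")["DaysForSale"].transform("median")
    )

    X = df.drop(columns=["DaysForSale"])
    y = df["DaysForSale"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
    for importance, name in sorted(zip(model.feature_importances_, X.columns), reverse=True):
        print(f"{name}: {importance:.3f}")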

I have two questions:

  1. Is it wrong to develop features using the target variable?
  2. Is it wrong to do feature engineering on the full dataset?

2 Answers

Is it wrong to develop features using the target variable?

Not necessarily. It is called "target encoding" or "mean encoding" and can be very useful. In your case you could, for example, use the DfS of your train data to calculate a median value per street. But you need to design the target encoding carefully to avoid overfitting (there are different strategies to do that – see the link below). And for the test data you can only use the target encoding computed from your train data.
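
A minimal sketch of that idea, continuing from the toy df and imports in the question's example (the fallback to the global training median for streets that only appear in the test split is my own assumption):

    # Leak-free median/target encoding: statistics come from the TRAIN split only
    train_df, test_df = train_test_split(df, random_state=42)

    street_median = train_df.groupby("street_name")["DaysForSale"].median()
    global_median = train_df["DaysForSale"].median()

    # This replaces the leaky full-dataset version of the feature.
    # (On the train split itself, out-of-fold or smoothed encoding is one
    # way to reduce overfitting.)
    train_df = train_df.assign(
        median_dfs_grouped_street_name=train_df["street_name"].map(street_median)
    )
    test_df = test_df.assign(
        median_dfs_grouped_street_name=test_df["street_name"]
        .map(street_median)
        .fillna(global_median)  # streets unseen in training fall back to the global train median
    )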

The Coursera course "How to Win a Data Science Competition: Learn from Top Kagglers" has great content on target/mean encoding.

Is it wrong to do feature engineering on the full dataset?

Not necessarily. As pointed out in Nicholas' answer, though, you need to be careful not to leak data.

Here's an example where it would be fine: let's assume one of your features is the listing date, i.e. the date on which the property was published for sale. You could, for example, add a feature to the whole dataset called days since listing, which simply counts the days between now and the listing date (see the sketch below). Your median, however, is an example that results in data leakage, because it is not "per row" feature engineering but "across rows" feature engineering applied to both train and test data.
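
To make the per-row vs. across-rows distinction concrete, here is a small self-contained sketch (the listing_date column is hypothetical, purely for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "street_name":  [0, 1, 0, 2],
        "listing_date": pd.to_datetime(["2020-01-10", "2020-02-01", "2020-03-15", "2020-04-20"]),
        "DaysForSale":  [30, 10, 45, 60],
    })

    # Per-row feature: each value depends only on its own row, so computing it
    # on the full dataset before splitting does not leak anything.
    df["days_since_listing"] = (pd.Timestamp.now() - df["listing_date"]).dt.days

    # Across-rows feature: each value aggregates over many rows (including rows
    # that later end up in the test split), so computing it on the full dataset
    # leaks target information into the training data.
    df["median_dfs_grouped_street_name"] = (
        df.groupby("street_name")["DaysForSale"].transform("median")
    )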

That's why the safer approach is to split the data first, remove the target variable from the validation/test data, and only then do feature engineering. That way you avoid any unintended data leakage.

Correct answer by Sammy on December 22, 2020

You’re correct: you should avoid feature engineering that brings information about the whole data set, including the testing data, into the training data set.

By involving your test data in the calculation of a median that is then available in your training data, you are leaking information from the testing data set into the training data set.

This article is a really helpful overview of data leakage and how to avoid it.

Answered by Nicholas James Bailey on December 22, 2020
