How to treat data transformation choices as hyperparameters?

Question

While reading the book Hands-On Machine Learning by Aurélien Géron, I came across this line:

Treat your data transformation choices as hyperparameters, especially
when you are not sure about them (e.g., if you’re not sure whether to
replace missing values with zeros or with the median value, or to just
drop the rows).

How exactly do I do that? Is there a way to do it via sklearn, or do I have to manually keep several datasets (each with a different transformation) and then fit a model on each of them?

shepan6 · Answer

The question is how to treat transformation choices as hyperparameters.
Here is how I would go about it:
Pick one baseline model architecture for the data and then, for each candidate transformation, repeat the following steps:

1. Instantiate the baseline model (that is, make sure all of the weights are freshly initialised).
2. Create the transformed dataset.
3. Train the model.
4. Compute generalisation performance measures (AUC, precision, recall, whatever suits the task).

Then compare the generalisation performance across all of the data transformations and pick the "best" one, i.e. the transformation that gives the best value of a generalisation metric appropriate for your task.
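As a rough illustration of that loop, here is a minimal sketch using scikit-learn. The synthetic data, the `prepare` helper, the choice of logistic regression as the baseline model, and the use of cross-validated AUC are all assumptions made for the example, not something prescribed by the book or by this answer.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with missing values (illustrative assumption only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries


def prepare(X, y, strategy):
    """Hypothetical helper: return (X, y, imputer) for one transformation choice."""
    if strategy == "drop":
        keep = ~np.isnan(X).any(axis=1)  # rows without any missing value
        return X[keep], y[keep], None
    if strategy == "zero":
        return X, y, SimpleImputer(strategy="constant", fill_value=0)
    if strategy == "median":
        return X, y, SimpleImputer(strategy="median")
    raise ValueError(f"unknown strategy: {strategy}")


scores = {}
for strategy in ["zero", "median", "drop"]:
    # Create the transformed dataset (or the imputer that produces it).
    X_t, y_t, imputer = prepare(X, y, strategy)
    # Instantiate a fresh baseline model for every transformation.
    model = LogisticRegression(max_iter=1000)
    steps = [model] if imputer is None else [imputer, model]
    pipeline = make_pipeline(*steps)
    # Train and compute a generalisation measure (mean cross-validated AUC).
    scores[strategy] = cross_val_score(
        pipeline, X_t, y_t, cv=5, scoring="roc_auc"
    ).mean()

best = max(scores, key=scores.get)
print(scores)
print("best transformation:", best)
```

Putting the imputer inside the pipeline means it is re-fitted on each training fold, so the median is never computed from validation rows. Note that the "drop" option is scored on fewer rows than the other two, so its score is not strictly comparable.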

Itamar Mushkin · Answer

What shepan6 is suggesting is basically to manually search for the best "transformation choice hyperparameters" by trying them all and seeing what performs best.
This is a good idea (I upvoted), but if you want to go further, you can use a package like hyperopt and define an "objective" function that accepts a parameter selecting which transformation to use.
