How to handle preprocessing (StandardScaler, LabelEncoder) when using data generator to train?

Question

So, I have a dataset that is too big to load into memory all at once. Therefore I want to use a generator to load batches of data to train on.

In this scenario, how do I go about encoding and scaling the features with LabelEncoder + StandardScaler from scikit-learn?

Some more context:

I have 10 million+ samples of data with 23 features and 1 label column in a database.

My setup used to be (when it was ~3 million samples): load the data into pandas with SQL, perform some more feature extraction, use LabelEncoder on some features, do a train/test split, use StandardScaler on the training features, and then fit my Keras model.

However, this workflow is no longer possible on my machines because of the amount of data (I get MemoryErrors).

I'm looking into using keras.utils.Sequence to load batches of data instead of holding everything in memory at once. This way I would only need the complete list of indexes and one full batch in memory at a time.

However, how would I go about label encoding and, more importantly, feature scaling in this scenario? And given the context, is this a correct approach?

etiennedm · Answer

Yes, fitting the scaler on your training features only is the correct approach. That way, no information from the test set leaks into training.
For feature scaling, if you have too many samples to fit your scaler at once, you can use the partial_fit method of StandardScaler in sklearn. Load your training features batch by batch and call partial_fit on each batch. Once you have gone through all of them, the scaler is ready to transform your training and testing batches.
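A minimal sketch of the incremental fit, assuming the training features can be read in chunks. Here iter_batches is a hypothetical stand-in that generates random data; in practice it would be replaced by whatever reads your batches from the database. Only StandardScaler, partial_fit and transform come from sklearn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def iter_batches(n_samples, n_features, batch_size, seed=0):
    """Hypothetical stand-in for reading training batches from a database."""
    rng = np.random.default_rng(seed)
    for start in range(0, n_samples, batch_size):
        size = min(batch_size, n_samples - start)
        yield rng.normal(loc=5.0, scale=2.0, size=(size, n_features))


# First pass: update the running mean/variance one batch at a time.
scaler = StandardScaler()
for X_batch in iter_batches(10_000, 23, 1_000):
    scaler.partial_fit(X_batch)

# Later passes: the fitted scaler transforms each batch independently.
# The batches are stacked here only to inspect the result on this small
# example; during training you would transform one batch at a time.
X_scaled = np.vstack(
    [scaler.transform(X_batch) for X_batch in iter_batches(10_000, 23, 1_000)]
)
```

The same fitted scaler object can then be passed to the data generator so that each batch is transformed just before it is handed to the model.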
For label encoding, either you already have an array containing all the labels, in which case you can fit your LabelEncoder on it directly, or you have to go through all your data sequentially to collect the distinct labels before fitting (LabelEncoder does not have a partial_fit method).
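One way to do the sequential pass is to accumulate the distinct values in a set and fit the encoder once at the end. This is only a sketch: iter_category_batches is a hypothetical stand-in for reading one categorical column in chunks, and the color values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def iter_category_batches():
    """Hypothetical stand-in for reading one categorical column in batches."""
    yield np.array(["red", "green", "red"])
    yield np.array(["blue", "green"])
    yield np.array(["red", "yellow"])


# Collect the distinct values batch by batch; only the set stays in memory.
seen = set()
for batch in iter_category_batches():
    seen.update(np.unique(batch).tolist())

# Fit once on the full set of distinct values.
encoder = LabelEncoder()
encoder.fit(sorted(seen))

# The fitted encoder can now transform any single batch.
encoded = encoder.transform(np.array(["blue", "green"]))
```

Since the data lives in a database, a distinct-values query on the column might give the same set without scanning the batches in Python, though that depends on your setup.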
