Data Science Asked on September 19, 2020
So, I have a dataset that is too big to load into memory all at once. Therefore I want to use a generator to load batches of data to train on.
In this scenario, how do I go about scaling the features using LabelEncoder + StandardScaler from scikit-learn?
Some more context:
I have 10 million+ samples of data with 23 features and 1 label column in a database.
My setup used to be (when it was ~3 million samples) to load the data into pandas via SQL, perform some more feature extraction, use LabelEncoder on some features, do a train/test split, and then use StandardScaler on the training features, before finally fitting my Keras model.
However, this workflow is no longer possible on my machines because of the amount of data (MemoryErrors).
I’m looking into using keras.utils.Sequence to load batches of data instead of everything in memory at once; this way I would only need the complete list of indexes, and one full batch, in memory at a time.
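Roughly something like the sketch below, where load_batch is just a placeholder for my own database query and the scaler is assumed to have been fitted beforehand (depending on the setup, the import may be keras.utils.Sequence or tensorflow.keras.utils.Sequence):

```python
import numpy as np
from tensorflow.keras.utils import Sequence  # or keras.utils.Sequence


class DatabaseSequence(Sequence):
    """Yields one preprocessed batch at a time instead of holding all data in memory."""

    def __init__(self, indexes, batch_size, scaler):
        self.indexes = np.asarray(indexes)
        self.batch_size = batch_size
        self.scaler = scaler  # a StandardScaler fitted beforehand on the training data

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.indexes) / self.batch_size))

    def __getitem__(self, idx):
        # Row indexes belonging to this batch.
        batch_ids = self.indexes[idx * self.batch_size:(idx + 1) * self.batch_size]
        # load_batch is a placeholder for querying the database for these rows.
        X, y = load_batch(batch_ids)
        return self.scaler.transform(X), y
```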
However, how would I go about label encoding, and more importantly: how would I go about feature scaling in this scenario? And given the context, is this a correct approach?
It is a correct approach to fit the standardization on your training features only. That way, you ensure that no information from the testing set leaks into the training set.
About feature scaling: if you have too many samples to fit your scaler at once, you can use the partial_fit method of StandardScaler in sklearn. Load your training features sequentially and call partial_fit on each chunk. Once finished, your scaler is ready to be used on your training/testing batches.
About label encoding: either you already have an array containing all the labels, in which case you can fit your LabelEncoder directly, or you have to load all your data sequentially to collect the distinct labels before fitting (LabelEncoder does not have a partial_fit method).
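A possible sketch, again with a placeholder generator (iter_label_chunks()) that yields the raw categorical values chunk by chunk:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# First pass over the data: collect the distinct values of the column to encode.
unique_values = set()
for chunk in iter_label_chunks():  # placeholder: yields 1-D arrays of raw values
    unique_values.update(np.unique(chunk))

# Fit the encoder once on all distinct values.
encoder = LabelEncoder()
encoder.fit(sorted(unique_values))

# Later, inside each batch: encoded = encoder.transform(raw_batch_column)
```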
Answered by etiennedm on September 19, 2020