Data Science Asked by Tuukka Nieminen on October 10, 2020
I am using Keras with the TensorFlow backend to train a simple 1D CNN that detects specific events from sensor data. The data, tens of millions of samples, easily fits into RAM as a 1D float array, but it obviously takes a huge amount of memory to store it as an N x inputDim array that can be passed to model.fit for training. I can use model.fit_generator or model.train_on_batch to generate the required mini-batches on the fly, but for some reason I am observing a huge performance gap between model.fit and both model.fit_generator and model.train_on_batch, even though everything is stored in memory and mini-batch generation is fast, since it basically only consists of reshaping the data. So I am wondering whether I am doing something terribly wrong or whether this kind of performance gap is to be expected. I am using the CPU version of TensorFlow 2.0 on a 3.2 GHz Intel Core i7 (4 cores with hyper-threading) with Python 3.6.3 on macOS Mojave.
In short, I created a dummy Python script to reproduce the issue. With a batch size of 64, it takes 407 seconds to run 10 epochs with model.fit, 1852 seconds with model.fit_generator, and 1985 seconds with model.train_on_batch. CPU loads are ~220%, ~130%, and ~120%, respectively. It seems especially odd that model.fit_generator and model.train_on_batch are practically on par, since model.fit_generator should be able to parallelise mini-batch creation while model.train_on_batch definitely does not. That is, model.fit (with its huge memory requirements) beats the other candidates, whose memory requirements are easily manageable, by a factor of four. Increasing the batch size obviously raises CPU load and lowers total training time, but model.fit stays fastest by a margin of at least two all the way up to a batch size of 8096. At that batch size, model.fit takes 99 seconds to run 10 epochs at a CPU load of ~860% (pretty much everything I have), model.fit_generator takes 179 seconds at ~700%, and model.train_on_batch takes 198 seconds at ~680%.
Is this kind of behaviour normal (when there is no GPU involved), or what could/should be done to improve the computational performance of the less memory-intensive options at sensible batch sizes? model.fit_generator in particular fails to provide decent performance. There also seems to be no option to divide the data into manageable pieces and then run model.fit iteratively with constantly changing training data.
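For illustration, the kind of on-the-fly windowing I have in mind could also be expressed with the tf.data API instead of a Sequence. The sketch below is only a rough, untimed outline and is not part of the benchmark; it assumes the inputData, outputData, inputDim and batch size defined in the dummy script further down:
```
# Rough sketch only: stream (inputDim, 1) windows with tf.data instead of a Sequence.
# Assumes inputData (1D float array), outputData (N x 2 one-hot array), inputDim and
# a batch size as defined in the dummy script below.
import numpy as np
import tensorflow as tf

def make_window_dataset(inputData, outputData, inputDim, batchSize):
    N       = inputData.size
    data    = tf.constant(inputData, dtype=tf.float32)
    labels  = tf.constant(outputData, dtype=tf.float32)
    offsets = tf.range(inputDim, dtype=tf.int64)

    def make_window(i):
        idx = tf.math.floormod(i + offsets, N)               # wrap around the end of the signal
        x   = tf.expand_dims(tf.gather(data, idx), axis=-1)  # shape (inputDim, 1)
        y   = tf.gather(labels, i)                           # one-hot target for this window
        return x, y

    return (tf.data.Dataset.range(N)                         # one element per window start index
            .shuffle(N)                                      # reshuffled every epoch
            .map(make_window, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .batch(batchSize)
            .prefetch(tf.data.experimental.AUTOTUNE))

# dataset = make_window_dataset(inputData, outputData, inputDim, opts.batchSize)
# model.fit(dataset, epochs=opts.epochCount)
```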
Please note that the provided dummy script is just what the name suggests, and the amount of data has been trimmed so that all three options remain feasible. The model, however, is similar to the one I am actually using, to keep the situation realistic.
```
from tqdm import tqdm
import numpy as np
import tensorflow as tf
import time
import sys
import argparse

inputData    = None
outputData   = None
batchIndices = None
opts         = None


class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    global inputData
    global outputData
    global batchIndices

    def __init__(self, batchSize, shuffle):
        'Initialization'
        self.batchIndices = batchIndices
        self.batchSize    = batchSize
        self.shuffle      = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(inputData.size / self.batchSize))

    def __getitem__(self, index):
        'Generate one batch of data'
        X, y = self.__data_generation(self.indexes[index*self.batchSize:(index+1)*self.batchSize])
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(inputData.size)
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, INDX):
        'Generates data containing batch_size samples'
        X = np.expand_dims(inputData[np.mod(batchIndices + np.reshape(INDX, (INDX.size, 1)), inputData.size)], axis=2)
        y = outputData[INDX, :]
        return X, y


def main():
    global inputData
    global outputData
    global batchIndices
    global opts

    # Data generation
    print(' ')
    print('Generating data...')

    np.random.seed(0)         # For reproducible results

    inputDim  = int(104)      # Input dimension
    outputDim = int(2)        # Output dimension
    N         = int(1049344)  # Total number of samples
    M         = int(5e4)      # Number of anomalies

    trainINDX = np.arange(N, dtype=np.uint32)
    inputData = np.sin(trainINDX) + np.random.normal(loc=0.0, scale=0.20, size=N)  # Source data stored in a single array

    anomalyLocations = np.random.choice(N, M, replace=False)
    inputData[anomalyLocations] += 0.5

    outputData = np.zeros((N, outputDim))  # One-hot encoded target array without ones

    for i in range(N):
        if np.any(np.logical_and(anomalyLocations >= i, anomalyLocations < np.mod(i + inputDim, N))):
            outputData[i, 1] = 1  # set class #2 to one if there is at least a single anomaly within range [i, i+inputDim)
        else:
            outputData[i, 0] = 1  # set class #1 to one if there are no anomalies within range [i, i+inputDim)

    print('...completed')
    print(' ')

    # Create a model for anomaly detection
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters=24, kernel_size=9, strides=1, padding='valid', dilation_rate=1,
                               activation='relu', use_bias=True, kernel_initializer='glorot_uniform',
                               bias_initializer='zeros', input_shape=(inputDim, 1)),
        tf.keras.layers.MaxPooling1D(pool_size=4, strides=None, padding='valid'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(20, activation='relu', use_bias=True),
        tf.keras.layers.Dense(outputDim, activation='softmax')
    ])

    model.compile(tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=[tf.keras.metrics.CategoricalAccuracy()])

    print(' ')

    relativeIndices = np.arange(inputDim)                            # Indices belonging to a single sample relative to current position
    batchIndices    = np.tile(relativeIndices, (opts.batchSize, 1))  # Relative indices tiled into an array of size (batchSize, inputDim)
    stepsPerEpoch   = int(np.floor(N / opts.batchSize))              # Steps per epoch

    # Create an instance of the DataGenerator class
    generator = DataGenerator(batchSize=opts.batchSize, shuffle=True)

    # Solve by gathering data into a large float32 array of size (N, inputDim) and feeding it to model.fit
    startTime = time.time()

    X = np.expand_dims(inputData[np.mod(np.tile(relativeIndices, (N, 1)) + np.reshape(trainINDX, (N, 1)), N)], axis=2)
    y = outputData[trainINDX, :]

    history = model.fit(x=X, y=y, sample_weight=None, batch_size=opts.batchSize, verbose=1, callbacks=None,
                        validation_split=None, shuffle=True, epochs=opts.epochCount)

    referenceTime = time.time() - startTime

    print(' ')
    print('Total solution time with model.fit: %6.3f seconds' % referenceTime)
    print(' ')

    # Solve with the generator (in TF 2.x, model.fit accepts a keras.utils.Sequence directly, replacing model.fit_generator)
    startTime = time.time()

    history = model.fit(x=generator, steps_per_epoch=stepsPerEpoch, verbose=1, callbacks=None,
                        epochs=opts.epochCount, max_queue_size=1024, use_multiprocessing=False)

    generatorTime = time.time() - startTime

    print(' ')
    print('Total solution time with model.fit_generator: %6.3f seconds (%6.2f %% more)' % (generatorTime, 100.0 * generatorTime / referenceTime))
    print(' ')

    # Solve by gathering data into batches of size (batchSize, inputDim) and feeding them to model.train_on_batch
    startTime = time.time()

    for epoch in range(opts.epochCount):

        print(' ')
        print('Training epoch # %2d ...' % (epoch + 1))
        print(' ')

        np.random.shuffle(trainINDX)

        epochStartTime = time.time()

        for step in tqdm(range(stepsPerEpoch)):
            INDX = trainINDX[step * opts.batchSize: (step + 1) * opts.batchSize]
            X = np.expand_dims(inputData[np.mod(batchIndices + np.reshape(INDX, (opts.batchSize, 1)), N)], axis=2)
            y = outputData[INDX, :]

            history = model.train_on_batch(x=X, y=y, sample_weight=None, class_weight=None, reset_metrics=False)

        print(' ')
        print('...completed with loss = %9.6e, accuracy = %6.2f %%, %6.2f ms/step' % (history[0], 100.0 * history[1],
              (1000 * (time.time() - epochStartTime) / np.floor(trainINDX.size / opts.batchSize))))
        print(' ')

    batchTime = time.time() - startTime

    print(' ')
    print('Total solution time with model.train_on_batch: %6.3f seconds (%6.2f %% more)' % (batchTime, 100.0 * batchTime / referenceTime))
    print(' ')


parser = argparse.ArgumentParser()
parser.add_argument('--batchSize', type=int,
                    default=128,
                    help='Batch size')
parser.add_argument('--epochCount', type=int,
                    default=5,
                    help='Epoch count')

opts, unparsed = parser.parse_known_args()

if __name__ == "__main__":
    main()
```
To answer my own question: I recently updated to Python 3.7.7 and TensorFlow 2.2.0-rc2, and suddenly all my issues vanished. Running 5 epochs with the default batch size of 128, model.fit with explicitly formed NumPy arrays now takes 126.162 seconds, model.fit with the provided generator takes 149.053 seconds, and model.train_on_batch takes 240.698 seconds. This is with the default build of TensorFlow, i.e. without support for the AVX2 and FMA instructions that my CPU provides.
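If someone runs into the same gap on an older setup, one thing that might also be worth trying (I have not benchmarked it myself) is letting model.fit pull batches from the Sequence on several workers; model.fit in TF 2.x exposes workers and use_multiprocessing arguments for generator input:
```
# Untested suggestion: let model.fit fill the batch queue from the Sequence with several workers.
# generator, stepsPerEpoch and opts refer to the objects defined in the script above.
history = model.fit(x=generator,
                    steps_per_epoch=stepsPerEpoch,
                    epochs=opts.epochCount,
                    max_queue_size=1024,
                    workers=4,                  # number of threads filling the batch queue
                    use_multiprocessing=False)  # set True to use processes instead of threads
```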
Answered by Tuukka Nieminen on October 10, 2020