
How are batch gradients computed on embedding layers?

Data Science, asked by Absox on May 31, 2021

Consider the following model, which is more or less a 12-dimensional vector lookup table with 10 rows, initialized to all zeros.

import numpy
import tensorflow
from tensorflow import keras

# A 10-row lookup table of 12-dimensional vectors, initialized to all zeros.
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=10, output_dim=12,
                                 embeddings_initializer=keras.initializers.zeros))
model.compile(optimizer=keras.optimizers.SGD(), loss=keras.losses.MeanSquaredError())

I simply want to fit it to the following data:

# x: 10,000 zeros followed by 10,000 ones.
x = numpy.append(numpy.zeros(10000), numpy.ones(10000))
# y: 10,000 samples from N(0, I) followed by 10,000 samples from N(2, I), both 12-dimensional.
y = numpy.append(numpy.random.multivariate_normal(numpy.zeros(12), numpy.diag(numpy.ones(12)), 10000),
                 numpy.random.multivariate_normal(numpy.ones(12)*2, numpy.diag(numpy.ones(12)), 10000), axis=0)
model.fit(x, y, epochs=1, batch_size=1)

When the batch size is 1, the model behaves predictably: using stochastic gradient descent, the weights of the embedding layer are trained towards the means of the two conditional distributions, i.e. model(0) tends towards [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] and model(1) tends towards [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2].

However, for batch sizes larger than 1, the weights of the model tend toward the overall average of y, regardless of the value of x, i.e. both model(0) and model(1) tend toward [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1].
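For reference, here is a minimal sketch of how the per-batch gradient on the embedding table can be inspected directly with tf.GradientTape; the four-sample batch is made up purely for illustration, and the model is rebuilt the same way as above:

import numpy
import tensorflow
from tensorflow import keras

# Rebuild the same embedding model as above.
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=10, output_dim=12,
                                 embeddings_initializer=keras.initializers.zeros))
_ = model(numpy.array([0]))                        # build once so the embedding weights exist

# Illustrative mini-batch mixing both index values (not the data from above).
x_batch = numpy.array([0, 0, 1, 1])
y_batch = numpy.concatenate([numpy.zeros((2, 12)), 2 * numpy.ones((2, 12))]).astype("float32")

loss_fn = keras.losses.MeanSquaredError()
with tensorflow.GradientTape() as tape:
    pred = model(x_batch)                          # shape (4, 12), one row per lookup
    loss = loss_fn(y_batch, pred)

# The gradient w.r.t. the (10, 12) embedding table typically comes back as sparse
# IndexedSlices (one slice per lookup); densify it to see the per-row updates.
grad = tape.gradient(loss, model.trainable_variables)[0]
print(tensorflow.convert_to_tensor(grad).numpy())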

Why does this happen? If we just want a lookup table, we can alternatively implement it using a dense layer with no bias term; this simply requires one-hot encoding the input.

# Same lookup table, implemented as a bias-free dense layer on one-hot inputs.
model = keras.models.Sequential()
model.add(keras.layers.Dense(12, input_shape=(10,), use_bias=False,
                             kernel_initializer=keras.initializers.zeros))
model.compile(optimizer=keras.optimizers.SGD(), loss=keras.losses.MeanSquaredError())

# tf.one_hot expects integer indices, so cast the 0/1 floats before encoding.
x = tensorflow.one_hot(numpy.append(numpy.zeros(10000), numpy.ones(10000)).astype("int32"), 10)
y = numpy.append(numpy.random.multivariate_normal(numpy.zeros(12), numpy.diag(numpy.ones(12)), 10000),
                 numpy.random.multivariate_normal(numpy.ones(12)*2, numpy.diag(numpy.ones(12)), 10000), axis=0)
model.fit(x, y, epochs=16, batch_size=128)

This behaves as I would expect, even when training with mini-batches. So why doesn't it work correctly with the embedding layer?
