
What happens with activations?

Asked on Data Science Stack Exchange, February 22, 2021

I am experimenting with a convolutional network, something between AlexNet and ResNet. It is not very deep: about 10 conv layers (including 2 inside a residual connection) and 3 fully-connected layers at the end. I use ReLU activations, but have also briefly tried ELU and sigmoid. I achieved good results with 25 classes (up to 80% top-1 accuracy) and am now trying ImageNet with 1000 classes. What I get is training that often gets stuck near random-guessing quality (i.e. about 0.1% accuracy on 1000 classes, with the corresponding loss), or that improves to around 0.5% accuracy and then drops back.
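For reference, here is a minimal sketch of the kind of topology described above (a few conv layers, one residual connection, a small fully-connected head), written with tf.keras; the filter counts, widths and input size are illustrative placeholders, not the exact network:

from tensorflow.keras import layers, models

def build_sketch(num_classes=1000, input_shape=(224, 224, 3)):
    # stem: strided conv + pooling, roughly AlexNet-style
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, 7, strides=2, padding='same', activation='relu')(inp)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(3, strides=2, padding='same')(x)

    # one residual block: two conv layers whose output is added to the shortcut
    shortcut = x
    y = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(64, 3, padding='same')(y)
    x = layers.Activation('relu')(layers.Add()([shortcut, y]))

    # a few plain strided conv layers (widths are placeholders)
    for filters in (128, 256, 256, 512):
        x = layers.Conv2D(filters, 3, strides=2, padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)

    # fully-connected head
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dense(512, activation='relu')(x)
    out = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inp, out)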

This stalling happens with the learning rate of 0.001 usually recommended for Adam, with a somewhat smaller 2e-4, and also with much smaller learning rates such as 1e-5. Strangely, with the small learning rates things look even worse, instead of the slow but reliable learning I expected.

What I have found is that, starting from the middle layers, activations shrink considerably. Different channels look like this:

[Image: activation maps of several channels; the base input image is shown in a second image.]

If we plot the channels' min-max range over the epochs, we see the following:

[Image: per-channel activation min/max over epochs, for two different images; the dotted line, on the right axis, is the standard deviation of the activations.]
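(For anyone who wants to collect the same statistics, here is a rough sketch of how the per-channel min/max and standard deviation can be gathered with an intermediate-layer model; tf.keras is assumed, and layer_name and images are placeholders to fill in.)

import numpy as np
from tensorflow.keras import models

def channel_stats(model, layer_name, images):
    # Per-channel min, max and std of activations for a fixed batch of images.
    # Pick a middle conv layer and reuse the same batch each (mini-)epoch so
    # the numbers stay comparable over time.
    probe = models.Model(model.input, model.get_layer(layer_name).output)
    acts = probe.predict(images)              # shape (N, H, W, C), channels_last
    acts = acts.reshape(-1, acts.shape[-1])   # flatten batch and spatial dims
    return acts.min(axis=0), acts.max(axis=0), acts.std(axis=0)

Calling this once per mini-epoch (for example from a callback) yields the three curves plotted above.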
The corresponding improvement and later worsening of the training loss is faintly visible here (blue line):

[Image: training loss over mini-epochs (blue line).]

What is this? Vanishing gradients? Is the network too deep? Shouldn't batch normalization counteract this? What is the solution: to adjust the learning rate continuously?
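(One way to tell vanishing gradients apart from other failure modes is to log per-layer gradient norms on a fixed batch. A minimal sketch, assuming tf.keras with eager execution; the sparse-categorical loss is an assumption, swap in whatever loss the model is actually trained with.)

import tensorflow as tf

def gradient_norms(model, x_batch, y_batch,
                   loss_fn=tf.keras.losses.SparseCategoricalCrossentropy()):
    # L2 norm of the gradient for every trainable weight tensor, on one batch
    with tf.GradientTape() as tape:
        preds = model(x_batch, training=True)
        loss = loss_fn(y_batch, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    return {v.name: float(tf.norm(g))
            for v, g in zip(model.trainable_variables, grads)
            if g is not None}

If the norms in the early layers come out orders of magnitude smaller than in the late layers, vanishing gradients are a plausible culprit. As for adjusting the learning rate continuously, the usual low-effort option in Keras is a ReduceLROnPlateau callback rather than manual re-tuning.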

And everything happens quite fast. I use mini-epochs of 20,000 images each, so 1 on my plots is actually about 1/60 of a full epoch.

I was able to train a 4-times-narrower variant of the network on 110 classes and got 42% accuracy, which looks reasonable; there I don't see such big drops. I didn't notice any improvement from SGD or from different batch sizes, but I did notice that without augmentation learning is faster and less prone to this problem.

Update: I was able to reproduce this on a smaller net 🙂. It learns well with learning rate 5e-4 and batch size 64, then with 2e-4, but then barely learns with learning rate 1e-4 and a batch size of 384 or larger, or quickly degrades with a learning rate of 5e-5 or less.
I didn't mention one of the tricks I use: squeeze-and-excitation. It can apparently lead to massive changes in activations, and indeed there are excitation layers with activations on the order of 10-50 even while everything still looks fine.
Now I am trying to fight this, so far without success for some reason. I added

from keras import backend as K      # needed for K.mean in the regularizer
from keras.layers import Dense
import keras

def near1Regularizer(activationVector):                  # <---- penalize gate outputs far from 1
    return 0.01 * K.mean((1 - activationVector) ** 2)

…

# squeeze-and-excitation: bottleneck Dense + sigmoid gate that rescales the channels
se_feature = Dense(channel // ratio,
                   activation='relu',
                   name='dense_exc_%d' % (g_excLayerCount * 2 - 1),
                   # kernel_initializer='he_normal',
                   kernel_initializer=My1PlusInitializer(1.0 / 256),
                   use_bias=True,
                   bias_initializer='zeros')(se_feature)
assert se_feature._keras_shape[1:] == (1, 1, channel // ratio)
se_feature = Dense(channel,
                   activation='sigmoid',
                   name='dense_exc_%d' % (g_excLayerCount * 2),
                   kernel_initializer='he_normal',
                   use_bias=True,
                   bias_initializer='zeros',
                   activity_regularizer=near1Regularizer)(se_feature)    # <---- regularizer attached here
se_feature = keras.layers.multiply([input_feature, se_feature])          # channel-wise rescaling of the input

and I am getting

[Image: channel min/max plot after adding the regularizer.]

The regularization kicked in at around x = 90 on the plot. The error also started to increase after a dozen mini-epochs…

Update 2: reproduced this on a simpler, smaller net (3 convolution blocks with squeeze-and-excitation plus 3 dense blocks) and on a small AlexNet (1/16 of the width, 110 classes). Very surprisingly, convolution-layer activations on the order of hundreds are quite typical; they exist even in the pretrained full AlexNet.
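(To sanity-check the "activations in the hundreds" observation on a pretrained network, here is a quick sketch; VGG16 from keras.applications is used as a stand-in because Keras does not ship a pretrained AlexNet, and the random input is only a placeholder for a real preprocessed image.)

import numpy as np
from tensorflow.keras import models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

model = VGG16(weights='imagenet')                       # stand-in for AlexNet
conv_layers = [l for l in model.layers if 'conv' in l.name]
probe = models.Model(model.input, [l.output for l in conv_layers])

x = preprocess_input(np.random.uniform(0, 255, (1, 224, 224, 3)))  # placeholder input
for layer, act in zip(conv_layers, probe.predict(x)):
    print(layer.name, float(np.abs(act).max()))         # max |activation| per conv layer

With a real image in place of the random input, this gives a quick reading of how large the convolution activations typically get.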
