Adding layers to a trained CNN to process higher-resolution images. Tried 2 schemes: 1 works fine, 1 fails completely

Question

I'm working with images coming from a sensor, for which 1 pixel corresponds to 2 mm in the real world. I've built and trained a CNN that does semantic segmentation of the image (128x128 pixels) and it works quite well. The objects that need to be recognized have a specific dimension in mm with little variability. So, if the pixel size is changed, the number of pixels taken on average by an object changes. It is not optimal to use a network trained with 2 mm pixels and feed it a patch of an image with 1 mm pixels.
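To make the pixel-size argument concrete, here is a minimal arithmetic sketch. The 20 mm object size is a made-up value for illustration (the real object dimensions are not given above), and `footprint_px` is a hypothetical helper name.

```python
def footprint_px(object_size_mm: float, pixel_size_mm: float) -> float:
    """Pixels spanned along one axis by an object of a given physical size."""
    return object_size_mm / pixel_size_mm


# Hypothetical object size; the post does not state the real one.
OBJECT_MM = 20.0

px_at_2mm = footprint_px(OBJECT_MM, 2.0)  # width in pixels on the old sensor
px_at_1mm = footprint_px(OBJECT_MM, 1.0)  # width in pixels on the new sensor

# In 2-D the object covers (1 mm width / 2 mm width)^2 times as many pixels.
area_ratio = (px_at_1mm / px_at_2mm) ** 2

# Both image sizes cover the same physical field of view along one axis.
fov_2mm = 128 * 2.0  # mm
fov_1mm = 256 * 1.0  # mm

print(px_at_2mm, px_at_1mm, area_ratio, fov_2mm == fov_1mm)
```

So a filter stack tuned to objects about 10 pixels wide would see the same objects about 20 pixels wide at 1 mm, which is the mismatch described above.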
Now we have a new instrument with higher resolution: 1 mm pixel size (256x256 image size). Fewer high-resolution images are available, and training would be more complicated, with many more free parameters. I'd also like to keep using my CNN, which works well. So I've looked into transfer learning. However, of the two schemes I tried, one works fine and one fails completely.
Scheme 1 (fails). Add a couple of convolutional layers at 1 mm, followed by a downsampling (average pooling) layer to 2 mm that "injects" its output directly into the pretrained network (right after what would be the input convolution/activation). Then I freeze all pretrained layers and let the network train. It's so trivial that it should work, but it doesn't. I've even tried changing the loss so that the new layers have to reproduce the output of the first 2 mm convolution: the loss is the mean absolute error between the output of the network with the 1 mm input and the output of the network with the 2 mm input, given the same image downsampled. By definition it should reach 0 loss in a couple of epochs, but it doesn't! The outputs get very similar, but the small differences then explode in later layers, and the final output is nowhere near what it should be: a Dice score of 0 instead of 0.9, and the segmentation is visibly wrong.
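As an illustration of the matching loss described in Scheme 1, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: `frozen_first_layer` and `new_front_end` are hypothetical stand-ins that use a single 1x1 convolution and ReLU, and the image values are made up. It is not the actual network.

```python
def avg_pool_2x2(img):
    """Average each non-overlapping 2x2 block: a 1 mm grid becomes a 2 mm grid."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


def mean_abs_error(a, b):
    """Mean absolute error between two equally sized 2-D feature maps."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def frozen_first_layer(img, weight=0.5, bias=-1.0):
    """Stand-in for the frozen 2 mm input convolution + ReLU (a 1x1 conv here)."""
    return [[max(0.0, weight * v + bias) for v in row] for row in img]


def new_front_end(img, weight, bias):
    """Stand-in for the trainable 1 mm layers: 1x1 conv, 2x2 average pool, ReLU."""
    conv = [[weight * v + bias for v in row] for row in img]
    return [[max(0.0, v) for v in row] for row in avg_pool_2x2(conv)]


# Toy 4x4 "1 mm" image with made-up values.
img_1mm = [
    [1.0, 3.0, 0.0, 2.0],
    [5.0, 7.0, 4.0, 6.0],
    [2.0, 2.0, 8.0, 8.0],
    [2.0, 2.0, 8.0, 0.0],
]
img_2mm = avg_pool_2x2(img_1mm)  # the same image on the 2 mm grid

# Target: what the frozen network computes from the downsampled image.
target = frozen_first_layer(img_2mm)

# Matching loss for two candidate settings of the new front end.
loss_matched = mean_abs_error(new_front_end(img_1mm, 0.5, -1.0), target)
loss_off = mean_abs_error(new_front_end(img_1mm, 0.6, -1.0), target)
print(loss_matched, loss_off)  # zero for the matched weights, positive otherwise
```

In this toy the loss can be exactly zero only because a 1x1 convolution commutes with average pooling and the ReLU is applied after the pooling. That is a property of the stand-ins, not a claim about the real network.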
Scheme 2 (works well). My 1 mm input is fed directly into an average pooling layer and then into the 2 mm network. In parallel, it goes through a couple of convolutional layers. Then the last layer of the 2 mm network (before the final 1x1 convolution) is upsampled to 1 mm and concatenated with the 1 mm layer. This works perfectly without resorting to any special training tricks.
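The data flow of Scheme 2 can be sketched roughly as follows, again in plain Python on nested lists. This is only a wiring sketch under assumptions: nearest-neighbour upsampling is assumed (the upsampling method is not stated above), and `frozen_2mm_net` and `branch_1mm` are hypothetical callables standing in for the pretrained network (up to its last layer before the 1x1 convolution) and the new 1 mm convolutional branch.

```python
def avg_pool_2x2(img):
    """Average each non-overlapping 2x2 block: a 1 mm grid becomes a 2 mm grid."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


def upsample_nearest_2x(fmap):
    """Repeat each value in a 2x2 block: a 2 mm grid becomes a 1 mm grid."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def scheme2_features(img_1mm, frozen_2mm_net, branch_1mm):
    """Return the concatenated channels that would feed the final 1x1 convolution."""
    low_res = frozen_2mm_net(avg_pool_2x2(img_1mm))       # channels on the 2 mm grid
    low_res_up = [upsample_nearest_2x(ch) for ch in low_res]
    high_res = branch_1mm(img_1mm)                        # channels on the 1 mm grid
    return low_res_up + high_res                          # channel-wise concatenation


# Toy usage: both branches are identity stand-ins producing one channel each.
img_1mm = [
    [1.0, 3.0, 0.0, 2.0],
    [5.0, 7.0, 4.0, 6.0],
    [2.0, 2.0, 8.0, 8.0],
    [2.0, 2.0, 8.0, 0.0],
]
features = scheme2_features(img_1mm, lambda img: [img], lambda img: [img])
print(len(features), len(features[0]), len(features[0][0]))
```

Note that in this wiring the frozen network always receives a plain downsampled image, the same kind of input it was trained on, and the new layers only add information on top of its output.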
So, why does Scheme 1 fail? Is there a theoretical issue?
