Do Convolution Layers in a CNN Treat the Previous Layer Outputs as Channels?

Data Science: Asked by Andrew King on May 22, 2021

Let's say you have a max pooling layer that gives 10 downsampled feature maps. Do you stack those feature maps, treat them as channels, and convolve that 'single image' of depth 10 with a 3D kernel of depth 10? That is how I have generally thought about it. Is that correct?

This visualization confused me:
http://scs.ryerson.ca/~aharley/vis/conv/flat.html

In the second convolution layer of the above visualization, most of the feature maps connect to only 3 or 4 of the previous layer's maps. Can anyone help me understand this better?

Related side question: if our input is a color image, our first convolution kernel will be 3D. This means we learn different weights for each color channel (I assume we aren't learning a single 2D kernel that is duplicated across the channels, correct?).

One Answer

Let's say you have a max pooling layer that gives 10 downsampled feature maps. Do you stack those feature maps, treat them as channels, and convolve that 'single image' of depth 10 with a 3D kernel of depth 10? That is how I have generally thought about it. Is that correct?

Yes. The usual convention in a CNN is that each kernel has the same depth as its input, so you can also think of this as a "stack" of 2D kernels, one per input channel, whose per-channel results are summed to make one output channel. Under the convention that $N_\text{in channels} = N_\text{kernel depth}$, the two views are mathematically identical; expressing the operation as a single 3D convolution just allows for simpler notation and code.
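
You can verify the equivalence directly. Here is a minimal NumPy sketch (all names and shapes are illustrative, not from the question) comparing one depth-10 3D convolution against ten per-channel 2D convolutions summed into a single output map:

```python
import numpy as np

def conv2d_valid(x, k):
    """2D cross-correlation of one map with one 2D kernel ('valid' padding)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def conv3d_full_depth(x, k):
    """One 3D kernel whose depth equals the input depth: output is one 2D map."""
    c, kh, kw = k.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[:, i:i+kh, j:j+kw] * k)
    return out

rng = np.random.default_rng(0)
maps = rng.standard_normal((10, 8, 8))    # 10 pooled feature maps, 8x8 each
kernel = rng.standard_normal((10, 3, 3))  # one kernel of depth 10

# One 3D convolution over the stacked maps...
out_3d = conv3d_full_depth(maps, kernel)

# ...equals ten per-channel 2D convolutions summed into one output map.
out_sum = sum(conv2d_valid(maps[c], kernel[c]) for c in range(10))

print(np.allclose(out_3d, out_sum))  # True
```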

In the second convolution layer of the above visualization, most of the feature maps connect to only 3 or 4 of the previous layer's maps. Can anyone help me understand this better?

The diagram is non-standard in that respect, although it seems to show pooling and fully-connected layers as normal. It might be a mistake in the diagram, or an unconventional choice in that specific CNN. Sparse connection tables of this kind do have precedent: for example, the original LeNet-5 connected each C3 feature map to only a subset of the S2 maps, partly to save computation and partly to break symmetry between filters.
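
If you want to experiment with this kind of restricted connectivity in a modern framework, grouped convolution is the nearest built-in analogue. A hedged sketch, assuming PyTorch (the layer sizes are made up, not taken from the visualization):

```python
import torch.nn as nn

# With groups=2, the 16 output maps split into two halves: the first 8
# only ever see input channels 0-4, the last 8 only see channels 5-9.
conv = nn.Conv2d(in_channels=10, out_channels=16, kernel_size=3, groups=2)

# Each kernel therefore has depth 10 / 2 = 5 rather than the full 10:
print(conv.weight.shape)  # torch.Size([16, 5, 3, 3])
```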

If our input is a color image, our first convolution kernel will be 3D. This means we learn different weights for each color channel (I assume we aren't learning a single 2D kernel that is duplicated across the channels, correct?).

Correct. You can see this in the visualised filters for AlexNet (do note that for computational reasons, AlexNet specialised one half of its filters to work in greyscale, and had other clever optimisations that we don't use nowadays because available GPU power is high enough to not need them). Most implementations will also treat a greyscale image as a 1-channel 3D shape for consistency.
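
As a quick sanity check, assuming PyTorch (only the layer sizes are borrowed from AlexNet's first layer, 96 filters of 11x11 at stride 4; any values would make the same point), you can inspect the weight tensor of a first convolution over an RGB input:

```python
import torch.nn as nn

first_conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)

# Each of the 96 filters holds an independent 11x11 weight slice per color channel:
print(first_conv.weight.shape)  # torch.Size([96, 3, 11, 11])

# The per-channel slices are separate parameters, not copies of one 2D kernel:
w = first_conv.weight
print(bool((w[:, 0] == w[:, 1]).all()))  # False under random initialisation
```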

Correct answer by Neil Slater on May 22, 2021
