Data Science Asked on June 14, 2021
This question boils down to "how exactly do convolution layers work?"
Suppose I have an $n \times m$ greyscale image. So the image has one channel.
In the first layer, I apply a $3 \times 3$ convolution with $k_1$ filters and padding. Then I have another convolution layer with $5 \times 5$ convolutions and $k_2$ filters. How many feature maps do I have?
The first layer gets executed. After that, I have $k_1$ feature maps (one for each filter). Each of those has the size $n \times m$. Every single pixel was created by taking $3 \cdot 3 = 9$ pixels from the padded input image.
Then the second layer gets applied. Every single filter gets applied separately to each of the feature maps. This results in $k_2$ feature maps for each of the $k_1$ feature maps. So there are $k_1 \times k_2$ feature maps after the second layer. Every single pixel of each of the new feature maps got created by taking $5 \cdot 5 = 25$ "pixels" of the padded feature map from before.
The system has to learn $k_1 \cdot 3 \cdot 3 + k_2 \cdot 5 \cdot 5$ parameters.
Like before: The first layer gets executed. After that, I have $k_1$ feature maps (one for each filter). Each of those has the size $n \times m$. Every single pixel was created by taking $3 \cdot 3 = 9$ pixels from the padded input image.
Unlike before: Then the second layer gets applied. Every single filter gets applied to the same region, but to all feature maps from before. This results in $k_2$ feature maps in total after the second layer is executed. Every single pixel of each of the new feature maps got created by taking $k_1 \cdot 5 \cdot 5 = 25 \cdot k_1$ "pixels" of the padded feature maps from before.
The system has to learn $k_1 \cdot 3 \cdot 3 + k_2 \cdot 5 \cdot 5$ parameters.
Like above, but instead of having $5 \cdot 5 = 25$ parameters per filter which have to be learned and are simply copied for the other input feature maps, you have $k_1 \cdot 3 \cdot 3 + k_2 \cdot k_1 \cdot 5 \cdot 5$ parameters which have to be learned.
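For concreteness, here is a small Python sketch (added for illustration; $k_1 = 32$ and $k_2 = 64$ are arbitrary example values) of the parameter counts implied by each hypothesis:

```python
# Worked parameter counts for the three hypotheses above,
# with arbitrary example values k1 = 32 and k2 = 64.
k1, k2 = 32, 64

# Type 1: k1*k2 feature maps, one 5x5 weight set per second-layer filter
type_1 = k1 * 3 * 3 + k2 * 5 * 5            # 288 + 1600  = 1888

# Type 2.1: each 5x5 filter is copied across the k1 input feature maps
type_2_1 = k1 * 3 * 3 + k2 * 5 * 5          # same count as type 1

# Type 2.2: each second-layer filter has its own 5x5 weights per input map
type_2_2 = k1 * 3 * 3 + k2 * k1 * 5 * 5     # 288 + 51200 = 51488

print(type_1, type_2_1, type_2_2)
```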
For all answers, please give some evidence (papers, textbooks, documentation of frameworks) that your answer is correct.
Is pooling always applied per feature map, or is it also done over multiple feature maps?
I'm relatively sure that type 1 is correct and that I got something wrong with the GoogLeNet paper. But there are 3D convolutions, too. Let's say you have 1337 feature maps of size $42 \times 314$ and you apply a $3 \times 4 \times 5$ filter. How do you slide the filter over the feature maps? (Left to right, top to bottom, first feature map to last feature map?) Does it matter as long as you do it consistently?
I am not sure about the alternatives described above, but the commonly used methodology is:
Before the application of the non-linearity, each filter output depends linearly on all of the previous layer's feature maps within the patch, so you end up with $k_2$ feature maps after the second layer. The overall number of parameters is $3 \cdot 3 \cdot k_1 + k_1 \cdot 5 \cdot 5 \cdot k_2$.
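A minimal PyTorch sketch (PyTorch is an assumption here; the answer names no framework) that reproduces this count, with bias terms disabled so only the convolution weights are counted:

```python
# Two stacked convolutions as in the question: 1 -> k1 -> k2 channels.
import torch
import torch.nn as nn

k1, k2 = 32, 64  # arbitrary example filter counts

conv1 = nn.Conv2d(in_channels=1,  out_channels=k1, kernel_size=3, padding=1, bias=False)
conv2 = nn.Conv2d(in_channels=k1, out_channels=k2, kernel_size=5, padding=2, bias=False)

x = torch.randn(1, 1, 28, 28)           # one single-channel n x m image
out = conv2(conv1(x))
print(out.shape)                        # torch.Size([1, 64, 28, 28]) -> k2 feature maps

n_params = sum(p.numel() for p in (*conv1.parameters(), *conv2.parameters()))
print(n_params, 3 * 3 * k1 + k1 * 5 * 5 * k2)   # both 51488
```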
Bonus 1: Pooling is done per feature map, separately.
Bonus 2: The order of "sliding" does not matter. In fact, each output is computed based on the previous layer, so the output filter responses do not depend on each other. They can be computed in parallel.
Correct answer by ChristianSzegedy on June 14, 2021
Check this lecture and this visualization
Usually type 2.1 convolution is used. In the input you have an $N \times M \times 1$ image; after the first convolution you obtain $N_1 \times M_1 \times k_1$, so your image after the first convolution has $k_1$ channels. The new dimensions $N_1$ and $M_1$ depend on your stride $S$ and padding $P$: $N_1 = (N - 3 + 2P)/S + 1$, and $M_1$ is computed analogously. For the first conv layer you have $3 \cdot 3 \cdot k_1 + k_1$ parameters; the additional $k_1$ are the biases added before the nonlinear function.
In the second layer the input is an image (array) of size $N_1 \times M_1 \times k_1$, where $k_1$ is the new number of channels. After the second convolution you obtain an $N_2 \times M_2 \times k_2$ image (array). You have $5 \cdot 5 \cdot k_2 \cdot k_1 + k_2$ parameters in the second layer.
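A small sketch of these formulas in plain Python (the concrete sizes $N = M = 28$, $k_1 = 32$, $k_2 = 64$ are hypothetical example values):

```python
# Output-size formula N1 = (N - F + 2P) / S + 1 and the per-layer
# parameter counts (weights + biases) from this answer.
def conv_out(n, f, p, s):
    """Spatial output size of a convolution with filter f, padding p, stride s."""
    return (n - f + 2 * p) // s + 1

N, M = 28, 28
k1, k2 = 32, 64

N1, M1 = conv_out(N, 3, 1, 1), conv_out(M, 3, 1, 1)    # 28, 28 with "same" padding
N2, M2 = conv_out(N1, 5, 2, 1), conv_out(M1, 5, 2, 1)  # 28, 28

params_layer1 = 3 * 3 * 1 * k1 + k1      # 288 weights + 32 biases (1 input channel)
params_layer2 = 5 * 5 * k1 * k2 + k2     # 51200 weights + 64 biases

print((N1, M1, k1), (N2, M2, k2), params_layer1, params_layer2)
```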
For a $1 \times 1$ convolution with $k_3$ filters and an $N \times M \times C$ input ($C$ is the number of input channels) you obtain a new image (array) of size $N \times M \times k_3$, so $1 \times 1$ convolutions make sense. They were introduced in this paper.
Bonus 1: pooling is applied per feature map.
For details please see the slides of the Stanford CNN course - there is a nice visualisation there of how the convolution is summed over several input channels.
Answered by pplonski on June 14, 2021
I have just struggled with this same question for a few hours. Thought I'd share the insight that helped me understand it.
The answer is that the filters for the second convolutional layer do not have the same dimensionality as the filters for the first layer. In general, the filter has to have the same number of dimensions as its inputs. So in the first conv layer, the input has 2 dimensions (because it is an image). Thus the filters also have two dimensions. If there are 20 filters in the first conv layer, then the output of the first conv layer is a stack of 20 2D feature maps. So the output of the first conv layer is 3 dimensional, where the size of the third dimension is equal to the number of filters in the first layer.
Now this 3D stack forms the input to the second conv layer. Since the input to the 2nd layer is 3D, the filters also have to be 3D. Make the size of the second layer's filters in the third dimension equal to the number of feature maps that were the outputs of the first layer.
Now you just convolve over the first 2 dimensions; rows and columns. Thus the convolution of each 2nd layer filter with the stack of feature maps (output of the first layer) yields a single feature map.
The size of the third dimension of the output of the second layer is therefore equal to the number of filters in the second layer.
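As an illustration of this in PyTorch (assuming that framework; the filter counts match the example above), one can inspect the stored weight shapes: each second-layer filter is indeed 3D, and the filters are stacked into a 4D tensor whose first dimension is the number of output feature maps:

```python
# Weight tensors have shape (out_channels, in_channels, kH, kW):
# each filter spans all feature maps produced by the previous layer.
import torch.nn as nn

k1, k2 = 20, 50  # example filter counts
conv1 = nn.Conv2d(1,  k1, kernel_size=3)
conv2 = nn.Conv2d(k1, k2, kernel_size=5)

print(conv1.weight.shape)   # torch.Size([20, 1, 3, 3])
print(conv2.weight.shape)   # torch.Size([50, 20, 5, 5])
```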
Answered by Alex Blenkinsop on June 14, 2021
The first layer consists of $k_1$ kernels with size $3 \cdot 3 \cdot 1$ to give $k_1$ feature maps which are stacked depth-wise.
The second layer consists of $k_2$ kernels with size $5 \cdot 5 \cdot k_1$ to give $k_2$ feature maps which are stacked depth-wise.
That is, the kernels in a convolutional layer span the depth of the output of the previous layer.
A $1 \times 1$ convolutional layer actually has $k_n$ kernels of size $1 \cdot 1 \cdot k_{n-1}$.
Bonus question 2 is not something I'm familiar with, but I will guess the depth parameter in the convolution becomes an extra dimension.
e.g. if the output of a layer is of size $m \cdot n \cdot k_{n}$, a 3D convolution with padding would result in an output of size $m \cdot n \cdot k_{n+1} \cdot k_{n}$.
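For bonus question 2, here is a hedged sketch using PyTorch's nn.Conv3d (an assumption, since the answer above only guesses at this): the 1337 feature maps of size $42 \times 314$ from the question are treated as a single-channel 3D volume, and the $3 \times 4 \times 5$ kernel slides over all three axes at once, so the traversal order does not matter.

```python
# True 3D convolution over a stack of feature maps treated as a volume.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 1337, 42, 314)   # (batch, channels, depth, height, width)
conv3d = nn.Conv3d(in_channels=1, out_channels=1, kernel_size=(3, 4, 5))
print(conv3d(x).shape)                 # torch.Size([1, 1, 1335, 39, 310])
```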
Answered by geometrikal on June 14, 2021
Suppose the input layer has $k_{input}$ channels; then the number of parameters to be learned by the neural network is: $$ 3 \cdot 3 \cdot k_{input} \cdot k_1 + 5 \cdot 5 \cdot k_1 \cdot k_2 $$ This is because each of the input channels is mapped to each of the output channels: there is a separate filter for each pair $(c_{input}, c_{output})$ indexing the channels of the image. One can think of the convolution filter as a tensor of shape $(c_{input}, c_{output}, s_1, \ldots, s_d)$, where $d$ is the spatial dimensionality of the data.
On the one hand, one can put a nonlinearity after a $1 \times 1$ convolution, so that the filter plus activation performs a nonlinear operation, changing the output in some complicated way. Another point, which makes them useful and is the cornerstone of MobileNet (https://arxiv.org/abs/1704.04861), is that the number of operations in an ordinary 2D convolution scales multiplicatively with the filter size and the number of channels:
$$
c_{input} \cdot c_{output} \cdot n_1 \cdot n_2
$$
per output position. Setting $n_1 = n_2 = 1$ (a pointwise convolution) one works with far fewer parameters, combining these with depthwise convolutions. Via $1 \times 1$ convolutions one can also reduce the number of feature maps from $c_1$ to $c_2 < c_1$ in some educated way, where the network itself learns, hopefully, the optimal way to perform the dimensionality reduction.
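A rough sketch (again assuming PyTorch; the channel sizes are arbitrary) of the MobileNet-style factorization referred to above: a standard convolution versus a depthwise convolution followed by a $1 \times 1$ pointwise convolution that mixes and reduces channels:

```python
# Parameter count: standard conv vs. depthwise + pointwise (1x1) conv.
import torch.nn as nn

c_in, c_out, k = 128, 64, 3

standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1, bias=False)

depthwise = nn.Conv2d(c_in, c_in, kernel_size=k, padding=1, groups=c_in, bias=False)
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)   # 1x1 channel mixing

def count(module):
    return sum(p.numel() for p in module.parameters())

print(count(standard))                       # 128 * 64 * 3 * 3 = 73728
print(count(depthwise) + count(pointwise))   # 128 * 3 * 3 + 128 * 64 = 9344
```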
Pooling is applied feature-wise: for each channel one obtains a downsampled image constructed via some aggregation function (max, average) over multiple pixels belonging to the same channel. There is no interaction between different channels (R, G, B, for instance).
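A minimal sketch (assuming PyTorch) showing that pooling acts on each channel independently: the number of channels is preserved and only the spatial dimensions shrink.

```python
# Max pooling downsamples each channel separately; channel count is unchanged.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)      # e.g. an R, G, B image
pool = nn.MaxPool2d(kernel_size=2)
print(pool(x).shape)               # torch.Size([1, 3, 16, 16]) -- still 3 channels
```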
You do not slide over the feature maps; the convolution kernel spans all input feature maps, mapping the $c_{in}$ channels to $c_{out}$ channels.
Answered by spiridon_the_sun_rotator on June 14, 2021