
Understanding how convolutional layers work

Data Science. Asked by Karampistis Dimitrios on February 27, 2021

After working with a CNN using Keras and the MNIST dataset for the well-known handwritten digit recognition problem, I came up with some questions about how convolutional layers work. I can understand what the convolution process is.

[Image: Convolution]

My first question is: What are the filters? I can understand their purpose. They are used to map edges, shapes, etc. on an image. But how are they initialized? Do they have random initial values, or are there standard image filters that are used? If they are initialized with random values, then the values should change during the training process of the network. If that's the case, a new question arises: how does someone backpropagate the filter of the convolutional layer? What is the algorithm behind this process?

Secondly, I have noticed that I can add an activation function to the convolutional layer in Keras. Is the entire output matrix passed through the activation function? How does the usage of an activation function change the learning process of the convolutional layer?

Last but not least, does a convolutional layer have weights and biases like a dense layer? Do we multiply the output matrix after the convolution process with a weight matrix and add some biases before passing it through the activation function? If that's true, then do we follow the same process as we do with the dense layers to train these weights and biases?

4 Answers

What are the filters?

A filter/kernel is a set of learnable weights which are learned using the backpropagation algorithm. You can think of each filter as storing a single template/pattern. When you convolve this filter across the corresponding input, you are basically measuring the similarity between the stored template and different locations in the input.
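
To make that concrete, here is a toy sketch (assuming NumPy and SciPy are available; the kernel and image are hand-made for illustration) where a vertical-edge template responds most strongly at the locations that actually contain an edge:

```python
import numpy as np
from scipy.signal import correlate2d

# A hand-crafted vertical-edge template (a Sobel-like kernel)
k = np.array([[-1., 0., 1.],
              [-2., 0., 2.],
              [-1., 0., 1.]])

# A toy image: dark on the left half, bright on the right half
img = np.concatenate([np.zeros((5, 3)), np.ones((5, 3))], axis=1)

# Slide the template over the image: the response is largest where
# the 3x3 window straddles the dark-to-bright edge
resp = correlate2d(img, k, mode="valid")
print(resp)  # the columns covering the edge carry the peak values
```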

But how are they initialized? Do they have random initial values, or are there standard image filters that are used?

Filters are usually initialized at seemingly arbitrary values, and then you use a gradient descent optimizer to optimize those values so that the filters solve your problem.

There are many different initialization strategies.

  • Sample from a distribution, such as a normal or uniform distribution
  • Set all values to 1 or 0 or another constant
  • There are also some heuristic methods that seem to work very well in practice; a popular one is the so-called Glorot initializer, named after Xavier Glorot, who introduced it here. Glorot initializers also sample from a distribution, but scale the values based on the fan-in and fan-out of the layer (and, in the normal variant, truncate the samples).
  • For specific types of kernels, there are other defaults that seem to perform well. See for example this article.
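
For example, in Keras the choice of strategy is just an argument on the layer (a minimal sketch, assuming TensorFlow/Keras; the layer sizes are arbitrary):

```python
from tensorflow import keras

# kernel_initializer selects the strategy; "glorot_uniform" is the
# Keras default, but any of the options above can be swapped in.
conv = keras.layers.Conv2D(
    filters=32,
    kernel_size=(3, 3),
    kernel_initializer="glorot_uniform",  # or "random_normal", "zeros", "ones", ...
    bias_initializer="zeros",
)
```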

If they are initialized with random values, then the values should change during the training process of the network. If that's the case, a new question arises: how does someone backpropagate the filter of the convolutional layer? What is the algorithm behind this process?

Consider the convolution operation as just a function between the input image and a matrix of random weights. As you optimize the loss function of your model, the weights (and biases) are updated such that they start forming extremely good discriminative spatial features. That is the purpose of backpropagation, which is performed with the optimizer that you defined in your model architecture. Mathematically, there are a few more concepts that go into how backprop happens on a convolution operation (a full convolution with the kernel rotated 180°). If you are interested, check this link.
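
To sketch what that update looks like, here is a toy NumPy example, ignoring strides, padding, channels, and batches. For the cross-correlation that deep-learning frameworks actually implement as "convolution", the kernel gradient needs no 180° rotation; it is itself a cross-correlation:

```python
import numpy as np

def corr2d(x, k):
    """'Valid' 2-D cross-correlation (what DL frameworks call convolution)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.random.randn(6, 6)      # toy input image
k = np.random.randn(3, 3)      # randomly initialized kernel
y = corr2d(x, k)               # forward pass
grad_y = np.ones_like(y)       # stand-in for dLoss/dy from the layers above

# dLoss/dk is the cross-correlation of the input with the upstream gradient
grad_k = corr2d(x, grad_y)     # same shape as k
k -= 0.01 * grad_k             # one plain gradient-descent update
```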

Is the entire output matrix passed through the activation function? How does the usage of an activation function change the learning process of the convolutional layer?

Let's think of activation functions as just non-linear "scaling" functions. Given an input, the job of an activation function is to "squish" the data into a given range (for example, ReLU "squishes" the input into the range [0, ∞) by simply setting every negative value to zero and returning every positive value as is).

Now, in neural networks, activations are applied at the nodes, which apply a linear function over the input features, a weight matrix, and a bias (mx + c). In the case of a CNN, it's the same: the forward pass takes the input image, performs a convolution over it by applying a filter (weight matrix), adds a bias, and then sends the output to an activation function to "squish" it non-linearly before passing it to the next layer.

It's quite simple to understand why activations help. If I have a node that spits out x1 = m0*x0 + b0, and that is then sent to another node which spits out x2 = m1*x1 + b1, the overall forward pass is just x2 = m1*(m0*x0 + b0) + b1, which is the same as x2 = (m1*m0)*x0 + (m1*b0 + b1), or x2 = M*x0 + B. This shows that stacking two linear equations just gives another linear equation, so in reality there was no need for two nodes; I could have used one node with the new M and B values to get the same result x2 from x0.

This is where adding an activation function helps. It allows you to stack neural network layers so that you can explore the non-linear model space properly; otherwise you would be stuck exploring only the y = mx + c model space, because any composition of linear functions is itself linear.
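
A quick NumPy sketch of that collapse, with arbitrary toy shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)

# two stacked linear "nodes", no activation in between
m0, b0 = rng.standard_normal((3, 4)), rng.standard_normal(3)
m1, b1 = rng.standard_normal((2, 3)), rng.standard_normal(2)
x2 = m1 @ (m0 @ x0 + b0) + b1

# they collapse into one equivalent linear node
M, B = m1 @ m0, m1 @ b0 + b1
assert np.allclose(x2, M @ x0 + B)  # same output from a single layer
```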

Does a convolutional layer have weights and biases like a dense layer?

Yes, it does. The bias is added after the weight matrix (filter) is applied to the input image with a convolution operation, conv(inp, filter).

Do we multiply the output matrix after the convolution process with a weight matrix and add some biases before passing it through the activation function?

A dot product is computed between a section of the input image and the filter as it convolves over the larger input image. The output matrix is then added to the bias (via broadcasting) and passed through an activation function to "squish" it.
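
In toy NumPy terms (one window, one filter; the shapes and values are made up), that single step looks like this:

```python
import numpy as np

patch  = np.random.randn(3, 3)   # one 3x3 section of the input image
kernel = np.random.randn(3, 3)   # the filter's weights
bias   = 0.1                     # a single bias per filter

z = np.sum(patch * kernel) + bias   # dot product over the window, plus bias
a = max(0.0, z)                     # ReLU "squish" before the next layer
```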

If that's true, then do we follow the same process as we do with the dense layers to train these weights and biases?

Yes, we follow the exact same process in the forward pass, except that a new operation, convolution, is added to the mix. It changes the dynamics, especially for the backward pass, but in essence the overall intuition remains the same.


The crux of the intuition:

  • Do not confuse a feature and a filter. A filter is what helps you extract features (basic patterns) from the input image, using operations such as dot products, convolution, bias, and activations
  • Each filter lets you extract a 2D map of some simple pattern that exists in the image (such as an edge). If you have 20 filters, then even for a 3-channel image you will get 20 feature maps, stacked as channels in the output.
  • Many such features, each capturing a different simple pattern, are learnt as part of the training process and become the base features for the next layer (which could be another convolutional layer or a dense layer)
  • Combinations of these features allow you to perform your modeling task.
  • The filters are trained by optimizing to minimize a loss function using backprop. The training follows this backward chain of reasoning:
      - How can I minimize my loss?
      - How can I find the best features that minimize the loss?
      - How can I find the best filters that generate the best features? 
      - What are the best weights and biases which give me the best filters?

Here's a good reference image to keep in mind whenever working with CNNs (just to reinforce the intuition):

[Image: CNN reference diagram]

Hope that answers your questions.

Correct answer by Akshay Sehgal on February 27, 2021

In some ways, convolutions do not introduce a radical departure from the standard architecture. Because the operations applied to the filtered input (max, min, mean, etc.) are continuous, these filters amount to a lossy "layer" of the network. You are right to intuit that the filter parameters can be trained, so a filter which transforms a 2x2 square according to [0,1,2,0] and then yields the max in one training period may transform according to [.1,.9,1.7,0] and yield the max in the next. This training can be done using SGD. You can think of the transition matrix as being equivalently expressed as a set of weights and biases, along with a function, although it may not be instantiated as such in every package (I haven't used Keras).

As far as I am aware, however, the filter function itself is not subject to change during training: a "max" layer will not change into a "mean" layer. Consider why this is.

On the matter of activation functions: they just introduce more non-linearity to the result. Additionally, if they are monotone (as is often the case), they should commute with many of the common filter operations, like max, min, and mean. So the layer output could look like Max(ReLU(Filter(Input))) or ReLU(Max(Filter(Input))), with the former case probably being a bit more idiosyncratic.

Answered by Josh Purtell on February 27, 2021

A CNN learns the same way a dense neural network learns, i.e. via a forward pass and backpropagation.
What we learn here are the weights of the filters.

So, answers to your individual questions:

  • But how are they initialized? - With a standard initializer, e.g. glorot_uniform
  • Then the values should change during the training process of the network - Yes
  • How does someone backpropagate the filter of the convolutional layer? What is the algorithm behind this process? - Just like ANN backpropagation, with gradient descent
  • I can add an activation function to the convolutional layer in Keras. Is the entire output matrix passed through the activation function? - Yes; we use ReLU most of the time
  • How does the usage of an activation function change the learning process of the convolutional layer? - For the same reason we use it in an ANN, i.e. non-linearity
  • Does a convolutional layer have weights and biases like a dense layer? - Yes
  • If that's true, then do we follow the same process as we do with the dense layers to train these weights and biases? - Yes, with the added concepts of shared weights/filters/convolution and pooling


I will try to explain some key points of CNNs to clarify the above answers:

  • Each filter does a convolution across the n-D input volume, e.g. 3-D for an RGB image
  • So it does an element-wise multiplication with the pixels; the products are summed and passed through an activation function
  • This becomes a single element of one feature map
  • Each filter creates one feature map.
  • A filter's depth equals the number of incoming channels, e.g. if you used 20 filters on the first RGB image, they create 20 feature maps; if you then use 5x5 filters on that layer, each filter has size 5x5x20.
  • Each filter adds as many weight parameters as its size, e.g. 5x5x20 = 500 weights (plus one bias) in the last example; a quick way to verify this arithmetic is sketched below the image
  • If you want to visualize it like a simple NN, see the image below. All the thetas are multiplied by their inputs and summed, and the result passes through an activation function. Backpropagation happens in the same manner as in a dense neural network

[Image: a convolution visualized as a simple neural network. Image credit: Jeremy Jordan]
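
A quick way to check the parameter arithmetic above (a sketch assuming TensorFlow/Keras; the 28x28x20 input shape is invented):

```python
from tensorflow import keras

# 20 feature maps coming in, 10 filters of size 5x5 applied on top
model = keras.Sequential([
    keras.Input(shape=(28, 28, 20)),
    keras.layers.Conv2D(10, (5, 5)),
])
model.summary()  # reports 10 * (5*5*20 + 1) = 5010 parameters
```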


You can read these references to develop a crisp, black-and-white intuition:
CS231n: Winter 2016
Jeremy Jordan blog
Numpy Implementation

Answered by 10xAI on February 27, 2021

My first question is: What are the filters?

Convention in deep learning is to call the weights used in the convolution either filters or, synonymously, kernels. Other fields make a distinction between these two terms: in my book, the array of weights is the kernel, and the operation that performs a convolution with those weights is the filter. I.e., a filter is a function that takes some input, e.g. an image, and gives you a, well, filtered image. With the understanding that we're talking about convolutional filters, training the kernel is equivalent to training the filter, because the filter is completely defined by the weights in the kernel.

(Mathematically speaking, convolutional filters are the class of linear time-invariant filters with compact support.)

But how are they getting initialized?

There are a myriad of ways; see the other answers.

how does someone backpropagate the filter of the convolutional layer

That's where making the distinction between filters and kernels pays off for understanding. What you're actually doing is passing two arguments to the convolution operation: the kernel and the input. $$ f(k,x) = k \star x $$ The filter is $f$ partially applied to the kernel: $$ f_k = x \mapsto f(k,x) $$ That is what you're eventually interested in; after training, the kernel is fixed, so the filter is only a function of $x$. But you can't really backpropagate the filter (at least in the common frameworks) because it's a function. What you backpropagate is the kernel, and that works the same way as you'd backpropagate any other parameters: you evaluate $f$ together with its derivatives (of both arguments!) for one particular $x$ in the forward pass, and then send through a dual vector in the backwards pass that tells you the gradient contributions in both $x$ and $k$. The one in $x$ you pass back further to the preceding layer; the one in $k$ you use for the learning update.

Secondly, I have noticed that I can add an activation function to the convolutional layer in Keras

Yes, but the activation isn't really part of the convolution operation. It's best understood as a separate layer, but because it doesn't have any parameters, and because CNNs typically contain a ReLU after each and every convolution, Keras has a shortcut for this. $$ g(k,x) = \operatorname{ReLU}(f_k(x)) $$ $$ g_k = \bigl(x \mapsto \operatorname{ReLU}(f_k(x))\bigr) = \operatorname{ReLU} \circ f_k $$ To backpropagate this, you first pull the backwards pass through the activation's derivative before getting to the convolution.
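
In NumPy terms, pulling the backwards pass through the activation's derivative amounts to one element-wise multiplication (a toy sketch; the shapes are arbitrary):

```python
import numpy as np

z = np.random.randn(4, 4)    # pre-activation conv output, f_k(x)
upstream = np.ones((4, 4))   # gradient arriving from the layer above

# ReLU's derivative is 1 where z > 0 and 0 elsewhere; applying it
# element-wise yields the gradient that reaches the convolution
downstream = upstream * (z > 0)
```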

Last but not least, does a convolutional layer have weight and biases like a dense layer?

Yes, the weights are in the kernel and typically you'll add biases too, which works in exactly the same way as it would for a fully-connected architecture.

One thing that is important in convolutional architectures, and often not explained very well, is that one such layer isn't actually just a single convolutional filter but a whole “bank” of such filters, each with its own kernel. Thus for every input you get a whole family of outputs, which are called channels: $$ f_i(\mathbf{k},x) = f(k_i,x) $$ Then you pass all of those to the next layer, i.e. the layers also need to accept multiple channels, which is again handled by having multiple kernels, i.e. you have a whole matrix of kernels. A mathematical way of looking at this is that the signal flow contains not vectors in the space $I$ of images, but vectors in a tensor-product space $\mathbb{R}^m \otimes I$, where $m$ is the number of channels. And on the “channel space”, you're actually performing a fully-connected operation. (When people talk about $1\times1$ convolutional layers, it means they're not actually performing a meaningful convolution at all, but just applying a fully-connected matrix between the channels.)
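
A small NumPy sketch (with made-up channel counts) of that last point, that a $1\times1$ convolution is just a fully-connected map across channels:

```python
import numpy as np

m_in, m_out = 8, 16               # input / output channel counts
x = np.random.randn(5, 5, m_in)   # one "image" with m_in channels
w = np.random.randn(m_in, m_out)  # bank of 1x1 kernels, i.e. a plain matrix

# A 1x1 "convolution" applies the same fully-connected map at every pixel
y = x @ w                                  # shape (5, 5, m_out)
assert np.allclose(y[2, 3], x[2, 3] @ w)   # each pixel mapped independently
```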

Answered by leftaroundabout on February 27, 2021
