Why is this not ordinary convolution?

Data Science Asked on June 22, 2021

I am currently studying this paper (page 53), in which they suggest that convolution be done in a special manner.

This is the formula:

\begin{equation} \tag{1}\label{1}
q_{j,m} = \sigma \left(\sum_i \sum_{n=1}^{F} o_{i,n+m-1} \cdot w_{i,j,n} + w_{0,j} \right)
\end{equation}

Here is their explanation:

As shown in Fig. 4.2, all input feature maps (assume $I$ in total), $O_i (i = 1, \cdots, I)$, are mapped into a number of feature maps (assume $J$ in total), $Q_j (j = 1, \cdots, J)$, in the convolution layers based on a number of local filters ($I \times J$ in total), $w_{ij}$ $(i = 1, \cdots, I; j = 1, \cdots, J)$. The mapping can be represented as the well-known convolution operation in signal processing.
Assuming input feature maps are all one dimensional, each unit of one feature map in the convolution layer can be computed as equation $\eqref{1}$ (the equation above).

where $o_{i,m}$ is the $m$-th unit of the $i$-th input feature map $O_i$, $q_{j,m}$ is the $m$-th unit of the $j$-th feature map $Q_j$ of the convolution layer, $w_{i,j,n}$ is the $n$-th element of the weight vector $w_{i,j}$ connecting the $i$-th feature map of the input to the $j$-th feature map of the convolution layer, and $F$ is called the filter size, which is the number of input bands that each unit of the convolution layer receives.
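Read literally, equation $\eqref{1}$ amounts to the following computation. This is a minimal NumPy sketch; the array names and the choice of a logistic function for $\sigma$ are my own assumptions for illustration, not taken from the paper.

```python
import numpy as np

def conv_layer(o, w, w0):
    """o: (I, N) input feature maps, w: (I, J, F) local filters, w0: (J,) biases.
    Returns q: (J, N - F + 1), one unit per filter position, per equation (1)."""
    I, N = o.shape
    _, J, F = w.shape
    q = np.zeros((J, N - F + 1))
    for j in range(J):
        for m in range(N - F + 1):
            # sum over every input map i and every filter tap n (0-indexed here)
            s = sum(o[i, m + n] * w[i, j, n] for i in range(I) for n in range(F))
            q[j, m] = 1.0 / (1.0 + np.exp(-(s + w0[j])))  # sigma = logistic
    return q
```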

So far so good:

What I basically understood from this is what I've tried to illustrate in this image:

[Illustration: a filter covering the first $F$ data points of every input feature map.]

It seems to me that what they are doing is actually processing all data points up to $F$, across all feature maps, i.e. moving in both the x- and y-direction and computing one output point from that.

Isn't that basically 2D convolution on a 2D image of size $I \times F$ with a filter equal to the image size?
The weights don't seem to have any particular importance here, do they?

One Answer

It's analogous to an extremely specific convolution step on a 2D image. That is, it's analogous to an $N \times I$ image with one feature map (e.g. a black and white image, if we're talking about the input), on which you choose to use $J$ filters of size $F \times I$ that span the entire width of the image and only stride along the length, creating $J$ feature maps of size $(N - F + 1) \times 1$. The restrictions here are that:

-The "image" (or current layer you're working on) necessarily only has one feature map

-The filters span the entire width of the image, so they don't stride along that direction, and the resulting feature maps have a width of $1.$

You could then re-interpret the $(N-F+1) \times 1$ resulting feature maps as a single $(N-F+1) \times J$ image, which then creates the exact same restrictions for the next convolutional layer (or more restrictions if you're doing pooling).
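As a sanity check of this reading, equation (1) can be reproduced with an ordinary 2D cross-correlation of the $N \times I$ "image" against $J$ filters of size $F \times I$ in 'valid' mode. This is a sketch under the same assumed shapes as the snippet in the question; `scipy.signal.correlate2d` is just one convenient way to express it.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer_as_2d(o, w, w0):
    """Same assumed shapes as before: o is (I, N), w is (I, J, F), w0 is (J,)."""
    I, N = o.shape
    _, J, F = w.shape
    image = o.T                                  # the N x I single-channel "image"
    q = np.zeros((J, N - F + 1))
    for j in range(J):
        kernel = w[:, j, :].T                    # F x I filter spanning the full width
        s = correlate2d(image, kernel, mode="valid")[:, 0]  # strides along length only
        q[j] = 1.0 / (1.0 + np.exp(-(s + w0[j])))
    return q
```

Each output feature map comes out as an $(N-F+1) \times 1$ column precisely because the filter already spans the full width.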

So yes, the analogy is there, but it's only analogous to a very restricted class of convolution on 2D image arrays, which I don't think is very useful. The kind of application it would be used for is black and white images for which you don't care about translation invariance along one of the dimensions.

The reason CNNs on images are much more flexible than this analogy allows for is that the input object (and each respective hidden layer) of an image CNN is a 3D array, not a 2D array as it is in this case. The third dimension is the feature maps (the RGB values for the input image, for example). Then you can allow the filter to be of any size in both dimensions, and stride along both directions. I would prefer to keep the text interpretation when it comes to 2D inputs of a CNN.
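For contrast, here is a minimal sketch of that fully flexible 2D case: the input is a 3D array (height, width, channels) and each filter strides along both spatial directions. Shapes and names are illustrative only.

```python
import numpy as np

def conv2d(x, w, b):
    """x: (H, W, C) input, w: (J, Fh, Fw, C) filters, b: (J,) biases.
    Returns (J, H - Fh + 1, W - Fw + 1); the filter strides along both directions."""
    H, W, C = x.shape
    J, Fh, Fw, _ = w.shape
    out = np.zeros((J, H - Fh + 1, W - Fw + 1))
    for j in range(J):
        for y in range(H - Fh + 1):
            for z in range(W - Fw + 1):
                patch = x[y:y + Fh, z:z + Fw, :]          # filter window at (y, z)
                out[j, y, z] = np.sum(patch * w[j]) + b[j]
    return out
```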

Answered by Bridgeburners on June 22, 2021
