Convolution is a technique widely used in image processing and signal processing. There are essentially two operations that can be performed: convolution and cross-correlation. In signal/image processing, convolution is defined as the integral of the product of two functions after one is reversed and shifted:

(f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ
The function g acts as a filter. After being reversed, it slides over the function f along the horizontal axis, and the area of intersection between f and the reversed g is computed. The value of that area at a specific shift is the convolution at that point.
In cross-correlation, the filter g is not reversed: it simply slides over f, and the intersection area at each shift is the cross-correlation at that point. Cross-correlation is also known as the sliding dot product or sliding inner product of two functions.
In deep learning, the filters in a convolution layer are not reversed, so the operation is technically cross-correlation; we call it convolution only by convention. The filter weights are learned during training, so if reversal were the correct operation, the network would simply learn the reversed weights on its own. Hence there is no need to reverse the filter in the first place.
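The distinction between the two operations can be seen in one dimension with NumPy (a minimal sketch; the signal and filter values are arbitrary, chosen only so that the asymmetric filter makes the two results differ):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # the signal
g = np.array([1.0, 0.0, -1.0])           # an asymmetric filter

# Cross-correlation: slide g over f WITHOUT reversing it.
cross_corr = np.correlate(f, g, mode="valid")

# Convolution: reverse g first, then slide. NumPy's convolve performs
# the reversal internally, so the two outputs differ in sign here.
conv = np.convolve(f, g, mode="valid")

print(cross_corr)  # [-2. -2. -2.]
print(conv)        # [2. 2. 2.]
```

For a symmetric filter the two results would be identical, which is another way to see why the distinction does not matter once the weights are learned.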
CONVOLUTION AND FEATURE DETECTORS AS IN DEEP LEARNING
This is the first layer of a CNN that processes the input. The main component of this layer is the filter or kernel, also known as a feature detector. A simple scalar (dot) product takes place in this layer.
The purpose of convolution is to extract useful features from the input. A wide variety of feature detectors can be used, each highlighting and extracting different features. In a CNN, these features are extracted using filters whose weights are learned automatically during training, and all the extracted features are then ‘combined’ to make decisions.
Feature detectors are the first and a very important component of convolution in a CNN. They are matrices containing specific values in a specific order, and are also referred to as kernels or filters.
There is a subtle difference between kernels and filters, though the terms are sometimes used interchangeably, which can create confusion. A “kernel” refers to a 2D array of weights, while a “filter” refers to a 3D structure of multiple kernels stacked together. For a 2D convolution, the filter is the same as the kernel; but for 3D filters, and for most convolutions in deep learning, a filter is a collection of kernels, each unique and emphasizing a different aspect of its input channel.
In 2D convolution there is only one channel in the input data and one in the output feature. A feature detector works in a simple way to highlight the important features of an input image: it translates over the image, and at each position every element of the detector is multiplied by the corresponding element of the image patch it covers; the products are then summed, and the result is written into the output matrix. For example, a 3x3 detector over an 8x8 input image occupies a particular 3x3 patch at each step. In simple terms, the feature detector scans the input image patch by patch, multiplying the overlapping values and adding them up.
The blue grid is the feature detector and the red grid is the input image. The blue grid first operates on the 3x3 patch under its shadow, performing element-wise multiplication and then adding up all the products. This summation gives a single number: the purple output represented in the figure. Another example is depicted below, showing the live working of a feature detector on an input image to generate a feature map.
The resulting matrix is called the feature map and is reduced in size. The feature detector moves over the image in a certain fashion and can use a stride of any value: the stride is the number of columns and rows the feature detector shifts after analyzing one patch of the input image.
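The sliding multiply-and-sum described above can be sketched as a naive loop (the 8x8 image and 3x3 averaging kernel are illustrative assumptions, not values from the text):

```python
import numpy as np

def feature_map(image, kernel, stride=1):
    """Slide a 2D feature detector over an image (cross-correlation).

    At each position, the overlapping patch is multiplied element-wise
    with the kernel and summed, producing one number of the output map.
    """
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(64, dtype=float).reshape(8, 8)  # toy 8x8 "image"
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging detector
fmap = feature_map(image, kernel)
print(fmap.shape)  # (6, 6): an 8x8 input with a 3x3 kernel gives a 6x6 map
```

Note how the output is smaller than the input, exactly as described above.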
TYPES OF FEATURE DETECTORS
Feature detectors come in multiple types and variations. They differ in the values of their matrix elements, and accordingly perform different functions on the input image. Some of the different types of feature detectors are given below.
1. To blur an image
2. To sharpen an image
3. Edge detection filters (Sobel filters)
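The three types listed above correspond to standard, well-known kernel values; here they are written out as matrices (a sketch, using the common textbook choices):

```python
import numpy as np

# Common hand-crafted feature detectors (standard textbook values).
box_blur = np.ones((3, 3)) / 9.0             # averages the neighborhood

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

sobel_x = np.array([[-1, 0, 1],              # horizontal gradient
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

sobel_y = sobel_x.T                          # vertical gradient

# Their element sums hint at their behavior:
print(round(float(box_blur.sum()), 6))  # 1.0 -> preserves average brightness
print(sharpen.sum())                    # 1.0 -> preserves brightness, boosts edges
print(sobel_x.sum())                    # 0.0 -> responds only to intensity changes
```

A kernel summing to 1 leaves overall brightness unchanged, while a zero-sum kernel outputs zero on flat regions and fires only on edges.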
MULTI-CHANNEL AND 3D CONVOLUTION FEATURE DETECTORS
We deal with multiple channels in most image-processing problems we encounter; RGB channels are a typical example. In a CNN there can be many such channels per layer (often hundreds), each describing a different aspect of the previous layer. The RGB case makes multi-channel convolution easier to understand.
Each kernel is applied to one input channel of the previous layer to generate one output channel. This is a kernel-wise process, repeated for every kernel to generate multiple channels. These three channels are then summed together (element-wise addition) to form one single channel (3 x 3 x 1). This channel is the result of convolving the input layer (a 5 x 5 x 3 matrix) with a filter (a 3 x 3 x 3 matrix).
We can think of this as sliding a 3D filter matrix through the input.
The 3rd dimension of the kernel has to match that of the input array. Even with a 3D input array, we get a 2D feature map, due to the following calculation.
Here the kernel depth equals the channel depth of the input image. The 3D filter moves in only 2 directions, the height and width of the image (that is why the operation is called 2D convolution, even though a 3D filter is used).
The result of the above convolution is a 2D output feature map.
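The 5 x 5 x 3 example above can be sketched directly (random values stand in for an actual RGB image; the shapes match the text):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((5, 5, 3))   # H x W x C input, e.g. RGB
filt = rng.standard_normal((3, 3, 3))    # one filter = 3 stacked kernels

# At each spatial position the 3x3x3 filter covers a 3x3x3 volume of the
# input; multiplying and summing over all 27 values fuses the per-channel
# results into one number, so the output is a single 2D feature map.
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        patch = image[i:i+3, j:j+3, :]
        out[i, j] = np.sum(patch * filt)
print(out.shape)  # (3, 3): 2D, even though input and filter are 3D
```

Summing over the channel axis inside the loop is exactly the element-wise addition of the three per-kernel channels described above.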
But what if the kernel depth is smaller than the channel depth of the input image?
In 3D convolution, if the filter depth is smaller than the input layer depth (kernel depth < channel depth), the 3D filter can move in all 3 directions: height, width, and channel of the image. The output numbers are then arranged in 3D space, so the output is itself 3D data.
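A naive sketch of this true 3D convolution, where the kernel also slides along depth (the 6x6x6 volume and 3x3x3 kernel sizes are illustrative assumptions):

```python
import numpy as np

def conv3d_single(volume, kernel):
    """Naive 3D convolution (cross-correlation without reversal):
    the kernel slides along depth as well as height and width,
    so the output is itself a 3D array."""
    vd, vh, vw = volume.shape
    kd, kh, kw = kernel.shape
    od, oh, ow = vd - kd + 1, vh - kh + 1, vw - kw + 1
    out = np.zeros((od, oh, ow))
    for d in range(od):
        for i in range(oh):
            for j in range(ow):
                out[d, i, j] = np.sum(volume[d:d+kd, i:i+kh, j:j+kw] * kernel)
    return out

volume = np.ones((6, 6, 6))   # e.g. volumetric data, depth 6
kernel = np.ones((3, 3, 3))   # kernel depth (3) < input depth (6)
result = conv3d_single(volume, kernel)
print(result.shape)  # (4, 4, 4): a 3D output
```

Because the kernel is shallower than the input, the depth axis also produces multiple output positions, which is what makes the output 3D.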
STRIDES AND PADDING
What are strides?
Stride defines the step size of the kernel as it slides over the image: it controls how many pixel units the filter shifts each time. A stride of 1 means the kernel slides over the image pixel by pixel; a stride of 2 means it moves 2 pixels per step (i.e., skipping 1 pixel). We generally use strides of 2 or more for downsampling an image; the usual, implicit case is a stride of 1.
In the above example, the red square unit is convolved and the output value is written into the feature map. Then, with a stride of one, the green square unit is convolved to produce the next output value in the feature map.
To comprehensively understand the meaning of stride, let’s see what a stride of 2 would look like:
Stride not only controls how many units to the right a filter moves, but also how many units down it moves whenever it shifts to a new row.
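A quick way to see the effect of stride is to list which positions a kernel's top-left corner visits along one axis (the 8x8 image and 3x3 kernel sizes are assumptions for illustration):

```python
def kernel_positions(image_size, kernel_size, stride):
    """Indices along one axis where the kernel's top-left corner lands."""
    return list(range(0, image_size - kernel_size + 1, stride))

print(kernel_positions(8, 3, 1))  # [0, 1, 2, 3, 4, 5] -> every position
print(kernel_positions(8, 3, 2))  # [0, 2, 4] -> every other position
```

Stride 2 halves the number of positions per row (and per column), which is why it downsamples the feature map.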
Suppose we want to apply a filter on an input image but don’t want to change the dimensions of the feature map thus formed. In such cases, we apply a padding of all 0’s.
Padding defines how the border of an image is handled. A padded convolution keeps the spatial output dimensions equal to those of the input image by padding zeros around the input boundaries as necessary. An unpadded convolution, on the other hand, performs convolution only on the pixels of the input image, without adding zeros around the boundaries, so the feature map is smaller in dimensions than the input data.
To calculate the size of the zero padding needed to preserve the dimensions of the input image (with a stride of 1), we can use:

P = (K − 1) / 2

where K is the kernel size.
For an input image of size W, kernel size K, padding P, and stride S, the output image from convolution has size O:

O = (W − K + 2P) / S + 1
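The two formulas above are easy to check in code (the 224-pixel width and 5x5 kernel are illustrative values, not from the text):

```python
def conv_output_size(w, k, p, s):
    """O = (W - K + 2P) / S + 1, per spatial dimension."""
    return (w - k + 2 * p) // s + 1

def same_padding(k):
    """P = (K - 1) / 2 preserves the input size at stride 1
    (for odd kernel sizes)."""
    return (k - 1) // 2

print(same_padding(5))                 # 2: a 5x5 kernel needs 2 pixels of padding
print(conv_output_size(224, 5, 2, 1))  # 224: width preserved at stride 1
print(conv_output_size(224, 5, 2, 2))  # 112: stride 2 halves the width
```

Integer division reflects the fact that the kernel simply stops sliding once it no longer fits.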
SOME FINAL INSIGHT ON CONVOLUTION
Most feature detectors in use are 3x3 matrices, but different networks use detectors of different dimensions. For example, AlexNet, a landmark CNN architecture for image classification, uses 11x11 feature detectors in its first convolutional layer. Each input image has a certain size and specific element values; feature detectors are similar, just smaller in size.
There is a minor drawback to the way convolution takes place.
Some data loss occurs when we create feature maps, since we reduce the size of the image; on the other hand, we are highlighting its features. For example, to identify a certain animal we look at specific features: to identify a rabbit, we would look for long ears and long teeth, and maybe even a carrot. So the data loss is actually compensated by the feature enhancement the feature detector performs on the input image.
This was a brief take on convolution and the working of feature detectors. Convolution is a vast field, and many other exceptional concepts remain to be explored. We did our best to share all that our little knowledge pocket contains.
We really appreciate feedback and doubts so feel free to hit us up. If you liked this post, please subscribe to us so we can notify you every time we write something new!
And as always, Thanks for reading!