There are multiple convolutional filters available for us to use in Convolutional Neural Networks (CNNs) to extract features from images. In this blog, I will explain how these different convolution operations work in depth and illustrate some design techniques for different filters. To learn more about CNNs and their drawbacks, you can read my previous article.
CNNs have brought about huge changes in computer vision and other image-related tasks. Although the algorithm is old, it did not get much attention until 2012, largely due to a lack of data and computational resources. When a CNN (AlexNet) won the ImageNet competition in 2012, it improved prediction accuracy by more than 15% over the 2011 winner. Thus the convolution boom began.
Research and advancements in CNN algorithms and architectures have turned CNNs into a big hammer capable of nailing down almost any problem related to computer vision. They have displayed better accuracy than humans in classifying images.
But rather than talking about how CNNs work, in this article we will focus on the heart of the CNN algorithm and what makes them so powerful: the convolution operation.
WHAT IS A CONVOLUTION?
A convolution is an operation that transforms one function into another. We perform convolutions to transform the original function into a form from which we can extract more information.
Convolutions have long been used in image processing to blur and sharpen images, and to perform other operations such as edge enhancement and embossing.
Here, the original image is the one on the left and the matrix of numbers in the middle is the convolutional matrix or filter.
A convolution operation is an element-wise multiplication followed by a sum, where one matrix is a patch of the image and the other is the filter or kernel that turns the image into something else. The output of applying this across the whole image is the final convolved image.
If the image is larger than the filter, we slide the filter across the image and perform the convolution operation at each position. Each time we do that, we generate a new pixel in the output image.
The number of pixels by which we slide the kernel is known as the stride. The stride is usually kept at 1, but we can increase it. When we do, we may need to enlarge the image by a few pixels so the kernel still fits at the edges. This extra border is called padding.
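To make this concrete, here is a minimal NumPy sketch of a single-channel convolution with stride and zero padding (strictly speaking it computes cross-correlation, which is what CNN libraries implement; the function and variable names are only for illustration):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image`, multiplying element-wise and summing."""
    if padding > 0:
        image = np.pad(image, padding)               # zero-pad the borders
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1                     # output height
    ow = (iw - kw) // stride + 1                     # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)       # one output pixel
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                       # simple averaging kernel
print(conv2d(image, kernel).shape)                   # (3, 3): no padding shrinks the output
print(conv2d(image, kernel, padding=1).shape)        # (5, 5): padding keeps the size
print(conv2d(image, kernel, stride=2).shape)         # (2, 2): a larger stride downsamples
```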
I’ll talk more about how this can help us get more information from an image in a later section.
CONVOLUTIONAL FILTERS IN MACHINE LEARNING
Convolutions aren’t a new concept. They have been used in image and signal processing for a long time. However, convolutions in machine learning are different from those in image processing.
In image processing, there is a fixed set of filters used to perform specific tasks. For example, a filter used to blur images may look like this:
Whereas a filter that does the opposite, sharpening an image, looks like this:
Other filters, like Sobel filters, can perform edge detection and other operations.
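For reference, these classic hand-designed kernels can be written out directly as NumPy arrays (a box blur, a common sharpening kernel, and a horizontal Sobel filter); each can be applied with the conv2d() sketch above:

```python
import numpy as np

blur = np.ones((3, 3)) / 9.0                  # box blur: average of the neighbourhood

sharpen = np.array([[ 0, -1,  0],             # sharpen: boost the centre pixel
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

sobel_x = np.array([[-1, 0, 1],               # Sobel: responds to horizontal intensity changes
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# e.g. edges = conv2d(image, sobel_x, padding=1)
```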
In CNNs, filters are not predefined. The value of each filter is learned during the training process.
By learning the values of its filters, a CNN can extract information from images that humans and hand-designed filters might not be able to find.
More often than not, we see the filters in a convolutional layer learn to detect abstract concepts, like the boundary of a face or the shoulders of a person. By stacking layers of convolutions on top of each other, we can get more abstract and in-depth information from a CNN.
A second layer of convolution might be able to detect the shapes of eyes or the edges of a shoulder, and so on. This also allows CNNs to perform hierarchical feature learning, which is how our brains are thought to identify objects.
In the image, we can see how the different filters in each CNN layer interpret the number 0.
It is this ability of CNNs to be able to detect abstract and complex features that makes them so attractive in image recognition problems.
Depending on the kind of problem we are solving and the types of features we are trying to learn, we use different kinds of convolutions.
THE 2D CONVOLUTION LAYER
The most common type of convolution is the 2D convolution layer, usually abbreviated as conv2D. A filter or kernel in a conv2D layer has a height and a width. Filters are generally smaller than the input image, so we move them across the whole image. The area of the image the filter currently covers is called the receptive field.
Working: Conv2D filters extend through the three channels in an image (Red, Green, and Blue), and the filter values may be different for each channel. After the convolutions are performed individually for each channel, the results are added up to get the final convolved image. The output of a filter after a convolution operation is called a feature map.
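A rough sketch of this per-channel behaviour, reusing the conv2d() helper from the earlier sketch (conv2d_multichannel is a hypothetical name, not a library call):

```python
import numpy as np

def conv2d_multichannel(image, kernels):
    """Convolve each channel with its own kernel slice, then sum into one feature map.

    image:   (H, W, C) array, e.g. C = 3 for RGB
    kernels: (kH, kW, C) array, one kernel slice per channel
    """
    feature_map = None
    for c in range(image.shape[-1]):
        fm = conv2d(image[:, :, c], kernels[:, :, c])   # per-channel convolution
        feature_map = fm if feature_map is None else feature_map + fm
    return feature_map

rgb = np.random.rand(5, 5, 3)
k = np.random.rand(3, 3, 3)
print(conv2d_multichannel(rgb, k).shape)   # (3, 3): one feature map for this one filter
```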
Each filter in this layer is randomly initialized by drawing its values from some distribution (Gaussian, uniform, etc.). Because each filter starts from different values, each one gets trained slightly differently, and they eventually learn to detect different features in the image.
If they were all initialized identically, the chances of two filters learning similar features would increase dramatically. Random initialization ensures that each filter learns to identify a different feature.
Since each conv2D filter learns a separate feature, we use many of them in a single layer to identify different features. The best part is that every filter is learnt automatically.
The feature maps produced by these filters are used as inputs to the next layer in the neural network.
If there are 8 filters in the first layer and 32 in the second, then each filter in the second layer sees 8 input feature maps. That means 32 × 8 intermediate maps are computed in the second layer; the 8 maps belonging to a single filter are added together, giving one output feature map per filter and 32 outputs in total.
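A minimal Keras sketch of this arrangement, assuming a 28×28 RGB input purely for illustration:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 3)),
    tf.keras.layers.Conv2D(8, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
])
model.summary()
# First layer output:  (None, 28, 28, 8)  -> 8 feature maps
# Second layer output: (None, 28, 28, 32) -> each of the 32 filters spans all 8 input
# maps; its 8 per-map results are summed into a single output feature map.
```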
What the conv2D layer is doing: Each filter in the conv2D layer is a matrix of numbers. The matrix corresponds to a pattern or feature that the filter is looking for.
In the image below, the filter is looking for a curved line. That curved line could correspond to the back of a mouse, or a part of the numbers 8, 9, 0, etc. Whenever the filter comes across a pattern like that in the image, it gives a high output.
Although this may seem like a very simple example, most conv2D filters in the first layer of a CNN search for similar features. It also means that the same filter can be used to extract information from multiple types of images (mouse, numbers, faces and so on).
Where the conv2D layer is used: These are used in the first few convolutional layers of a CNN to extract simple features. They have also been used in capsule networks. Previously, they were the sole filters used and they made up most of a CNN. For example, the original LeNet architecture and the AlexNet architecture mostly used conv2D filters.
Nowadays, with advancements in convolutional layers and filters, more sophisticated filters have been designed that can serve different purposes and can be used for different applications. We’ll look at some of them later on.
How to use them while designing a CNN: Conv2D filters are typically used in the initial layers of a Convolutional Neural Network. They are put there to extract the initial low-level features from an image.
While there are many rules of thumb for designing such filters, they are generally stacked with an increasing number of filters in each layer. Each successive layer can have two to four times the number of filters in the previous layer. This helps the network learn hierarchical features.
Limitations of the conv2D layer: The conv2D layer works impressively well. However, it has certain limitations, which prompted researchers to find alternatives.
Its biggest limitation is that it is computationally expensive. A large conv2D filter takes a long time to compute, and stacking many of them in layers multiplies the amount of computation.
An easy solution is to decrease the size of the filters and increase the stride. While you can do that, it also reduces the effective receptive field of the filter and the amount of information it can capture. In fact, in the first convolutional network paper, Yann LeCun mentioned his fear of using a 1×1 convolutional filter.
However, before we look at the other types of convolutions, it is best to get a more intuitive understanding of filters.
THE DILATED OR ATROUS CONVOLUTION
Conv2D layers are generally used for achieving high accuracy in image recognition tasks. However, they require a lot of computation and are very memory intensive.
Dilated or atrous convolutions reduce the complexity of the convolution operation. This means they can be used in real-time applications and on devices with limited processing power, such as smartphones.
How Dilated Convolutions Work: The dilated convolution keeps the computational cost down by adding another parameter to the conv2D kernel, called the dilation rate. The dilation rate is the spacing between the pixels sampled by the convolutional filter.
A 3×3 kernel with a dilation rate of 2 has the same field of view as a 5×5 kernel. This widens our field of perception without increasing the computational cost.
A wider field of view means a larger receptive field, which helps the filter capture more contextual information.
Moreover, because no downsampling is involved, dilated convolutions can pick up finer details in high-resolution images.
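A small Keras sketch of this trade-off (the input shape and filter counts are arbitrary, chosen only for illustration):

```python
import tensorflow as tf

# A 3x3 kernel with dilation_rate=2 covers a 5x5 area of the input,
# but keeps the parameter count of a 3x3 kernel.
dilated = tf.keras.layers.Conv2D(16, kernel_size=3, dilation_rate=2, padding="same")
dense5  = tf.keras.layers.Conv2D(16, kernel_size=5, padding="same")

x = tf.zeros((1, 64, 64, 3))
print(dilated(x).shape, dilated.count_params())   # (1, 64, 64, 16), 3*3*3*16 + 16 = 448
print(dense5(x).shape,  dense5.count_params())    # (1, 64, 64, 16), 5*5*3*16 + 16 = 1216
```

Note also that with padding="same" and no striding, the output keeps the input's spatial size, which is exactly what segmentation tasks need.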
What the Dilated Convolution is doing: The figures below show dilated convolutions.
The first image shows a normal conv2D filter. Here, the number of parameters is equal to the size of the receptive field. The red dots show the pixel values that are used for calculating the convolution.
In the second image, we apply a convolution with a dilation rate of 2 on top of the first, which increases the receptive field to 7×7. In other words, by stacking dilated convolutions we can grow the receptive field exponentially while the number of parameters grows only linearly.
The receptive field can be increased further by raising the dilation rate, as in the image on the far right. A larger receptive field lets us integrate more contextual knowledge at little extra cost.
Where are Dilated Convolutions Used: Dilated convolutions are useful when it comes to image segmentation tasks.
Traditionally, image segmentation requires us to perform a downsampling using a convolutional layer and then an upsampling using a deconvolution layer (more on deconvolutions later). The upsampling is done to keep the output image the same size as the input, but it introduces more parameters that need to be learned. Using a dilated convolution layer avoids the need for upsampling.
Dilated convolutions can also be used to detect finer details in images of higher resolution. They can give us a wide field of view, and perform the task of several layers of convolutions, without the extra computation cost.
Dilated convolutions were used in the WaveNet architecture that converts text to speech.
How to use the dilated convolution: The dilated convolution is used whenever we need to reduce our computational cost. If you are designing a CNN to run in real time on a smartphone or an IoT device, you can use dilated convolutions to cover a large receptive field with fewer parameters to compute.
In image segmentation tasks, a dilated convolution is used to keep the input and output images the same size.
Limitations of the Dilated Convolution: Dilated convolutions do not have any well-established limitations yet, since they have only been used in fairly specific applications.
However, one disadvantage could be that, since a single sampled pixel stands in for an area of pixels, the dilated convolution may be prone to the same loss of spatial information that pooling layers suffer from.
SEPARABLE CONVOLUTION
The problem with the traditional convolution layer is that it has too many parameters. A 3×3 convolution kernel has 9 parameters, and this number grows quadratically with kernel size: a 5×5 kernel has 25, a 7×7 kernel has 49, and so on.
The challenge with too many parameters is that not only do they take a long time to learn during training, but they also slow down predictions at inference time.
The dilated convolution reduces the parameters needed to cover a large receptive field, which makes it useful for image segmentation. The separable convolution reduces the computational cost and the number of parameters even further, so that it can be used on mobile and IoT devices.
How the Separable Convolution works: A convolution multiplies an image patch by a kernel to give a certain result. In some cases, we can get the same result by multiplying with two smaller kernels, one after the other. This is better understood with an example.
A normal 3×3 convolution has 9 parameters. If the kernel can be factored, the same 3×3 convolution can be computed as a 1×3 convolution followed by a 3×1 convolution. By applying them one after another, we get the same effect as the 3×3 convolution, but we have reduced the number of parameters to 6, thereby reducing our computational cost.
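Here is a small NumPy/SciPy sketch of this idea, using a blur kernel that happens to factor into a column filter and a row filter (only kernels that factor this way, i.e. rank-1 kernels, can be split exactly):

```python
import numpy as np
from scipy.signal import convolve2d

col  = np.array([[1.], [2.], [1.]]) / 4.0    # 3x1 filter
row  = np.array([[1., 2., 1.]]) / 4.0        # 1x3 filter
full = col @ row                             # the equivalent 3x3 (Gaussian-like) kernel

image = np.random.rand(6, 6)
direct    = convolve2d(image, full, mode="valid")
separated = convolve2d(convolve2d(image, col, mode="valid"), row, mode="valid")
print(np.allclose(direct, separated))        # True: same output, 6 parameters instead of 9
```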
Feature maps are also merged differently in separable convolutions. To get the same effect as a 3×3 convolution, we cannot merge the 1×3 and 3×1 feature maps separately; instead, the merging happens only at the end, after the 3×1 convolution has been applied.
Where is it used: The separable convolution has been used extensively in the Xception CNN. Xception was designed by François Chollet, who is also the author of the Keras deep learning library, keeping in mind that most earlier CNN architectures were too large to run in real time on mobile or IoT devices.
By reducing the number of parameters, the Xception model became a CNN that could be deployed in the real world. The graph below shows a comparison of the accuracy and the number of parameters in the different CNN models.
Xception and Inception, owing to their smaller sizes, have found extensive use in embedded systems such as Raspberry Pis and Arduinos.
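In Keras, the depthwise-separable variant that Xception actually uses (the convolution is split across channels and then mixed with a 1×1 convolution, rather than split spatially) is available as SeparableConv2D. A rough parameter comparison, with shapes chosen only for illustration:

```python
import tensorflow as tf

standard  = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same")
separable = tf.keras.layers.SeparableConv2D(64, kernel_size=3, padding="same")

x = tf.zeros((1, 32, 32, 32))
print(standard(x).shape,  standard.count_params())    # 3*3*32*64 + 64      = 18,496
print(separable(x).shape, separable.count_params())   # 3*3*32 + 32*64 + 64 =  2,400
```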
DECONVOLUTION
Before I explain what deconvolution is, it is important to understand what a convolution layer is doing intuitively.
A convolution converts all the pixels in its receptive field into a single value. When we apply a convolution to an image, we are not only decreasing the image size (downsampling), but we are also bringing all the information in the field together into a single pixel.
The final output of the convolutional layer is a vector. This vector can be called an image embedding as it contains all the information needed to understand what is present in that image.
This turns out to be important when we are trying to predict what is in an image as it concentrates the important features. At the same time though, it reduces the resolution of images, which is great for predicting classes, but not for image segmentation or generation. In these cases, we need to go from a lower resolution image to a higher one. This is where a deconvolution layer comes in.
The name deconvolution is not really apt, because we aren’t actually performing a deconvolution. In image processing, a convolution is a multiplication in Fourier space, and a deconvolution applied to a convolved image gives back the original image.
Deconvolution in that image-processing sense cannot be done in machine learning, because the operation performed by a convolutional layer, unlike a simple Gaussian blur, is not an invertible process. What this means is that the operation is a black box of sorts, and no one has quite figured out how to get the original pixel values back from a convolved image. In machine learning, a deconvolution is simply another convolution layer that spaces out the pixels and performs an upsampling.
A convolution integrates the information spread over a large area into a single pixel; whereas a deconvolution spreads that information out over a large area.
Deconvolutions have another name to prevent confusion: Transposed Convolutions.
How the deconvolution works: It is not possible to interpolate a single pixel value into many values.
(Figure: convolution, left, versus deconvolution, right.)
The quick and dirty solution is to apply some fancy padding to the image and then apply a convolution operation. This obviously does not reproduce the pixel values of the original image; in fact, it may produce some very strange values, or a lot of zeros. However, it does bring the image back to its original spatial dimensions, which is what we were trying to do in the first place.
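A minimal Keras sketch of this round trip, with shapes chosen only for illustration: a strided convolution halves the spatial size, and a transposed convolution brings it back to the original dimensions (though not, of course, the original pixel values):

```python
import tensorflow as tf

down = tf.keras.layers.Conv2D(16, kernel_size=3, strides=2, padding="same")
up   = tf.keras.layers.Conv2DTranspose(3, kernel_size=3, strides=2, padding="same")

x = tf.zeros((1, 28, 28, 3))
h = down(x)
y = up(h)
print(h.shape)   # (1, 14, 14, 16) -- spatial size halved
print(y.shape)   # (1, 28, 28, 3)  -- original spatial size restored
```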
Where to use the Deconvolution layer: As I mentioned before, deconvolutions are used in image segmentation tasks.
When we apply a convolution, we reduce the spatial dimensions of the image. Deconvolutions are applied after the convolution layers to bring the output image back to the same size as the input.
POOLING LAYER
Note: Pooling is not a convolutional layer, but we are talking about it here, as it is a layer that is used commonly in CNNs.
The pooling layer was introduced for two main reasons: first, to perform downsampling, that is, to reduce the amount of computation that needs to be done; and second, to pass only the most important information on to the next layers of the CNN.
How they work: There are two kinds of pooling layers: max pooling and average pooling.
In max pooling, we take only the value of the largest pixel among all the pixels in the receptive field of the filter. In the case of average pooling, we take the average of all the values in the receptive field.
There are many arguments about which one is better and many rules of thumb about when to use which, but max pooling is the more commonly used of the two.
Since we are trying to downsample the input, pooling kernels do not overlap, i.e., they use a stride at least as large as the kernel itself (typically equal to it).
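A tiny NumPy sketch of both kinds of pooling on a 4×4 input, using 2×2 windows with stride 2:

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 8, 1],
              [3, 4, 5, 9]], dtype=float)

# Split the 4x4 input into four non-overlapping 2x2 blocks.
patches  = x.reshape(2, 2, 2, 2).swapaxes(1, 2)
max_pool = patches.max(axis=(2, 3))    # [[6., 4.], [7., 9.]]
avg_pool = patches.mean(axis=(2, 3))   # [[3.75, 2.25], [4.0, 5.75]]
print(max_pool)
print(avg_pool)
```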
Limitations of the Pooling layer: Downsampling by pooling causes lots of problems in CNNs.
As mentioned by Geoffrey Hinton, the pooling layer loses positional information about the different objects inside the image. This is why many newer architectures have stopped using the pooling layer altogether.
The pooling layer was introduced to reduce computational time and complexity by reducing the number of parameters. With the rise in computational power and the presence of better ways of downsampling, like Separable and Dilated Convolutions, the pooling layer can be cast aside.
Conclusion
Different convolutional operations can be used to perform different functions and get results as per our needs. A traditional conv2D can be used when we want good results and we have abundant computational resources. On the other hand, if we need to run our CNN on embedded devices, we can use separable convolutions to reduce our computational needs.