November 16, 2017 (16 minute read)

Capsule Networks and the Limitations of CNNs

Convolutional Neural Networks are considered the state of the art in computer vision related Machine Learning tasks. Soham Chatterjee highlights the limitations of CNNs and discusses alternative models that more closely mirror the way the human brain works. He uses Professor Geoffrey Hinton’s paper, Dynamic Routing Between Capsules, to establish certain points.

Convolutional Neural Networks, popularly called CNNs, have been around for a while; in fact, they are our go-to algorithm for any Computer Vision related task. CNNs are, as of now, the final answer when it comes to computer vision related Machine Learning tasks. They are used widely in object recognition systems, self-driving cars, etc. They can even be used to create new paintings based on the patterns of famous painters of the past! Part of what makes them so widely used is that they are really good at what they do.

However, in this article, I am trying to accomplish a rather hard task – highlight the limitations of CNNs and present an alternative model of computer vision that is modeled more closely on how the brain actually works. I’ll be referring to Professor Geoffrey Hinton’s paper, called Dynamic Routing Between Capsules (published November 07, 2017), to make certain points.


In the ImageNet competition, which has 1000 labels, the VGG-16 net has a top-5 error of 8.70% and the Inception-V3 net has a top-1 accuracy of 80%. Moreover, CNNs have been shown to be transferable: they can be trained on one set of images and then, by re-training only a small subset of their layers, used to make predictions on a completely different set of images.
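
To make the top-5 and top-1 figures concrete, here is a minimal sketch of how a top-k error is computed. The function name `top_k_error` and the toy scores are illustrative assumptions, not ImageNet data or any library's API.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this example.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example with 4 classes and 3 images (made-up scores).
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1: correct at top-1
    [0.40, 0.30, 0.20, 0.10],  # true class 1: wrong at top-1, correct at top-2
    [0.50, 0.30, 0.15, 0.05],  # true class 3: wrong at both
]
labels = [1, 1, 3]

top1_error = top_k_error(scores, labels, k=1)
top2_error = top_k_error(scores, labels, k=2)
```

A top-1 accuracy is simply one minus the top-1 error, so the two numbers quoted for VGG-16 and Inception-V3 measure different things and are not directly comparable.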

But whenever we have something that works really well, we tend to stop innovating and accept that as the best way of doing it. Henry Ford is often quoted as saying, “If I had asked people what they wanted, they would have said – faster horses!” This attitude is also true when it comes to CNNs.

Most of the latest developments in CNNs have come by making them deeper and more complex. We use a large number of training images and push more layers in different combinations to make them perform well. It is almost as if we are trying to brute force a correct prediction from these networks. With more complex networks, we need more computational power, which makes them less scalable and less likely to be actually used in practice.

We seem to have become so accustomed to how well they perform that we have stopped trying to figure out where they go wrong and how we might improve them. In the case of CNNs, they do go wrong in a couple of places.


“What is wrong with convolutional neural nets?” is the name of a talk that Professor Hinton, one of the pioneers of neural networks, has been delivering for quite a while now. In his talk, he also explains an alternative approach to the vision problem that he calls Capsule Networks. He has recently published a paper on this topic, titled Dynamic Routing Between Capsules. We shall look into this model soon, but before that, I’ll highlight some of the areas where CNNs fall short according to Professor Hinton.


One of Professor Hinton’s biggest arguments for what is wrong with CNNs is that they do not encode the position and orientation of the object into their predictions.

The main reason CNNs did not work well initially was that they could only make predictions on an image if the images they were trained on and the test image were almost perfectly aligned. This means that if a CNN is trained on images of dogs sitting in a certain position in a certain part of an image, then it will only be able to identify dogs in test images that are sitting in similar positions in similar parts of the image. Even a slight position shift will throw off the CNN’s prediction.

AlexNet, the CNN that won the ImageNet competition in 2012, addressed this problem by training on augmented images. That is, its authors took each image and created many versions of it, for example by cropping shifted patches, mirroring them horizontally, and altering the color intensities. This helped AlexNet learn an internal representation that covers many different variations of an image.
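
The idea of augmentation can be sketched in a few lines. This is a simplified illustration using NumPy, not AlexNet's actual pipeline; the function name `augment`, the shift range, and the wrap-around behaviour of `np.roll` are assumptions made to keep the example short.

```python
import numpy as np


def augment(image, max_shift=2, seed=0):
    """Return the original image plus mirrored and randomly shifted variants."""
    rng = np.random.default_rng(seed)
    variants = [image, np.fliplr(image)]  # original and horizontal mirror
    for _ in range(2):
        dy, dx = (int(d) for d in rng.integers(-max_shift, max_shift + 1, size=2))
        # np.roll wraps pixels around the border; real pipelines usually
        # crop or zero-pad instead. This is a simplification.
        variants.append(np.roll(image, shift=(dy, dx), axis=(0, 1)))
    return variants


image = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for a real image
variants = augment(image)
```

Each variant carries the same label as the original, so the network sees the same object at several positions without anyone having to collect new images.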

That, you would agree, is “very creative” problem-solving. Show a human an image of, say, an aardvark, and they can still identify that aardvark whether it is tilted, inverted, and so on.

CNNs are very bad at encoding these different representations of pose and orientation within themselves. This problem in the CNN is what Professor Hinton called the Invariance vs. Equivariance problem.

The final label of an image, that is, the final prediction of the CNN, should be viewpoint invariant: a face should be identified as a face whether it is small, large, or upside down. Internally, however, the network should still build a representation of the position and orientation of that face.

CNNs do not do that. They lose most of their internal information about the pose and orientation of the object, and they route all the information to the same neurons, which may not be able to deal with this kind of information.

Professor Hinton argues that we should have equivariance in the network – a way to map changes in the viewpoint and pose to changes in the neural activities of the network.
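
The difference between the two properties can be shown on a toy 1D signal. The sketch below is an illustration with made-up numbers, not code from the talk or the paper: a convolution is equivariant to shifts (the feature map moves with the input), while a global max over that feature map is invariant (the answer stays the same and the position is forgotten).

```python
import numpy as np


def conv1d_valid(signal, kernel):
    """Plain 'valid' cross-correlation, the operation a CNN layer applies."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])


kernel = np.array([1.0, 2.0, 1.0])
signal = np.array([0, 0, 1, 3, 1, 0, 0, 0, 0, 0], dtype=float)
shifted = np.roll(signal, 3)  # the same pattern, moved 3 steps to the right

feat = conv1d_valid(signal, kernel)
feat_shifted = conv1d_valid(shifted, kernel)

# Equivariance: the feature map moves by the same 3 steps as the input.
equivariant = bool(np.allclose(np.roll(feat, 3), feat_shifted))

# Invariance: a global max returns the same value for both inputs,
# so the information about where the pattern was is gone.
invariant = bool(feat.max() == feat_shifted.max())
```

Here `feat.argmax()` and `feat_shifted.argmax()` differ by exactly the shift, which is the kind of information an equivariant representation keeps and an invariant one throws away.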

Take a look at this image from his talk. Can you try and identify which continent that is? If you said that it looks a bit like Australia then you are correct. However, if you flip the image and look at it taking the right hand side of the image as the frame of reference, then it sort of looks like Africa, and you would still be correct.

This simple exercise shows that our perception of an image, or the kind of information that we draw from an image, depends heavily upon our point of reference. This makes sense even in our day-to-day life.

Every time we see something, we set up a rectangular frame of reference and derive information from the image relative to that frame, something that CNNs don’t do. Instead, they rely on a single frame of reference, and sometimes none at all, to understand images. Therefore, if we are to model vision close to how humans see, then CNNs are not the way to go.


Strike two is based on the mechanism that CNNs use to make predictions.

A CNN makes predictions by looking at an image and then checking to see if certain components are present in that image or not. If they are, then it classifies that image accordingly.

If a CNN is asked to identify the first image below, a normal face, it will identify it as a face. However, if asked to predict what the second image is, in which the same features are scrambled, it will still identify it as a face!

It does this because the CNN only checks to see if certain features, like eyes, ears, a mouth, and a nose, are present in the image. However, the CNN does not check the relative locations of these features to each other.

This shortcoming can be traced back to the pooling layer.

In the pooling layer, we generally take the input feature map and keep only the largest activation within each receptive field. This operation has no learnable parameters; it is done because it reduces the computation cost and also helps select the most important features from an image. Put simply, it just performs a sub-sampling of features.

However, at the same time, the pooling layer loses a lot of the positional information about a feature. This means that the pooling layer ignores where certain features have occurred, leading again to invariance among features.
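
A minimal NumPy sketch of 2×2 max pooling makes the loss of position visible. The arrays `a` and `b` are made-up feature maps, and `max_pool_2x2` is an illustrative helper, not a library function.

```python
import numpy as np


def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


# Two feature maps with the same two activations at different positions.
a = np.zeros((4, 4))
a[0, 0] = 1.0
a[2, 3] = 1.0

b = np.zeros((4, 4))
b[1, 1] = 1.0  # moved within the top-left pooling window
b[3, 2] = 1.0  # moved within the bottom-right pooling window

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
```

Both inputs pool to the same 2×2 output, so every layer after the pooling step cannot tell them apart. Stacking several pooling layers makes the windows effectively larger, and the network tolerates correspondingly larger rearrangements of features.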

In real life, however, we take into account the relationship of objects with the surroundings and their orientation to identify objects. In CNNs, these orientations and their surroundings are not taken into account.

Moreover, when we see an image and then see the same image slightly darkened, we can still identify the two as the same. On the other hand, if you see an image and then a covariant transformation of that image, you will not be able to identify them as the same. It makes sense, therefore, for a neural network that is detecting images to check for the intensities and the pose of the image, instead of its covariance structure, as a CNN usually does.


Finally, Professor Hinton argues that the way a CNN routes data from the lower levels to the higher levels is fundamentally wrong.

In a CNN, all low-level details are sent to all the higher level neurons. These neurons then perform further convolutions to check whether certain features are present. This is done by striding the receptive field and then replicating the knowledge across all the different neurons.

Professor Hinton argues that instead of having the information go through all the neurons, like in a conventional CNN, it is better to route the image to specific neurons that have the capability to deal with those features. Even in the brain, we have certain areas that are dedicated to deciphering certain kinds of information and it makes sense to do the same in neural networks to get better predictions.

According to Professor Hinton, if a lower level neuron has identified an ear, it then makes sense to send this information to a higher level neuron that deals with identifying faces and not to a neuron that identifies chairs. If the higher level face neuron gets a lot of information that contains both the position and the degree of certainty from lower level neurons of the presence of a nose, two eyes and an ear, then the face neuron can identify it as a face.

His solution is to have capsules, or groups of neurons, in the lower layers that identify certain patterns. Each capsule would then output a high-dimensional vector that contains information about the probability that a pattern is present, together with its position and pose. These values would then be fed to the higher-level capsules, which take multiple inputs from many lower-level capsules.

In the case of a face, the lower level capsules would route information about the pose and the probability of the presence of ears, nose, eyes, etc, to a higher level face capsule. Since capsules are effectively smaller neural networks, the output of each capsule would be quite high dimensional. The face capsule would then take this high dimensional data and be able to tell whether a face is present or not, by comparing the relative “closeness” of the different low-level features.

Since we are dealing with high dimensional data across all the layers, the probability of this “closeness” occurring by chance gets smaller and smaller as the number of dimensions increases. This would also reduce errors in the system.

In CNNs, by routing all the lower level information to all the higher level layers not only are we increasing our computational cost, but we are also sending data to neurons that are not adept at dealing with such information.

This is the final nail in the coffin for CNNs and the reason why we need an alternative.


Before we look at what the alternative is, let’s see what the problems with CNNs were again:

  1. Orientation and placement of the various parts of an object in the image are not taken into account: A face is not a face if the eyes are located beside the ears and the nose is below the mouth.
  2. There is no routing of high dimensional data to specific neurons: The vector of the mouth prediction should be sent to the neurons that deal with the prediction of faces, and not to the ones that predict cars.
  3. There is no encoding of the pose and orientation of the object: Neural activities need to be different for the same object in different poses.


One of the motivations behind the capsule net comes from how neurons in the brain are arranged.

In any neural network architecture, there is only a single flow of information. Low-level neurons get information from the input and they translate that into higher and higher dimensions to make sense of the data and find a decision boundary. And to put it in very blunt and basic terms, we decide that the neuron with the loudest bark is telling us the correct answer.

This kind of architecture is inspired by the brain, but it is not how the brain works. In the brain, there are various regions which deal with understanding certain kinds of information. Here, the key word is “region” and not “neurons”. In the brain, an entire group of neurons helps in making a decision.

But what is the advantage of using “regions” or a bunch of neurons instead of a single one? Dimensionality. A neuron in a neural network can take an input vector and check how well it correlates with its weight vector. But it cannot compare two input or activity vectors with each other.

For example, say we have a group of neurons that can identify a wheel and another group of neurons that can identify a car door. Both of these groups output a vector of high dimensionality. Then, by being able to compare these two vectors in high dimensions, we can check whether they “add up” to form a car.

By being able to correlate two activity vectors, we can extract more information and do a lot of other useful calculations. If two such activity vectors agree in a high-dimensional space, the agreement is much more meaningful, because it is very unlikely to happen by chance. Hence, predictions become more accurate and we are able to filter out noise, as high dimensional coincidences do not happen.
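
The claim that high dimensional coincidences are rare can be checked numerically. The sketch below is an illustration, not part of Professor Hinton's argument as presented: it draws pairs of random vectors and measures how similar they are on average. The function name, the sample size, and the use of Gaussian vectors are all assumptions.

```python
import numpy as np


def mean_abs_cosine(dim, pairs=2000, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((pairs, dim))
    v = rng.standard_normal((pairs, dim))
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    return float(np.mean(np.abs(cos)))


similarity_2d = mean_abs_cosine(2)    # random 2D vectors often look alike
similarity_64d = mean_abs_cosine(64)  # random 64D vectors are nearly orthogonal
```

Unrelated vectors in two dimensions frequently point in similar directions, while in 64 dimensions they almost never do. So when two capsules produce closely matching high-dimensional vectors, that agreement is strong evidence rather than noise.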


A capsule is a group of neurons that has the ability to identify certain kinds of objects and also tell us the position of that object in that image.

When a low-level capsule is activated, it is saying that there is a high probability that the entity it is built to detect is present in the image, and that its location is here. This low-level capsule then routes its information to a higher-level capsule that can make sense of that data.

The high-level capsule gets similar data from many different lower-level capsules, each telling it what it has seen and how sure it is that the thing it has seen is actually present. It looks at all this data, finds a tight cluster of predictions, and then outputs another vector, which is again fed to a capsule at the next higher level.

The outcome:

  • Problem 3 is solved: We can transfer knowledge about the pose and orientation of the image.
  • Problem 2 is solved: We have achieved Routing and High Dimensionality.

Now let’s see how the capsules work and actually transfer data.


In the image we can see two layers of capsules. The lower layer consists of the capsules that can identify a mouth and a nose.

  • Ti – Has the coordinate/pose of the mouth
  • Pi – Has the probability that the entity is a mouth
  • Th – Has the coordinate/pose of the nose
  • Ph – Has the probability that the entity is a nose

Now, we multiply the coordinate given by the mouth capsule and the coordinate given by the nose capsule with Tij and Thj respectively.

Here, Tij and Thj are weight matrices that after being multiplied, give the position that the face should have with respect to the mouth and nose respectively.

So Tij * Ti and Thj * Th give us two different poses of the face.

If Tij * Ti and Thj * Th are similar, then we have a face.

This brings out a covariance of activities rather than that of weight vectors.

The matrices Tij and Thj are viewpoint invariant: they represent the fixed relationship between the mouth and the face, and between the nose and the face. So even if we get a different viewpoint, the capsules will be able to encode it in their poses, and as long as the products of the capsules’ poses and their weight matrices match, we can translate that into a successful prediction.

This is how the capsules interact with each other and use pose between objects to make predictions. Now, problem 1 has also been solved.
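
The agreement test can be sketched with 2D poses written as 3×3 homogeneous matrices. Everything in this example is an assumption made for illustration: the geometry of the face, the helper names, and the matrix convention. With column-vector coordinates the product is written as pose @ T, which differs from the text's Tij * Ti only in the order imposed by the convention.

```python
import numpy as np


def pose(angle, tx, ty):
    """3x3 homogeneous 2D pose: rotation by `angle`, then translation."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])


# Hypothetical fixed geometry: where the mouth and nose sit in the face frame.
mouth_in_face = pose(0.0, 0.0, -2.0)
nose_in_face = pose(0.0, 0.0, 0.5)

# Part-to-whole matrices (Tij and Thj in the text). They do not depend on viewpoint.
T_ij = np.linalg.inv(mouth_in_face)
T_hj = np.linalg.inv(nose_in_face)


def face_votes(face_in_image, nose_offset=None):
    """Each part predicts the face pose from its own observed pose."""
    if nose_offset is None:
        nose_offset = np.eye(3)
    T_i = face_in_image @ mouth_in_face               # observed mouth pose
    T_h = face_in_image @ nose_in_face @ nose_offset  # observed nose pose
    return T_i @ T_ij, T_h @ T_hj


def agree(votes, tol=1e-6):
    return bool(np.allclose(votes[0], votes[1], atol=tol))


upright = agree(face_votes(pose(0.0, 5.0, 5.0)))
rotated = agree(face_votes(pose(np.pi / 3, 1.0, 8.0)))  # new viewpoint, same face
scrambled = agree(face_votes(pose(0.0, 5.0, 5.0), nose_offset=pose(0.0, 3.0, -4.0)))
```

The two votes match for the upright and the rotated face, because a change of viewpoint moves both parts together. When the nose is displaced relative to the mouth, the votes disagree, which is exactly the scrambled-face case that a CNN fails to reject.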


In the paper, Dynamic Routing Between Capsules, Professor Hinton and his co-authors use just a 3-layer network to achieve nearly state-of-the-art accuracies on MNIST.

The first layer or the bottom layer is convolutional in nature (ironic much?). It has 256 convolutional filters with a receptive field of 9×9 and a stride of 1.

Each filter in this layer converts pixel intensities into the activities of local feature detectors, which are then used as inputs to the primary capsules. Because the layer is convolutional, its weights are shared across all positions in the image.

The second layer is the “primary capsule” layer. It is a convolutional capsule layer with 32 channels, where each capsule contains 8 convolutional units with 9×9 kernels and a stride of 2. Every primary capsule sees the outputs of all 256 convolution kernels in the first layer whose receptive fields overlap with its location, and it outputs an 8D vector. No routing takes place yet.

The last layer is the DigitCaps layer. It has a 16D capsule per digit in MNIST. Routing takes place in between the DigitCaps layer and the Primary Capsule layer.
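
The layer sizes can be checked with simple arithmetic. This sketch assumes 28×28 MNIST inputs and unpadded convolutions, which is consistent with the 6×6 grid of primary capsules reported in the paper; the helper name `conv_out` is illustrative.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


input_size = 28  # MNIST images are 28x28

conv1_size = conv_out(input_size, kernel=9, stride=1)    # first conv layer
primary_size = conv_out(conv1_size, kernel=9, stride=2)  # primary capsule grid

# 32 channels of 8D capsules at every grid position.
num_primary_capsules = 32 * primary_size * primary_size

# One 8x16 transformation matrix per (primary capsule, digit capsule) pair.
num_digit_capsules = 10
routing_weights = num_primary_capsules * num_digit_capsules * 8 * 16
```

Under these assumptions the first layer produces a 20×20 map, the primary capsule layer a 6×6 grid, and there are 32 × 6 × 6 = 1152 primary capsules in total, each of which sends a prediction to all 10 digit capsules.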

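At the start of routing, the routing logits are initialized to zero, which means the output of each capsule in the Primary Capsule layer is initially sent to all the capsules in the DigitCaps layer with equal weight. The routing between the two layers is then refined iteratively: connections whose predictions agree with the output of a DigitCaps capsule are strengthened.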
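
The following is a minimal NumPy sketch of routing by agreement as the paper describes it: logits start at zero, a softmax turns them into coupling coefficients, and agreement between a prediction and a capsule's output raises the logit. The toy sizes and input values are assumptions, and the learned transformation matrices are skipped by supplying the prediction vectors `u_hat` directly.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Shrink vector length into [0, 1) while keeping its direction."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def route(u_hat, iterations=3):
    """u_hat[i, j] is lower capsule i's prediction vector for higher capsule j."""
    b = np.zeros(u_hat.shape[:2])                   # routing logits start at zero
    for _ in range(iterations):
        c = softmax(b, axis=1)                      # how capsule i splits its output
        s = np.einsum("ij,ijd->jd", c, u_hat)       # weighted sum of predictions
        v = squash(s)                               # higher-level capsule outputs
        b = b + np.einsum("ijd,jd->ij", u_hat, v)   # reward agreement
    return c, v


# Toy input: 6 lower capsules, 2 higher capsules, 4D predictions.
u_hat = np.zeros((6, 2, 4))
u_hat[:, 0, 0] = 1.0                                # all six agree about capsule 0
u_hat[:, 1, 1] = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # predictions for capsule 1 cancel

coupling, outputs = route(u_hat)
```

After three iterations most of each lower capsule's output is routed to higher capsule 0, whose output vector is long, while capsule 1 receives contradictory predictions and stays switched off. The length of each output vector plays the role of the probability that the entity is present.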


Professor Hinton and team trained their caps net on an expanded MNIST dataset. They expanded it by shifting the images by up to 2 pixels in each direction, with zero padding. The underlying MNIST dataset has 60K training examples and 10K testing examples. They also trained a convolutional baseline on the same dataset.

The authors of this paper also went one step further and attached a decoder network to the output of the capsule network. The decoder reconstructs the digit from the output vector of its capsule, which shows what the capsule network thought it saw. The results are in the image below.


l, p, r represent the label, the prediction and the reconstruction target, respectively. The network failed to predict the two rightmost images, but it did reconstruct the second from the right image correctly.

The reconstructed images are not only clearer and less noisy, but they also keep all the important details of the digits. Capsule networks with more routing iterations and with reconstruction achieved a lower test error on both the MNIST and the expanded MNIST data sets, as can be seen in the table below.

The authors also say that they achieved these results with fewer parameters than the CNN they compared against.

The authors also tested on another dataset called affNIST, in which each image is an MNIST digit with a small random affine transformation applied.

When the caps net was tested on this dataset after training only on the expanded MNIST dataset, it achieved a classification accuracy of 79%, whereas a similarly trained CNN achieved an accuracy of only 66%.

In a previous talk, Professor Hinton also mentioned that capsules could reach a comparable level of accuracy after training on less than one-tenth of the data that CNNs need!

These are amazing results that show how the capsule net can generalize better across many different viewpoints and transformations.


Dynamic routing can be seen as a parallel attention mechanism. This means that each capsule can attend to several details, allowing the model to identify multiple objects in an image.

To test this, they created the MultiMNIST dataset which contained two digits with about 80% overlap. The results are shown in the image below.

L:(l1, l2) represents the labels of the two digits in the image and R:(r1, r2) represents the two digits used for reconstruction, with green and red showing the two reconstructed digits. The two rightmost columns show examples of classification and reconstruction errors.

The capsule net achieved a classification error rate of 5% on the test dataset, which is comparable to a sequential attention model that was evaluated on a much easier dataset in which the digits overlapped by less than 4%!

All these results show us that capsule networks are better than convolutional neural networks at identifying multiple objects and at generalizing across viewpoints. However, there is more work to be done in this field, and we can achieve better results with better architectures and better capsule and routing algorithms.

The authors of the paper say, “There are many possible ways to implement the general idea of capsules. The aim of this paper is not to explore this whole space but simply to show that one fairly straightforward implementation works well and that dynamic routing helps.”

But despite that, with results like the ones above, I believe that the end of our dependence on CNNs is fast approaching.
