Nowadays, Deep Learning is often the first thing that comes to people’s minds when they hear about Artificial Intelligence. This is mainly because deep learning has become the de facto approach to solving a wide range of problems over the last decade, following the so-called Deep Learning Revolution. However, Deep Learning (the study of methods based on artificial neural networks) is actually just a subfield of Machine Learning (the study of algorithms that “learn” to solve problems by example), which is itself a subfield of Artificial Intelligence (the study of intelligent machines).

Source: Wikipedia (2019).

The “Deep Learning Revolution”

The “Deep Learning Revolution” refers to a period, starting in 2012, in which deep learning models began to make substantial progress on important problems. This period is historically very significant, as artificial neural networks had been dismissed by many AI researchers almost immediately after their inception (they were first proposed in 1943). The deep learning revolution showed that these models were not insufficiently powerful, as most had believed until then; they simply lacked the computing power and training data required to perform well. Thus, in a sense, big data paved the way for deep learning.

Convolutional Neural Networks (CNNs) had a big role to play in this revolution. In 2012, a research paper by Geoffrey Hinton’s lab – Hinton is one of the recipients of the 2018 Turing Award, along with Yann LeCun and Yoshua Bengio – won the ImageNet competition (an image classification challenge) by a large margin. Their approach relied on a CNN – a simple, yet very powerful deep learning model.

What exactly are CNNs and how do they work?

The first step is to understand what a *convolution* is. Despite the fancy name, a convolution is just a sliding weighted sum. In other words, when you apply a convolution to an image, you replace the value of each pixel by a weighted sum of the values of its neighbors. Each choice of weights yields a possibly different result, and some of them are very useful.
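To make this concrete, here is a minimal sketch of a convolution in Python with NumPy. The function `convolve2d` and the example image are illustrative stand-ins, not part of any library:

```python
import numpy as np

def convolve2d(image, kernel):
    """Replace each pixel by a weighted sum of its neighborhood.

    A minimal 'valid' convolution: the output shrinks by the kernel
    size minus one along each axis, since border pixels are skipped.
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # weighted sum of the patch currently under the kernel
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Equal weights for all nine neighbors: an arithmetic mean (a blur).
blur = np.ones((3, 3)) / 9.0

image = np.arange(25, dtype=float).reshape(5, 5)
print(convolve2d(image, blur))  # each value is the mean of a 3x3 patch
```

Real libraries add padding and striding options on top of this idea, but the core operation is exactly this double loop of weighted sums.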

Possibly the simplest example is just a regular, arithmetic mean over pixel values. For example, consider this grayscale picture of a dog:

Source: Wikipedia (2019).

If you replace the value of each pixel by the average value of its neighbors, it becomes a little bit blurred.

With each successive convolution, the image becomes more blurred. See below the result of applying 5 convolutions in sequence:

If you replace the value of each pixel by the difference between itself and the average of its neighbors, you obtain a very different result, which can be interpreted as a map of the edges of the original image. That is, regions of sharp contrast (like the boundary between the dog’s ears and the background) are enhanced, while homogeneous regions such as the dog’s fur are erased. This is an example of a high-pass filter (low frequencies are erased, high frequencies remain), while the previous “blur” filter is an example of a low-pass filter (high frequencies are erased, low frequencies remain).
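Both filters can be written down as small kernels. Here is a minimal sketch with NumPy, using standard textbook kernel values (not values taken from this post’s figures):

```python
import numpy as np

# Low-pass (blur): each output pixel is the mean of its 3x3 neighborhood.
low_pass = np.ones((3, 3)) / 9.0

# High-pass (edge): the pixel itself minus the neighborhood mean,
# i.e. an identity kernel minus the blur kernel.
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
high_pass = identity - low_pass

# On a homogeneous region, the high-pass response vanishes...
flat = np.full((3, 3), 7.0)
print(np.sum(flat * high_pass))   # ~0.0

# ...while a sharp edge produces a strong (here negative) response.
step = np.array([[0., 0., 9.],
                 [0., 0., 9.],
                 [0., 0., 9.]])
print(np.sum(step * high_pass))   # approximately -3.0
```

Sliding these kernels over the whole dog picture (as in the convolution loop above) would produce exactly the blurred image and the edge map described in the text.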

Both filters are very useful and have been known to the field of image processing for a long time. Many other filters exist, each with a particular purpose. But until CNNs, they had to be designed by humans. CNNs learn these filters by being fed millions of input/output examples: for instance, millions of images of cats and dogs as inputs, and 0s and 1s as outputs. The goal of such a CNN is to map the input images to the output labels (0 for cat and 1 for dog). It does so by learning several filters that are applied to the input image. The result is several filtered versions of the original image, each capturing some visual aspect that the model has learned to weigh as relevant.
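As a toy illustration of filters being learned from input/output examples, the sketch below recovers a known 3x3 filter by gradient descent on random patches. This is a deliberately simplified stand-in for CNN training, not how a full network is optimized:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 'ground truth' filter the learner should discover: a vertical
# edge detector (the classic Sobel kernel).
true_kernel = np.array([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])

# Training examples: random 3x3 patches (inputs) and the true
# filter's response on each patch (outputs).
patches = rng.standard_normal((500, 3, 3))
targets = np.einsum('nij,ij->n', patches, true_kernel)

# Start from an all-zero kernel and follow the gradient of the
# squared error between predictions and targets.
kernel = np.zeros((3, 3))
lr = 0.1
for _ in range(200):
    preds = np.einsum('nij,ij->n', patches, kernel)
    grad = np.einsum('n,nij->ij', preds - targets, patches) / len(patches)
    kernel -= lr * grad

print(np.round(kernel, 2))  # recovers the vertical edge detector
```

A real CNN learns many such kernels at once, with the “targets” coming not from a known filter but from the classification error backpropagated through the whole network.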

CNNs do not solve the problem simply by learning a number of filters and applying them to the original image. What they actually do is repeat this process multiple times. They apply an initial layer of filters to obtain activations (filtered versions of the input image), and then apply another layer of filters over those activations to obtain new activations. This process is repeated several times (5 in the case of the network used to produce the images above, AlexNet). The idea is to reduce the size of the images with each successive layer, abstracting away irrelevant information and retaining only the most important features in order to make a final prediction. For a cat/dog classification network, those would be snouts, ears, eyes, mouths and other defining features. But CNNs learn these complex features step by step, first detecting simple shapes, such as lines and circles, and combining them into more complex features at each successive layer. See for example some activations of AlexNet’s second convolutional layer below:
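The repeated filter-and-shrink process can be sketched as follows. The kernels here are random and the pooling is crude striding, so this is only a schematic of the idea, not AlexNet’s actual architecture:

```python
import numpy as np

def convolve(image, kernel):
    """'Valid' 2D convolution: a sliding weighted sum over the image."""
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def layer(activation, kernel):
    """One layer: convolve, apply a ReLU non-linearity, then halve the size."""
    a = np.maximum(convolve(activation, kernel), 0.0)  # ReLU
    return a[::2, ::2]                                 # keep every other pixel

rng = np.random.default_rng(0)
x = rng.random((64, 64))                # a stand-in 64x64 'image'
for k in range(3):
    x = layer(x, rng.standard_normal((3, 3)))
    print(f"after layer {k + 1}: {x.shape}")
# the activations shrink with each layer: 31x31, then 15x15, then 7x7
```

Each layer sees an already-filtered, smaller version of the input, which is what lets later layers respond to combinations of the simpler features detected earlier.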

At some point, the activations stop making sense to a human observer, but they encapsulate the information that matters most to the network when making a prediction. Each of the several filters in the last layer of a network such as AlexNet is sensitive to a high-level feature of the input examples. It is conceivable that one such filter could be sensitive to dog snouts, for example: this filter’s activation would be brighter near the dog-snout pixels and darker everywhere else. Just like the “edge” filter yielded a map of the image’s edges, the filters at the last layer of a trained CNN yield maps of much more complex features.


The activations of the network’s last layer (above) are finally weighted by a dense neural network to produce a final prediction. Concretely, the dense net computes the probability that the image belongs to each of the classes the model was trained on – one thousand, in the case of AlexNet. Of these one thousand classes, the top 10 for which AlexNet computes the largest probabilities upon being fed the example image are the following:

  1. Norwich terrier: 29%
  2. Wire-haired fox terrier: 21%
  3. Norfolk terrier: 8%
  4. Pembroke: 7%
  5. Lakeland terrier: 6%
  6. Australian terrier: 4%
  7. Chihuahua: 3%
  8. Collie: 3%
  9. Cardigan: 2%
  10. Irish terrier: 2%
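The last step, turning the final activations into class probabilities with a dense layer and a softmax, can be sketched as follows. The shapes and random weights are illustrative stand-ins, not AlexNet’s trained parameters:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    z = np.exp(logits - logits.max())   # shift for numerical stability
    return z / z.sum()

rng = np.random.default_rng(0)

features = rng.random(256)                          # flattened final activations (stand-in)
weights = rng.standard_normal((1000, 256)) * 0.01   # one weight row per class
biases = np.zeros(1000)

probs = softmax(weights @ features + biases)
top10 = np.argsort(probs)[::-1][:10]    # the ten most probable class indices
print(probs.sum())                      # 1.0 (up to float rounding)
```

With trained weights instead of random ones, sorting `probs` is exactly how a ranked list like the one above is produced.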

These results are not bad, considering that the image shows a mixed-breed terrier. Despite their simplicity, CNNs are extremely accurate at image classification tasks. They have also found applications in audio processing, image generation, video analysis and natural language processing, and even played a small role in defeating human champions at the board game Go. In the next post, I will delve deeper into how CNNs are implemented and how to train one.

About the author

Marcelo is a Data Scientist at Poatek. He holds a Bachelor’s Degree in Computer Science from the Federal University of Rio Grande do Sul (UFRGS) and a PhD from the same institution, with a Major in Machine Learning. Marcelo is mostly interested in the Ethics of Artificial Intelligence and neural-symbolic learning, particularly deep learning models over graphs. He is also deeply interested in creative coding and generative art (producing artworks with computer programs).