In my last post, I discussed how a convolutional neural network (CNN) learns to classify images (although it should be noted that CNNs can do much more than that) by fine-tuning a set of convolutional filters so that they become sensitive to visual features relevant to the classification task. I gave the example of a network trained to recognize (among other things) pictures of dogs and discussed how it would make sense to imagine such a network learning filters sensitive to eyes, snouts and other animal features, which could be combined to yield a classification.

It should be noted yet again that CNNs are not explicitly programmed to classify images in one way or another – they are free to learn any set of filters that solve the problem. In this sense, it is not explicitly required that CNNs learn a snout filter or an eye filter. It just makes sense that they do, since these features are relevant to the problem at hand.

## How the simplest neural networks are implemented

Armed with the intuition of how CNNs work, in this post I’ll dive into how they are actually implemented in practice. But before we do that, we must understand something more basic, which is how the simplest neural networks are implemented.

In the image above, we have a set of circles connected by arrows. In the neuroscience analogy, the circles are meant to represent artificial neurons and the arrows are meant to represent artificial neural connections. The modern process of training artificial neural networks is inspired by the Hebbian theory of neuroscience, which can be summarized as “neurons that fire together wire together”. Or, in other words, that the brain adjusts the strengths of the neural connections proportionately to how similarly they respond to the same stimuli.

From the standpoint of a computer scientist, the diagram above can be interpreted as a mathematical model. In this model, the yellow circles correspond to the inputs and the green circles correspond to the outputs. The neural weights in blue, purple and green can be treated as parameters for the model. For each possible different combination of parameters, we obtain a different procedure for mapping inputs to outputs. Computationally, each blue neuron receives the sum of all of its inputs, each one of them multiplied by the weight of the corresponding connection. If we have access to a large number of correct input/output examples, in principle we can adapt the parameters’ values in such a way as to minimize the error between the outputs our model computes and the correct outputs, for the corresponding inputs.

## When your inputs are images

The problem is that when your inputs are images, the number of neurons in the input layer becomes very large (10 thousand neurons for a 100×100 image), in such a way that solving this task with a simple neural network architecture becomes very difficult. Think about it: if we have 10 thousand neurons in the first layer and, let’s say, 500 neurons in the second layer, we would have to adjust the values of 10000 x 500 = 5 million parameters. What CNNs try to do is to drastically reduce this number of parameters using an insight called parameter sharing. Instead of connecting every input neuron to every neuron in the following layer, we will connect each input neuron to just a few neurons in the following layer.

We are talking about images, which are 2D signals, but it’s much easier to visualize the architecture for 1D signals.

Suppose you have an audio signal spanning 12 timesteps. It could be described by 12 inputs (X1, X2, …, X12), and suppose you want to compute 4 outputs (Y1, Y2, Y3, Y4) from these inputs. Instead of connecting every input neuron with every output neuron, you could connect every 3 input neurons to an output neuron. Overall you would have to adjust the parameters of 12 connections (as illustrated above). Parameter sharing cuts this number 4-fold by adjusting just 3 parameters and sharing, or repeating them many times. In the diagram above, all red arrows share the same value. The same goes for green and blue arrows.

Of course, this type of approach does not work for all types of problems (otherwise CNNs would be used much more often). In some cases, you may want to combine the values of distant inputs such as X1 and X12 in the same computation. But for image classification tasks this is seldom the case, as we have discussed in the previous post.

Note that in the context of images, adjusting the weights of the convolutional architecture corresponds to learning one of the convolutional filters discussed in the previous post: one would compute a new pixel value for each 3×3 window of pixels by multiplying each of them by weight and adding everything up.

In the next post, I will delve deeper into how CNNs are implemented. 