An artificial neural network is a system of hardware and/or software loosely patterned after the way neurons operate in the human brain. Convolutional neural networks (CNNs) are a variation on multilayer perceptrons designed chiefly for visual inputs such as images. They typically stack multiple convolutional layers, interleaved with pooling layers and followed by fully connected layers.
CNNs learn in a way that is loosely analogous to how humans do. People are born without knowing what a cat or a bird looks like. As we mature, we learn that certain shapes and colors correspond to parts, such as paws and beaks, that together make up a whole object. Once we learn what paws and beaks look like, we’re better able to differentiate between a cat and a bird.
Neural networks essentially work the same way. By processing training sets of labeled images, the machine is able to learn to identify elements that are characteristic of objects within the images.
A CNN is one of the most popular types of deep learning algorithms. Convolution is the application of a filter to an input, which results in an activation represented as a numerical value. Applying the same filter at every position across an image produces a map of activations called a feature map, which indicates the locations and strengths of the detected features.
A convolution is a linear operation in which a small two-dimensional array of weights, called a filter (or kernel), is multiplied element by element with a patch of the input, and the products are summed into a single value. If the filter is tuned to detect a specific type of feature in the input, then sliding that filter across the entire input image can discover that feature anywhere in the image.
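The sliding multiply-and-sum described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how production libraries implement it: the function name convolve2d, the 4x4 input, and the 2x2 filter are all made up for the example, and it assumes no padding and a stride of one. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Multiply the filter with the patch under it, then sum the products.
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

# Hypothetical 4x4 "image" and 2x2 filter, chosen only for illustration.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3): one activation per filter position
```

Each entry of feature_map is the activation produced at one position of the filter, which is why the output is slightly smaller than the input when no padding is used.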
For example, one filter may be designed to detect curves of a certain shape, another to detect vertical lines, and a third to detect horizontal lines. Other filters may detect colors, edges, and degrees of light intensity. Combining the outputs of multiple filters can reveal complex shapes that match known elements in the training data.
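To make the idea of feature-specific filters concrete, the sketch below applies a vertical-edge filter and a horizontal-edge filter to the same toy image. The filter values are hand-picked (Prewitt-style) purely for illustration; in a trained CNN the weights are learned from data rather than designed by hand. The helper name apply_filter and the 6x6 image are assumptions made for this example.

```python
import numpy as np

def apply_filter(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    rows = image.shape[0] - kh + 1
    cols = image.shape[1] - kw + 1
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Toy image: dark left half, bright right half, so it contains one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked filters: one responds to left-to-right changes, the other to top-to-bottom.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])
horizontal_edge = vertical_edge.T

vertical_response = apply_filter(image, vertical_edge)
horizontal_response = apply_filter(image, horizontal_edge)

print(vertical_response)    # nonzero only in the columns where the edge sits
print(horizontal_response)  # all zeros: the image has no horizontal edge
```

The vertical-edge filter responds only around the boundary between the dark and bright halves, while the horizontal-edge filter stays silent, which is the sense in which a feature map records both where a feature occurs and how strongly.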
A CNN usually consists of three kinds of layers: 1) an input layer, 2) an output layer, and 3) a series of hidden layers between them. The hidden layers include multiple convolutional layers along with pooling layers, fully connected layers, and normalization layers.
The first convolutional layer is typically devoted to capturing basic features such as edges, color, gradient orientation, and basic geometric shapes. As layers are added, the model builds up increasingly high-level features, so that a large brown blob can be recognized first as a vehicle, then as a car, and then as a Buick.
The pooling layer progressively reduces the spatial size of the representation for more efficient computation. It operates on each feature map independently. A common approach is max pooling, in which the feature map is divided into small windows and only the maximum value in each window is kept, reducing the number of values passed on for calculation. Stacking convolutional layers allows the input to be decomposed into its fundamental elements.
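Max pooling is simple enough to sketch directly. The example below assumes non-overlapping 2x2 windows (a common choice, though not the only one) and drops any leftover rows or columns that do not fill a whole window. The function name max_pool2d and the sample values are invented for illustration.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop trailing rows/columns that do not fill a complete window.
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Reshape so that axes 1 and 3 index positions inside each window.
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))

# Hypothetical 4x4 feature map.
feature_map = np.array([[1.0, 3.0, 2.0, 0.0],
                        [4.0, 2.0, 1.0, 5.0],
                        [0.0, 1.0, 7.0, 2.0],
                        [6.0, 2.0, 3.0, 4.0]])

pooled = max_pool2d(feature_map)
print(pooled)
# [[4. 5.]
#  [6. 7.]]
```

The 4x4 map shrinks to 2x2, so later layers handle a quarter as many values while the strongest activation in each region is preserved.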
Normalization layers rescale the data flowing through the network to improve the performance and stability of training. Normalization makes the inputs of each layer more manageable by shifting and scaling them so that they have a mean of approximately zero and a variance of approximately one.
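As a rough illustration of the rescaling step, the sketch below normalizes each feature (column) of a small batch to zero mean and unit variance. It is a simplification: normalization layers used in practice, such as batch normalization, typically also apply a learned scale and shift afterward, which is omitted here. The function name normalize, the eps value, and the sample batch are assumptions for the example.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Rescale each column of `x` to roughly zero mean and unit variance."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    # eps guards against division by zero when a column is constant.
    return (x - mean) / np.sqrt(var + eps)

# Hypothetical batch of 3 samples with 2 features on very different scales.
batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

normalized = normalize(batch)
print(normalized.mean(axis=0))  # approximately 0 for each feature
print(normalized.var(axis=0))   # approximately 1 for each feature
```

After normalization both features occupy the same range even though the raw values differed by two orders of magnitude, which is what makes the inputs to the next layer easier to work with.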
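Fully connected layers connect every neuron in one layer to every neuron in the next layer. In a CNN they usually sit near the output, where they combine the features extracted by the earlier layers into the final classification.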
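A fully connected layer amounts to a matrix multiplication plus a bias: each output neuron computes a weighted sum over all of the inputs. The sketch below assumes four input features (for instance, a flattened 2x2 pooled feature map) and three output neurons; the weights are hand-written for readability, whereas a real network would learn them during training. The function name fully_connected is hypothetical.

```python
import numpy as np

def fully_connected(x, weights, bias):
    """Every output is a weighted sum of all inputs, plus a bias term."""
    return weights @ x + bias

# Hypothetical inputs: four features, e.g. a flattened 2x2 pooled feature map.
features = np.array([4.0, 5.0, 6.0, 7.0])

# One row of weights per output neuron; each row spans every input feature.
weights = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.25, 0.25, 0.25, 0.25]])
bias = np.zeros(3)

scores = fully_connected(features, weights, bias)
print(scores)  # [4.  5.  5.5]
```

Because the weight matrix has one entry for every input-output pair, the number of parameters grows with the product of the two layer sizes, which is one reason pooling is used to shrink feature maps before they reach these layers.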