Deep learning isn’t simply a rebranding of an old technology, the perceptron, invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical Laboratory. Deep learning works better because of the extra sophistication it adds through the full use of powerful computers and the availability of better (not just more) data. Deep learning also implies a profound qualitative change in the capabilities offered by the technology, along with new and astonishing applications. These capabilities modernize old but good neural networks, transforming them into something new. The following article describes how deep learning achieves this transformation.
Adding more layers for deep learning
You may wonder why deep learning has blossomed only now when the technology at its foundation existed long ago. Computers are more powerful today, and deep learning can access huge amounts of data. However, these answers point only to part of the story: lower computing power and less data weren’t the only insurmountable obstacles. Until recently, deep learning also suffered from a key technical problem that kept neural networks from having enough layers to perform truly complex tasks.

Because it can use many layers, deep learning can solve problems that are out of reach of machine learning, such as image recognition, machine translation, and speech recognition. When fitted with only a few layers, a neural network is a universal function approximator, which is a system that can recreate any possible mathematical function. When fitted with many more layers, a neural network becomes capable of creating, inside its internal chain of matrix multiplications, a sophisticated system of representations to solve complex problems. To understand how a complex task like image recognition works, consider this process (a code sketch follows the list):
- A deep learning system trained to recognize images (such as a network capable of distinguishing photos of dogs from those featuring cats) defines internal weights that make it capable of recognizing a picture’s topic.
- After detecting each single contour and corner in the image, the deep learning network assembles all such basic traits into composite characteristic features.
- The network matches such features to an ideal representation that provides the answer.
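To make this layered division of labor concrete, here is a minimal sketch of how such a stack of layers might be declared with Keras, assuming a TensorFlow installation; the 64 x 64 input size, the filter counts, and the single cat-versus-dog output are illustrative assumptions, not the architecture of any network described in this article.

```python
# A minimal sketch of a layered image classifier (illustration only; the
# input size, filter counts, and two-class cat/dog setup are assumptions).
from tensorflow.keras import layers, models

model = models.Sequential([
    # Early layers learn simple traits such as contours and corners.
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    # Deeper layers assemble those traits into composite features.
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    # The final layers match the features to the answer: cat or dog.
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
```

Each group of layers plays the role of one step in the list above: the convolutional layers detect and combine basic traits, and the dense layers at the end match the resulting features to the final answer.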
One of the earliest achievements of deep learning that made the public aware of its potential is the cat neuron. The Google Brain team, run at that time by Andrew Ng and Jeff Dean, put together 16,000 computers to calculate a deep learning network with more than a billion weights, thus enabling unsupervised learning from YouTube videos. The computer network could even determine by itself, without any human intervention, what a cat is, and Google scientists managed to dig out of the network a representation of what the network itself expected a cat to look like (see the Wired article discussing neural networks).
During the time that scientists couldn’t stack more layers into a neural network because of the limits of computer hardware, the potential of the technology remained buried, and scientists ignored neural networks. The lack of success added to the profound skepticism that arose around the technology during the last AI winter. However, what really prevented scientists from creating something more sophisticated was the problem of vanishing gradients.

A vanishing gradient occurs when you try to transmit a signal through a neural network and the signal quickly fades to near-zero values; it can’t get through the activation functions. This happens because neural networks are chained multiplications. Each multiplication by a value below one shrinks the incoming values rapidly, and activation functions need large enough values to let the signal pass. The farther neuron layers are from the output, the higher the likelihood that they’ll get locked out of updates because the signals are too small and the activation functions will stop them. Consequently, your network stops learning as a whole, or it learns at an incredibly slow pace.
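You can see the effect in a few lines of code. The following numpy sketch chains the derivative of a sigmoid activation across a stack of layers; the layer count and the random pre-activation values are arbitrary choices made only for illustration.

```python
# An illustrative numpy experiment: a backpropagated signal shrinks when it
# passes through many sigmoid layers, because the sigmoid's derivative never
# exceeds 0.25, so every layer multiplies the gradient by a value below one.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(42)
gradient = 1.0           # pretend the output layer sends back a gradient of 1
for layer in range(30):  # thirty stacked sigmoid layers
    x = rng.normal()     # a typical pre-activation value at this layer
    gradient *= sigmoid_derivative(x)
    if layer % 5 == 4:
        print(f"after layer {layer + 1:2d}: gradient is about {gradient:.2e}")
# The printed gradient quickly falls toward zero, so the layers nearest the
# input receive almost no update signal.
```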
Every attempt at putting together and testing complex networks ended in failure because the backpropagation algorithm couldn’t update the layers nearer the input, thus rendering any learning from complex data, even when such data was available at the time, almost impossible. Today, deep networks are possible thanks to the studies of scholars from the University of Toronto in Canada, such as Geoffrey Hinton, who insisted on working on neural networks even when they seemed to most researchers to be an old-fashioned machine learning approach.

Professor Hinton, a veteran of the field of neural networks (he contributed to defining the backpropagation algorithm), and his team in Toronto devised a few methods to circumvent the problem of vanishing gradients. Their work reopened the field to new solutions that made neural networks a crucial tool in machine learning and AI again.
Professor Hinton and his team are also notable for being among the first to test GPU usage to accelerate the training of a deep neural network. In 2012, they won an open competition, organized by the pharmaceutical company Merck and Kaggle (the latter a website for data science competitions), using their most recent deep learning discoveries. This event brought great attention to their work. You can read all the details of the Hinton team’s revolutionary achievement with neural network layers in this Geoffrey Hinton interview.
Changing the activations for deep learning
Geoffrey Hinton’s team was able to add more layers to a neural architecture because of two solutions that prevented trouble with backpropagation:

- They prevented the exploding gradients problem by using smarter network initialization (a sketch of this idea follows the list). An exploding gradient differs from a vanishing gradient because it can make a network blow up as the gradient becomes too large to handle. Your network can explode unless you initialize it correctly so that it doesn’t compute large weight values. You then solve the problem of vanishing gradients by changing the network activations.
- After examining how a sigmoid activation worked, the team realized that passing a signal through various activation layers tended to damp the backpropagation signal until it became too faint to pass any farther. As the solution to this problem, they used a new activation. The choice fell on an old activation type, the ReLU, which stands for rectified linear unit. A ReLU activation stops the received signal if it is below zero, assuring the nonlinearity characteristic of neural networks, and lets the signal pass as it is if above zero. (Using this type of activation is an example of combining old but still good technology with current technology.) The image below shows how this process works.
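Before going deeper into the ReLU, here is the sketch of the smarter-initialization idea promised above. The scaling rule shown (a He-style rule that shrinks the random starting weights according to the number of incoming connections) is only one illustrative choice, not necessarily the recipe Hinton’s team used.

```python
# A sketch of "smarter" weight initialization (the He-style scaling below is
# an illustrative assumption, not a documented recipe from Hinton's team).
import numpy as np

rng = np.random.default_rng(0)

def naive_init(n_in, n_out):
    # Unscaled random weights: repeated multiplications can explode.
    return rng.normal(0.0, 1.0, size=(n_in, n_out))

def he_init(n_in, n_out):
    # Scale the spread by the number of incoming connections so the signal
    # keeps roughly the same magnitude from layer to layer.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

signal = rng.normal(size=(1, 256))
naive, smart = signal.copy(), signal.copy()
for _ in range(10):  # push the signal through ten ReLU layers
    naive = np.maximum(0, naive @ naive_init(256, 256))
    smart = np.maximum(0, smart @ he_init(256, 256))
print("naive init magnitude:", np.abs(naive).mean())   # blows up
print("smart init magnitude:", np.abs(smart).mean())   # stays moderate
```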
The ReLU worked incredibly well and let the backpropagation signal arrive at the initial deep network layers. You can verify the ReLU derivative by inspection: when the input signal is positive, the rate of change is constant and equal to one, whereas when the signal is negative, the derivative is 0, thus preventing the signal from passing.
You can calculate the ReLU function using f(x) = max(0, x). The use of this activation increased training speed a lot, allowing fast training of even deeper networks without incurring any dead neurons. A dead neuron is one that the network can't activate because the signals are too faint.
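A few lines of Python reproduce this behavior of the formula and its derivative; the sample input values are chosen only for illustration.

```python
# ReLU and its derivative, matching f(x) = max(0, x): the slope is 1 for
# positive inputs (the signal passes unchanged) and 0 for negative inputs
# (the signal is stopped).
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("relu(x)      :", relu(x))            # [0.  0.  0.  0.5 2. ]
print("d relu / d x :", relu_derivative(x)) # [0. 0. 0. 1. 1.]
```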
Adding regularization by dropout for deep learning
The other introduction to deep learning made by Hinton’s team to complete the initial deep learning solution aimed at regularizing the network. A regularized network limits the network weights, which keeps the network from memorizing the input data and helps it generalize the witnessed data patterns.

Remember, certain neurons memorize specific information and force the other neurons to rely on this stronger neuron, causing the weak neurons to give up learning anything useful themselves (a situation called co-adaptation). To prevent co-adaptation, the code temporarily switches off the activation of a random portion of neurons in the network.
As you see from the left side of the image below, the weights normally operate by multiplying their inputs into outputs for the activations. To switch off activations, the code multiplies the results by a mask made of a random mix of ones and zeros. If a neuron is multiplied by one, the network passes its signal. When a neuron is multiplied by zero, the network stops its signal, forcing other neurons not to rely on it in the process.
Dropout works only during training and doesn’t touch any part of the weights. It simply masks and hides part of the network, forcing the unmasked part to take a more active role in learning data patterns. During prediction time, dropout doesn’t operate, and the weights are numerically rescaled to account for the fact that they didn’t all work together during training.
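A short numpy sketch illustrates the masking idea described above; the layer sizes and the 0.5 keep probability are arbitrary choices for illustration, and real deep learning frameworks handle this bookkeeping for you.

```python
# A numpy sketch of dropout: during training a random mask of ones and zeros
# switches off part of the activations; at prediction time no mask is applied
# and the weights are rescaled by the keep probability instead.
import numpy as np

rng = np.random.default_rng(7)
keep_prob = 0.5                           # each neuron survives with p = 0.5

inputs = rng.normal(size=(1, 8))          # one example, eight features
weights = rng.normal(size=(8, 4))         # a single dense layer
activations = np.maximum(0, inputs @ weights)      # ReLU activations

# Training: a random mask of ones and zeros switches off some activations,
# so the remaining neurons can't lean on the masked ones.
mask = rng.binomial(1, keep_prob, size=activations.shape)
training_output = activations * mask

# Prediction: no mask; the weights are rescaled by the keep probability so
# the next layer sees signals of the same expected size as in training.
prediction_output = np.maximum(0, inputs @ (weights * keep_prob))

print("training (masked)  :", np.round(training_output, 2))
print("prediction (scaled):", np.round(prediction_output, 2))
```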