Choosing the right activation function for your neural network
An activation function is the part of a neural network that simply defines when a neuron fires. Consider it a sort of tipping point: Input of a certain value won’t cause the neuron to fire because it’s not enough, but just a little more input can cause the neuron to fire. A neuron is defined in a simple manner as follows:y = ∑ (weight * input) + bias
The output, y
, can be any value between + infinity and – infinity. The problem, then, is to decide on what value of y
is the firing value, which is where an activation function comes into play in your neural network. The activation function determines which value is high or low enough to reflect a decision point in the neural network for a particular neuron or group of neurons.
As with everything else in neural networks, you don't have just one activation function. You use the activation function that works best in a particular scenario. With this in mind, you can break the activation functions into these categories:
- Step: A step function (also called a binary function) relies on a specific threshold for making the decision about activating or not. Using a step function means that you know which specific value will cause an activation. However, step functions are limited in that they’re either fully activated or fully deactivated —no shades of gray exist. Consequently, when attempting to determine which class is most likely correct based in a given input, a step function won’t work.
- Linear: A linear function (
A = cx
) provides a straight-line determination of activation based on input. Using a linear function helps you determine which output to activate based on which output is most correct (as expressed by weighting). However, linear functions work only as a single layer. If you were to stack multiple linear function layers, the output would be the same as using a single layer, which defeats the purpose of using neural networks. Consequently, a linear function may appear as a single layer, but never as multiple layers. - Sigmoid: A sigmoid function (
A = 1 / 1 + e-x
), which produces a curve shaped like the letter C or S, is nonlinear. It begins by looking sort of like the step function, except that the values between two points actually exist on a curve, which means that you can stack sigmoid functions to perform classification with multiple outputs. The range of a sigmoid function is between 0 and 1, not – infinity to + infinity as with a linear function, so the activations are bound within a specific range. However, the sigmoid function suffers from a problem called vanishing gradient, which means that the function refuses to learn after a certain point because the propagated error shrinks to zero as it approaches far away layers. - Tanh: A tanh function (
A = (2 / 1 + e-2x) – 1
) is actually a scaled sigmoid function. It has a range of –1 to 1, so again, it's a precise method for activating neurons. The big difference between sigmoid functions and tanh functions is that the tanh function gradient is stronger, which means that detecting small differences is easier, making classification more sensitive. Like the sigmoid function, tanh suffers from vanishing gradient issues. - ReLU: A ReLU, or Rectified Linear Units, function (
A(x) = max(0, x)
) provides an output in the range of 0 to infinity, so it’s similar to the linear function except that it’s also nonlinear, enabling you to stack ReLU functions. An advantage of ReLU is that it requires less processing power because fewer neurons fire. The lack of activity as the neuron approaches the 0 part of the line means that there are fewer potential outputs to look at. However, this advantage can also become a disadvantage when you have a problem called the dying ReLU. After a while, the neural network weights don’t provide the desired effect any longer (it simply stops learning) and the affected neurons die — they don’t respond to any input.
- ELU (Exponential Linear Unit): Differs from ReLU when the inputs are negative. In this case, the outputs don’t go to zero but instead slowly decrease to –1 exponentially.
- PReLU (Parametric Rectified Linear Unit): Differs from ReLU when the inputs are negative. In this case, the output is a linear function whose parameters are learned using the same technique as any other parameters of the network.
- LeakyReLU: Similar to PReLU but the parameter for the linear side is fixed.
Relying on a smart optimizer for your neural network
An optimizer serves to ensure that your neural network performs fast and correctly models whatever problem you want to solve by modifying the neural network’s biases and weights (see this article for more on improving your machine learning models). It turns out that an algorithm performs this task, but you must choose the correct algorithm to obtain the results you expect. As with all neural network scenarios, you have a number of optional algorithm types from which to choose:Stochastic gradient descent (SGD)
- RMSProp
- AdaGrad
- AdaDelta
- AMSGrad
- Adam and its variants, Adamax and Nadam
You can categorize the optimizer functions by the manner in which they deal with the derivative (dy/dx), which is the instantaneous change of y with respect to x. Here are the two levels of derivative handling:
- First order: These algorithms minimize or maximize the objective function using gradient values with respect to the parameters.
- Second order: These algorithms minimize or maximize the object function using the second-order derivative values with respect to the parameters. The second-order derivative can give a hint as to whether the first-order derivative is increasing or decreasing, which provides information about the curvature of the line.
Setting a working learning rate in your neural network
Each optimizer has completely different parameters to tune in your neural network. One constant is fixing the learning rate, which represents the rate at which the code updates the network’s weights (such as thealpha
parameter). The learning rate can affect both the time the neural network takes to learn a good solution (the number of epochs) and the result. In fact, if the learning rate is too low, your network will take forever to learn. Setting the value too high causes instability when updating the weights, and the network won't ever converge to a good solution.Choosing a learning rate that works and training your neural network is daunting because you can effectively try values in the range from 0.000001 to 100. The best value varies from optimizer to optimizer. The value you choose depends on what type of data you have. Theory can be of little help here; you have to test different combinations before finding the most suitable learning rate for training your neural network successfully.
In spite of all the math surrounding them, tuning neural networks and having them work best is mostly a matter of empirical efforts in trying different combinations of architectures and parameters.
Take the time to evaluate the learning rate and set it appropriately to ensure your neural network functions optimally.