Updated: Apr 10, 2020
Activation functions are mathematical equations that determine the output of a neural network. An activation function is attached to each neuron and determines whether that neuron should fire, based on how relevant its input is to the network’s output. It also helps normalize the neuron’s output. In addition, the activation function should be computationally efficient: it is evaluated in hundreds of neurons, so it is important to keep the computational strain to a minimum.
Each neuron in the neural network receives an input, which is combined with the weights and the bias to produce a value X. This value X is fed into the activation function F(x), and the output of the activation function is fed into the next layer of neurons.
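The flow described above can be sketched in a few lines of Python. This is only an illustration: the names neuron_output, inputs, weights and bias are made up for this example, and sigmoid is used as a stand-in for F(x).

```python
import math

def sigmoid(x: float) -> float:
    """Example activation F(x); any other activation could be swapped in."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    x = sum(w * i for w, i in zip(weights, inputs)) + bias
    return activation(x)

# Hypothetical values: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, so F(0.0) is computed.
out = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(out)
```

The value in out is what the next layer of neurons would receive as one of its inputs.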
The neural network needs an activation function because it introduces non-linearity. The resulting function may be complex, but its complexity gives the network a better chance of learning complex functional mappings from the data.
Without activation functions, the neural network is just a linear regression model, which has limited power and often gives unsatisfactory results.
There are 3 classes of activation functions:
Binary step function
A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer, no matter how far above the threshold the input is. Otherwise the neuron is not activated.
The issue with the binary step function is that it does not allow multi-value outputs. It cannot support classifying inputs into several categories.
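A minimal sketch of a binary step function, assuming a threshold of 0 and outputs of 0 and 1 (both are choices made for this example, not values given in the article):

```python
def binary_step(x: float, threshold: float = 0.0) -> int:
    """Return 1 (neuron fires) at or above the threshold, otherwise 0."""
    return 1 if x >= threshold else 0

# Inputs just above and far above the threshold give the same signal,
# which is why the function cannot express more than two output values.
print(binary_step(0.3), binary_step(30.0), binary_step(-0.1))
```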
Linear activation function
A linear activation function takes the input after it has been multiplied by the weights and converts it to a signal proportional to that input. Unlike the binary step function, it can produce a range of output values.
There are two major drawbacks of the linear function:
Backpropagation (gradient descent) is of little use with it. The derivative of a linear function is a constant that does not depend on the input X, so the gradient of the activation carries no information about the input and cannot guide the weight updates toward a better prediction.
All layers of the neural network become insignificant and collapse into one layer. Each neuron produces a linear output, so no matter how many layers there are, the final layer is a linear function of the first layer. The network is therefore equivalent to a single layer.
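The collapse into one layer can be checked with a small sketch. It uses scalar weights for brevity; w1, b1, w2 and b2 are arbitrary values picked for this example.

```python
def linear_layer(x: float, w: float, b: float) -> float:
    """A layer with a linear (identity) activation: output = w * x + b."""
    return w * x + b

# Two stacked linear layers with arbitrary example parameters.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x: float) -> float:
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# Expanding w2 * (w1 * x + b1) + b2 gives one equivalent layer.
w_eq = w2 * w1
b_eq = w2 * b1 + b2

def one_layer(x: float) -> float:
    return linear_layer(x, w_eq, b_eq)

for x in (-2.0, 0.0, 3.5):
    print(x, two_layers(x), one_layer(x))
```

The same algebra holds for weight matrices, which is why adding more linear layers adds no expressive power.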
Non-linear activation function
Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data. With enough neurons, almost any process that can be expressed as a functional computation can be approximated by a neural network using non-linear activation functions.
They allow backpropagation because their derivative depends on the input, so the gradients can adjust the weights to change the prediction. They also make it worthwhile to stack multiple hidden layers.
TYPES OF NON LINEAR ACTIVATION FUNCTIONS
The sigmoid function provides a smooth gradient and prevents jumps in the output. It normalises the input value to the range between 0 and 1. It also gives clear predictions: once the input goes above about 2 or below about -2, the output closes in on 1 or 0 respectively. The downside is that for large positive or negative inputs the output barely changes as the input changes, so the gradient becomes very small. This is the vanishing gradient problem. The sigmoid is also computationally expensive.
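A minimal sketch of the sigmoid and its derivative, assuming the standard definition 1 / (1 + e^(-x)). The derivative shrinks quickly as the input moves away from zero, which is the vanishing gradient described above.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and fades toward zero on both sides.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```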
The TanH function is zero-centered, which is its one advantage over the sigmoid function. In all other respects it is similar to the sigmoid function.
ReLU (Rectified Linear Unit)
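The similarity can be made concrete with a short sketch: TanH is a rescaled and shifted sigmoid, tanh(x) = 2 * sigmoid(2x) - 1, so its outputs lie between -1 and 1 and are centered on zero. The helper names below are chosen for this example only.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2. It also vanishes for large |x|."""
    t = math.tanh(x)
    return 1.0 - t * t

x = 0.7
print(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)  # the two values agree
print(math.tanh(-x))                               # mirror image of tanh(x)
```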
One of the biggest advantages of the ReLU function is that it is computationally efficient. Although it looks like a linear function, it is not: its derivative depends on the input, so it allows backpropagation. This beautiful function does have a drawback, though. When the input is zero or negative, the gradient becomes zero, so backpropagation cannot update the neuron and it stops learning (the dying ReLU problem).
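A minimal sketch of ReLU and its gradient, assuming the usual definition max(0, x) and the common convention of treating the gradient at exactly 0 as 0:

```python
def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp everything else to 0."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (convention at 0)."""
    return 1.0 if x > 0 else 0.0

# Negative inputs get a zero gradient, so a neuron stuck there stops learning.
for x in (-3.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x))
```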
Leaky ReLU prevents the dying ReLU problem. It has a small positive slope in the negative region, so it enables backpropagation even for negative input values. However, it does not provide consistent predictions for negative input values.
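A minimal sketch of Leaky ReLU. The slope of 0.01 for negative inputs is a commonly used default and an assumption of this example, not a value given in the article.

```python
def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: identity for positive inputs, small slope for the rest."""
    return x if x > 0 else negative_slope * x

def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Gradient is never exactly zero, so negative inputs can still learn."""
    return 1.0 if x > 0 else negative_slope

for x in (-2.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```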
Parametric ReLU lets the model learn the most appropriate slope for the negative side instead of fixing it in advance. This provides a better backpropagation and can increase the accuracy of predictions.
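How the slope can be learned is easiest to see in a toy sketch. Everything here is hypothetical: the data points, the target slope of 0.2, the learning rate and the squared-error loss are picked only to show the slope parameter (called alpha below) being updated by gradient descent.

```python
def prelu(x: float, alpha: float) -> float:
    """Parametric ReLU: like Leaky ReLU, but alpha is a learned parameter."""
    return x if x > 0 else alpha * x

def prelu_grad_alpha(x: float) -> float:
    """Derivative of the output with respect to alpha: x for x <= 0, else 0."""
    return 0.0 if x > 0 else x

alpha = 0.01          # initial slope (assumed starting value)
target_slope = 0.2    # hypothetical slope that fits the toy data best
learning_rate = 0.1
data = [-2.0, -1.0, -0.5, 1.0]

for _ in range(200):
    for x in data:
        target = x if x > 0 else target_slope * x
        error = prelu(x, alpha) - target
        # Gradient of the squared error with respect to alpha (chain rule).
        alpha -= learning_rate * 2.0 * error * prelu_grad_alpha(x)

print(alpha)  # ends up very close to target_slope
```

In a real network alpha would be updated by the same optimizer as the weights, using the gradient that flows back from the loss.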
The softmax function is able to handle multiple classes, in contrast to other activation functions that handle only one. It exponentiates the output for each class and divides by the sum of those values, so each result lies between 0 and 1 and together they add up to 1, giving the probability of the input belonging to a specific class. Softmax is generally used in the output layer of a neural network to get a probabilistic output and classify the inputs into multiple classes.
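A minimal sketch of softmax over a list of class scores. Subtracting the largest score before exponentiating is a common numerical-stability trick that does not change the result; the example scores are made up.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(probs, sum(probs))
```

The class with the highest score gets the highest probability, and the ordering of the scores is preserved.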
Swish is a relatively new activation function developed by researchers at Google. It is reported to perform better than ReLU at a very similar computational cost.
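Swish, the function from the Google researchers, is usually written as x * sigmoid(x). A minimal sketch under that assumption (the optional beta scaling found in some variants is left out):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def swish(x: float) -> float:
    """Swish as commonly defined: the input scaled by its own sigmoid."""
    return x * sigmoid(x)

# Behaves like ReLU for large positive inputs, but is smooth and lets
# small negative values through instead of clamping them to zero.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, swish(x))
```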
WHICH FUNCTION TO USE AND WHY?
Knowing the different activation functions and their strengths and weaknesses is all cool, but the main question is: which function will be the best for my model?
All the given functions have their advantages and drawbacks.
The ReLU function is an important function, but it should only be used within the hidden layers of a neural network. Another problem with ReLU is that some neurons can be fragile during training and can die: a large weight update can leave a neuron in a state where it never activates on any data point again. Simply put, ReLU can result in dead neurons. To fix this problem we use leaky ReLU.
The softmax function gives probability-related outputs and is best suited to output layers for generating useful results.
Simply use linear functions in the output layer of your regression models.
These are the functions most commonly used in neural networks. Feel free to try the other functions and compare the results if you wanna learn and have the luxury of time.
For any doubts and suggestions kindly hit our mailbox. We're currently learning and expanding our resource of knowledge and making our best efforts to reach out to the world to learn together. We’ll try our best to help you out. We’re ‘supposedly’ very open to suggestions:) Jus’ kidding!