Unlocking the Power of ReLU: Why It Reigns Supreme Over Sigmoid and Tanh

Rishav Saha
8 min read · Mar 10, 2023

As you peruse this blog, I'm assuming you have some general knowledge of perceptrons, multilayer perceptrons, and neural networks, so the phrase "activation function" has probably already come up. I won't go into detail about what an activation function is here; the main goal of this blog is to explain why, mathematically speaking, most recent neural networks use ReLU rather than sigmoid or tanh.

In the 1980s and 1990s, researchers mostly used sigmoid or tanh for perceptrons and multilayer perceptrons, but then there was a paradigm shift and ReLU became the usual activation function in most neural networks. Before we can understand why this happened, we must first understand the sigmoid and tanh activation functions.

Sigmoid Activation Function

Let us have a perceptron:

single level perceptron

The activation function 'f' being used in this instance is actually the sigmoid activation function, denoted by σ and defined as:

σ(z) = 1 / (1 + e^(-z))

The derivative of the sigmoid function can be defined as:

σ'(z) = σ(z) * (1 - σ(z))

A fun fact, therefore: the derivative of the sigmoid can be computed using the sigmoid itself. As a result, the σ(z) values computed during forward propagation can be reused during backpropagation.

Now let’s understand the sigmoid function in a graphical manner:

sigmoid function graph

From the graph, we can observe that the sigmoid squashes any value of z into the range (0, 1): it approaches 1 for large positive z and 0 for large negative z.

Min value of σ = 0

Max value of σ = 1

Now let's look at what the derivative of the sigmoid function looks like:

Graph of Derivative of Sigmoid. Point A here is 0.25

The derivative of a sigmoid can have a maximum value of 0.25.

The sigmoid function is popular because its derivative can be written as a function of sigmoid only.
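To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the original post) of the sigmoid and its derivative; it confirms the bounds described above:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)), written using sigmoid itself
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10, 10, 1001)
print(sigmoid(z).min(), sigmoid(z).max())   # close to 0 and close to 1
print(sigmoid_derivative(z).max())          # 0.25, reached at z = 0
```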

Tanh Activation Function

Tanh is another activation function that gained popularity in the 1990s, much like the sigmoid function did. The tanh function is defined as:

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

The derivative of the tanh activation function can be defined as:

tanh'(z) = 1 - tanh^2(z)

The similarity between tanh and sigmoid is that, just as with sigmoid, the derivative of tanh can be expressed using the tanh function itself.

Let’s look at the graph of the tanh function and its derivative:

graph of tanh function

The derivative of the tanh function can have a maximum value of 1.
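A similar sketch (again my own illustration, not from the post) for tanh and its derivative shows the derivative peaking at 1:

```python
import numpy as np

def tanh(z):
    # tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)); np.tanh computes exactly this
    return np.tanh(z)

def tanh_derivative(z):
    # tanh'(z) = 1 - tanh^2(z), again expressible via tanh itself
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-5, 5, 1001)
print(tanh(z).min(), tanh(z).max())   # close to -1 and close to 1
print(tanh_derivative(z).max())       # 1.0, reached at z = 0
```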

Because of a problem known as the "vanishing gradient problem," neither of these activation functions is very popular in today's deep learning world. In addition to this issue, there is another one, known as the "exploding gradient problem," that, depending on the situation, can also affect the sigmoid activation function.

Vanishing Gradient Problem

The vanishing gradient problem is one of the biggest problems neural networks faced in the 1980s and 1990s, and I will try to explain it with a mathematical approach. Before diving in, it's recommended that you understand partial derivatives and the chain rule for partial derivatives.

Let’s consider a neural network:

O_{ij}: the output of function f_{ij}, and W denotes a weight (the subscript 11 denotes "from input 1 to function 1" and the superscript 1 denotes layer 1)

Suppose we want to update the weight W_{11}^{1}, the only weight shown in the diagram above. The update rule is:

W_{11}^{1}(new) = W_{11}^{1}(old) - η * ∂L/∂W_{11}^{1}    … (equation 1)

where η is the learning rate and L is the loss.

Calculating the derivative in this equation is the most challenging part and is computationally expensive.

Using the chain rule, we can write the derivative as:

∂L/∂W_{11}^{1} = ∂L/∂O_{31} * [ (∂O_{31}/∂O_{21} * ∂O_{21}/∂O_{11} * ∂O_{11}/∂W_{11}^{1}) + (∂O_{31}/∂O_{22} * ∂O_{22}/∂O_{11} * ∂O_{11}/∂W_{11}^{1}) ]    … (equation 2)

We can also say that:

∂O_{31}/∂O_{21} = ∂f_{31}/∂O_{21}    … (equation 3)

This holds because O_{31} is simply the output of f_{31} (a sigmoid), so differentiating O_{31} means differentiating the sigmoid, and the same goes for the other factors inside the square brackets.

To understand the vanishing gradient problem, we need to keep one fact in mind: the derivative of the sigmoid function can never be greater than 0.25. So in equation 2, every factor inside the square brackets will be less than or equal to 0.25.

Let's consider the first part inside the square brackets of equation 2, i.e.

∂O_{31}/∂O_{21} * ∂O_{21}/∂O_{11} * ∂O_{11}/∂W_{11}^{1}

Let's assign a value to each of these derivatives, each of which will be less than or equal to 0.25, since the derivative of the sigmoid has a maximum value of 0.25. Say they are 0.2, 0.1, and 0.05 respectively. The product of the three will then be:

0.2 * 0.1 * 0.05 = 0.001

Similarly, if we consider the second part inside the square brackets of equation 2, it will also be a very small value. As a result, the overall derivative will be very small.

If

∂L/∂W_{11}^{1} ≈ 0,

then, from equation 1,

W_{11}^{1}(new) ≈ W_{11}^{1}(old).

The new value of the weight is very, very close to the old one, and this is the result for just a 3-layer neural network. If we use a 10-layer neural network, the difference will be much, much smaller still, and yet we are spending a lot of computational power to calculate this derivative.

This phenomenon of the gradients becoming smaller and smaller is called the

Vanishing Gradient Problem.
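To get a feel for how quickly this shrinking happens, here is a small illustrative sketch (the per-layer factors are made up; only their upper bound of 0.25 comes from the sigmoid derivative):

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_factor_product(num_layers):
    # Each layer contributes a factor <= 0.25 (the max of the sigmoid derivative);
    # the gradient reaching the first layer is roughly the product of these factors.
    factors = rng.uniform(0.0, 0.25, size=num_layers)
    return np.prod(factors)

for layers in (3, 10, 30):
    grad = backprop_factor_product(layers)
    print(f"{layers:2d} layers -> gradient magnitude ~ {grad:.3e}")
# With 0.2 * 0.1 * 0.05 = 0.001 for 3 layers, W_new barely moves away from W_old;
# with 10+ layers the product becomes astronomically smaller.
```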

Exploding Gradient Problem

The name "exploding gradient problem" makes a lot of sense now that we understand what the "vanishing gradient problem" is. An exploding gradient problem occurs when the derivative values are very large, so the updated weights differ drastically from the previous ones. The beauty of mathematics is that even though this issue doesn't usually arise with the sigmoid activation function, it can in certain circumstances.

Say we want to update one of the weights of our neural network. The equation to update the weight is again:

W(new) = W(old) - η * ∂L/∂W

What happens in the exploding gradient problem is that the derivative values become very large (this happens when each of the factors shown in equation 2 is greater than 1, so their product becomes huge). As a result, the weight values at each epoch will differ by a huge margin and we will never be able to converge. In case you don't know what convergence means, it's the situation where the new and old weight values are almost identical, at which point we stop iterating.
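A mirror-image sketch (again with made-up factors, this time all greater than 1) shows the update blowing up instead of vanishing:

```python
import numpy as np

rng = np.random.default_rng(1)

def exploding_factor_product(num_layers):
    # If each factor in the chain rule is greater than 1 (for example because
    # the weights are large), the product grows exponentially with depth.
    factors = rng.uniform(1.5, 3.0, size=num_layers)
    return np.prod(factors)

w_old, eta = 0.5, 0.01
for layers in (3, 10, 30):
    grad = exploding_factor_product(layers)
    w_new = w_old - eta * grad
    print(f"{layers:2d} layers -> gradient ~ {grad:.3e}, W_new ~ {w_new:.3e}")
# The weight jumps around wildly from epoch to epoch, so training never converges.
```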

ReLU (Rectified Linear Unit, or Rectifier)

The most fascinating portion of this blog post is about to begin. We now know how the sigmoid and tanh activations work and why they are not the best options. Let's discuss what ReLU is and how it circumvents the vanishing and exploding gradient problems.

Let us have a neural network:

The activation function f(z) here is actually the ReLU activation function, and we will denote it as f_relu(z).

The ReLU activation function is defined as:

f_relu(z) = max(0, z)

We can also write it piecewise as:

f_relu(z) = z when z > 0, and f_relu(z) = 0 when z ≤ 0

The graphical representation of ReLU is:

graph of ReLU activation function

One important thing to note is that the angle formed by the function with the X-axis in the positive quadrant is 45°, because it passes through points like (1, 1), (2, 2), and so on.

Keeping this in mind, let's differentiate the function. The derivative of a straight line is its slope, i.e., the tangent of the angle it makes with the X-axis. So for f_relu(z), whenever z < 0 the derivative is 0, since tan 0° = 0 (the function makes an angle of 0 with the X-axis), and whenever z > 0 the derivative is 1, since tan 45° = 1.

f'_relu(z) = 1 when z > 0, and f'_relu(z) = 0 when z < 0
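To sanity-check this, here is a minimal NumPy sketch (my own illustration, not from the post) of ReLU and its derivative; it shows the derivative taking only the values 0 and 1:

```python
import numpy as np

def relu(z):
    # f_relu(z) = max(0, z)
    return np.maximum(0.0, z)

def relu_derivative(z):
    # 1 where z > 0, 0 where z < 0 (the kink at z = 0 is conventionally set to 0)
    return (z > 0).astype(float)

z = np.linspace(-5, 5, 11)
print(relu(z))
print(relu_derivative(z))   # only 0s and 1s, so gradients neither vanish nor explode
```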

With sigmoid and tanh, the issues were vanishing and exploding gradients, where the derivative values were much smaller or much larger than 1. For ReLU, however, the derivative is either 0 or 1, so the vanishing and exploding gradient problems essentially never arise. There is still one catch: if one of the derivatives in the chain becomes 0, the whole gradient becomes 0 and the updated weight stays the same as the old one. Even though it doesn't happen frequently, this is known as the dead activation problem. However, even in that scenario, we have a hack called Leaky ReLU.

The idea of Leaky ReLU is very ingenious. In normal ReLU, the value of f_relu(z) when z < 0 is 0; in Leaky ReLU, instead of making it 0, the output for z < 0 is a*z, where a is a very, very small number (e.g., 0.001). By multiplying by this small constant we avoid the dead activation problem, but since the derivative for z < 0 is now the tiny value a, we might run back into the vanishing gradient problem. So yes, the tradeoff is there.
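Here is a matching sketch of Leaky ReLU (again my own illustration; the slope 0.001 is just the example value from the paragraph above):

```python
import numpy as np

def leaky_relu(z, a=0.001):
    # For z > 0 behave like ReLU; for z <= 0 return a * z instead of 0,
    # so the gradient is a small nonzero value rather than exactly 0.
    return np.where(z > 0, z, a * z)

def leaky_relu_derivative(z, a=0.001):
    return np.where(z > 0, 1.0, a)

z = np.array([-3.0, -1.0, 0.0, 2.0, 4.0])
print(leaky_relu(z))              # negative inputs are scaled by a, not zeroed out
print(leaky_relu_derivative(z))   # no dead units, but the small slope a can still shrink gradients
```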

If you enjoyed this article, share it with your friends and colleagues, and do comment!


Rishav Saha

I am Rishav, a first-year Master's student at the Indian Institute of Science, Bangalore. I love to explore technology.