Gradient Descent - Geometrical Interpretation
One of the most critical tasks in Computer Science, Machine Learning, and computation in general is to optimize solutions. It is not only important but was also one of the most challenging tasks until the arrival of modern computers. Gradient Descent is one such computational technique that helps us optimize problems, and the good thing is that it's quite easy to understand and get the intuition of :)
Pre-requisites
This time I won't say that there are no prerequisites and that you can learn this topic without prior knowledge. To understand the underlying mathematics of Gradient Descent, you need a good knowledge of -
- Differentiation and Integration
- Maxima and Minima
- Vector Differentiation (just a bit of understanding would work)
- Your time and interest :D
Why do we need Gradient Descent?
Let’s take the logistic loss expression with regularization for this example-
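A standard way to write this (a sketch, assuming an L2 regularizer with strength λ over n training points (x_i, y_i); the exact scaling of the regularizer may differ from the original figure) is:

$$W^* = \text{argmin}_{W}\; \sum_{i=1}^{n} \log\left(1 + e^{-y_i W^T x_i}\right) + \frac{\lambda}{2}\, W^T W$$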
From the equation, we can say that x_i and y_i are constants because these values already come from the dataset. So L here is a function of W, i.e., W is the variable we need to find, and we can rewrite the equation as-
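Treating x_i, y_i and λ as fixed, the same expression viewed purely as a function of W looks roughly like:

$$L(W) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i W^T x_i}\right) + \frac{\lambda}{2}\, W^T W$$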
Now, to calculate the maxima or minima we need to compute the vector derivative of L w.r.t. W (because W is a vector here). For simplicity, let's split the expression into two parts:
part 1 of the expression: the summation (loss) term, and
part 2 of the expression: the regularization term.
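With the form sketched above, the two parts would be:

$$\text{part 1:}\; \sum_{i=1}^{n} \log\left(1 + e^{-y_i W^T x_i}\right) \qquad\qquad \text{part 2:}\; \frac{\lambda}{2}\, W^T W$$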
Let's consider the 2nd part for the moment-
If W were a scalar, this part would simply be a quadratic in w.
So the derivative of this will be linear in w.
The vector equivalent will also work out the same way, with the gradient proportional to W (both steps are sketched below).
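Filling in those steps (still assuming the λ/2 scaling used above), the scalar case and its vector analogue would be:

$$\frac{d}{dw}\left(\frac{\lambda}{2} w^2\right) = \lambda w \qquad\qquad \nabla_W\left(\frac{\lambda}{2}\, W^T W\right) = \lambda W$$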
Now let’s consider the 1st part-
For this part, we will use the chain rule as usual and we will get the following
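Sketching how the chain rule plays out for a single term of the sum (differentiate the log, then the exponential, then the inner product w.r.t. W):

$$\nabla_W \log\left(1 + e^{-y_i W^T x_i}\right) = \frac{-y_i x_i\, e^{-y_i W^T x_i}}{1 + e^{-y_i W^T x_i}} = \frac{-y_i x_i}{1 + e^{\,y_i W^T x_i}}$$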
So the overall equation will be —
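Putting the two parts together, the gradient we would need to set to zero looks like:

$$\nabla_W L = \sum_{i=1}^{n} \frac{-y_i x_i}{1 + e^{\,y_i W^T x_i}} + \lambda W = 0$$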
Setting this gradient to zero and solving for W analytically is extremely hard (there is no closed-form solution), so we use computational, iterative techniques like Gradient Descent.
Geometric Interpretation
The Gradient Descent algorithm is iterative, and this is how it works:
1. First, we take a random value of x*, where
x* = argmin f(x) w.r.t. x
In other words, by the end of the algorithm x* will hold the value of argmin f(x) w.r.t. x; initially, we just assign a random value to it.
2. So, x_0 = initial value of x*
3. As more and more iterations happen we will keep on assigning newer values:
x_1= value of x* at iteration 1
x_2= value of x* at iteration 2
and so on…..
At the end of the iterations, we will have x_k, whose value will be very close to x*.
Now before showing how exactly the algorithm will work, let’s clear some basic concepts-
Let us have a simple bowl-shaped (convex) function f(x), something like f(x) = x², with x* marked at the bottom of the bowl. Then -
a) The x* marked at the bottom is approximately what the ideal x*, the one obtained at the end of the algorithm, will look like, because we are calculating the value of x which gives the minimum value of f(x).
b) To the right of x* (quadrant 1 of the plot), all the slopes are positive, and to the left of x* (quadrant 2), all the slopes are negative.
c) A maximization problem can always be converted into a minimization problem (and vice versa), because
min f(x) = -max(-f(x))
max f(x) = -min(-f(x))
Working of the Gradient Descent Algorithm
a) Let’s pick a point x_0 initially at random
b) Now we will find x_1 such that x_1 is closer to x*, using the update x_1 = x_0 - r*[df/dx] at x_0, where
r = step-size, and let r = 1 for now.
If x_0 lies to the right of x* (quadrant 1), the slope of f(x) at x_0 will be positive, i.e., [df/dx] at x_0 will be positive,
so x_1 = x_0 - (1 × a positive quantity) => x_1 is shifted towards the left of x_0 and is closer to x*.
c) Similarly, we will calculate x_2.
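Written out with the step-size r (this just restates the updates described in b) and c)):

$$x_1 = x_0 - r\left[\frac{df}{dx}\right]_{x_0} \qquad\qquad x_2 = x_1 - r\left[\frac{df}{dx}\right]_{x_1}$$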
Now we can generalize the whole thing as:
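That is, at every iteration k (using the same step-size r as above):

$$x_{k+1} = x_k - r\left[\frac{df}{dx}\right]_{x_k}$$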
We will have k iterations, so we will have-
x_0, x_1, ....., x_k
The termination condition of the loop will be:
if |x_{k+1} - x_k| is very small, then terminate with x* = x_k
Intuitively, at each iteration we step opposite to the slope, and as the iterates approach x* the magnitude of the slope (and hence the size of each step) keeps reducing.
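To make the whole loop concrete, here is a minimal 1-D sketch in Python (the function names, the default step-size r = 0.1, the tolerance eps, and the max_iters cap are all illustrative choices, not something fixed by the algorithm):

```python
def gradient_descent(df_dx, x0, r=0.1, eps=1e-6, max_iters=1000):
    """Minimal 1-D gradient descent sketch.

    df_dx     : derivative of the function we want to minimise
    x0        : random initial guess for x*
    r         : step-size (kept fixed here; see the next section)
    eps       : terminate when |x_{k+1} - x_k| is very small
    max_iters : safety cap on the number of iterations
    """
    x_k = x0
    for _ in range(max_iters):
        x_next = x_k - r * df_dx(x_k)   # x_{k+1} = x_k - r * [df/dx] at x_k
        if abs(x_next - x_k) < eps:     # termination condition from above
            return x_next
        x_k = x_next
    return x_k

# Example: f(x) = x^2 has df/dx = 2x and its minimum at x* = 0
x_star = gradient_descent(lambda x: 2 * x, x0=3.0)
print(x_star)   # very close to 0
```

Note that the step-size r is still kept fixed in this sketch; the next section explains why that can be a problem.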
The problem with fixed r or step-size value
Till now we were keeping the value of the step-size r constant, which is a problem, and now we will understand why-
Let us take the function f(x) = x² (so df/dx = 2x) and suppose that after some iterations we have reached x_i = 0.5.
Now if we calculate x_{i+1} = x_i - r*(df/dx at x_i) = 0.5 - 1*(2 × 0.5), we will find that it has a value of -0.5, which is not moving towards x* = 0 at all.
Again, x_{i+2} will have a value of +0.5. Similarly, x_{i+3} = -0.5 and x_{i+4} = +0.5, and we will keep oscillating between -0.5 and +0.5 forever. This kind of problem arises when the r value is kept fixed. So we reduce r with each iteration and keep it dynamic.
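To see both behaviours concretely, here is a tiny Python sketch (the helper name run and the 1/(i + 3) decay schedule are just illustrative choices, not something prescribed by the algorithm):

```python
def run(step_size, x0=0.5, iters=6):
    """Apply x_{i+1} = x_i - r_i * f'(x_i) for f(x) = x^2, where f'(x) = 2x."""
    x, history = x0, [x0]
    for i in range(iters):
        x = x - step_size(i) * (2 * x)   # gradient of x^2 at x is 2x
        history.append(round(x, 4))
    return history

# Fixed step-size r = 1: the iterates just bounce between +0.5 and -0.5
print(run(lambda i: 1.0))            # [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5]

# A shrinking step-size, e.g. r_i = 1 / (i + 3): the iterates head towards x* = 0
print(run(lambda i: 1.0 / (i + 3)))
```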