Advanced optimization strategies for better convergence and speed (Part 1) 🚀

Maharshi Yeluri
6 min read Ā· Dec 3, 2020

Don't let defaults control your gradients, because gradients are precious in deep learning 😬

Adam is a decent optimization algorithm, first published in 2014, but with advances in architectures and compute it is no longer a perfect choice. Just to emphasize why better optimization strategies matter, consider the billionaire GPT-3 (in the algorithmic world this guy must be a billionaire, if such a thing exists 😅), where a single batch runs into the millions. Now the question is: can Adam handle such a large size? I seriously doubt it (well, you cannot validate my statement unless you're a billionaire willing to burn your cash on Azure 😛🤣). The point of this article is not to blame early optimizers, but modern problems require modern solutions. In this article let's see what those modern problems and modern solutions are 👀

ADAM

Well, let's start by understanding what this guy does when you employ him to control your gradient updates.
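The equations appear as an image in the original post, so here they are written out. This is the standard Adam update from the original paper (gradient g_t, parameters Īø, learning rate α, hyperparameters β₁, β₂, ε), numbered (1)–(5) to match the discussion below:

```latex
% Standard Adam update (Kingma & Ba, 2014)
\begin{aligned}
& m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t
  && \text{(1) first-moment EMA} \\
& v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
  && \text{(2) second-moment EMA} \\
& \hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}
  && \text{(3) bias correction for } m_t \\
& \hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}
  && \text{(4) bias correction for } v_t \\
& \theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
  && \text{(5) parameter update}
\end{aligned}
```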

Equations 1 and 2 are exponential moving averages of the first and second moments of the gradient, which can be thought of as momentum. Momentum avoids taking a completely random direction based only on the currently estimated gradient and makes sure the update aligns with previous gradient directions. Equation 3 is a debiasing step: at t = 1 we initialize m_t and v_t to zero, which scales the estimates down because more weight is given to the previous momentum, which is 0, thus shrinking the overall estimated momentum. This effect only lasts for the first few iterations; to overcome it we simply scale the estimates back up by dividing by (1 āˆ’ β^t), where β ∈ (0, 1), and over time (1 āˆ’ β^t) → 1. Equation 4 does the same for v_t.

In equation 5 the learning rate is adaptively scaled; let's understand what is happening. If the square root of the second-moment estimate (√v̂_t) is high, say > 1, we effectively scale the learning rate down to decrease the magnitude of the update, since we don't want the momentum to take large steps; by doing so we keep the direction of the momentum while avoiding large strides. On the flip side, if the gradient is very small, say close to zero, the learning rate is effectively scaled up so we still take a minimum stride in the direction of the gradient, which helps achieve faster convergence. Simple and straightforward, that's what Adam does.
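To make that concrete, here is a minimal NumPy sketch of a single Adam step. The helper `adam_step` and the toy loop are mine, not any framework's implementation; it is just equations (1)–(5) translated directly, with default hyperparameters as placeholders:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: equations (1)-(5) applied to a parameter array."""
    m = beta1 * m + (1 - beta1) * grad           # (1) first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2      # (2) second-moment EMA
    m_hat = m / (1 - beta1 ** t)                 # (3) debias m_t
    v_hat = v / (1 - beta2 ** t)                 # (4) debias v_t
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # (5) adaptive update
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2 starting from theta = 5
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    grad = 2 * theta                             # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # ends up near the optimum at 0 (within roughly one lr-sized step)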

Various optimization algorithms have been proposed in recent years. Let's start by understanding the need for these optimizers and the problems they are trying to solve in detail.

RADAM (Rectified Adam)

We often see warm-up strategies while training neural machine translation (NMT) models, during BERT pre-training, and so forth. Instead of using the full learning rate α_t from the very first step, a learning-rate warmup strategy sets α_t to smaller values in the first few steps. For example, linear warmup sets α_t = t Ɨ α_0 for t < T_w (α_0: constant learning rate, t: iteration number, T_w: maximum number of warm-up iterations). Warmup has been demonstrated to be beneficial in many deep learning applications. For example, in the NMT experiments the training loss converges at around 10 when warmup is not applied (Adam-vanilla), and it surprisingly decreases after applying warmup (Adam-warmup). Now the question is: why do we need warmup if Adam already has an adaptive learning-rate mechanism?
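As a practical illustration, here is a hedged sketch of linear warmup on top of Adam using PyTorch's LambdaLR scheduler. The model, the base learning rate, and `warmup_steps` are placeholder values for the sketch, not the settings used in the paper:

```python
import torch

model = torch.nn.Linear(10, 1)          # stand-in model
base_lr, warmup_steps = 1e-3, 2000      # placeholder hyperparameters

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
# Scale the learning rate linearly from ~0 up to base_lr over the first warmup_steps updates
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(5000):                # training-loop skeleton with a dummy loss
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()                    # advance the warmup schedule
```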

The above plot is a histogram of gradients (absolute value of the gradients along the X-axis on a log scale, stacked along the Y-axis, which represents iterations). If we look closely at the second diagram, after the first few updates the magnitude of the gradients increases from 4 Ɨ 10⁶ to ~10⁷, making the updates large and trapping the model in bad/suspicious local optima. When we do the warmup, it essentially reduces the impact of these problematic updates and avoids the convergence problem.

The problem must be the adaptive ratio making the learning rate too high because of the low initial value of the square root of the second-moment estimate (√v_t), which can be thought of as a variance term. Because the variance of the term 1/√v_t is high early on, we end up drastically upscaling the learning rate, thereby increasing the magnitude of the updates. What we are doing with warmup is sticking to a small learning rate, which is equivalent to reducing the variance of the term 1/√v_t.

Instead of explicit warmup, we could also apply some simple fixes to vanilla Adam:

  1. Adam-2k: In order to reduce the variance of the adaptive learning rate (the 1/√v_t term), Adam-2k only updates the second-moment estimate in the first two thousand iterations, keeping the first-moment estimate and the weights of the network fixed, thereby giving the variance some time to settle.
  2. Adam-eps: The other tweak we could do is increase the value of epsilon to reduce the effective learning rate without changing the variance term (a minimal sketch of both tweaks follows below).
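Of the two, Adam-eps is trivial to try in practice. A hedged sketch, assuming PyTorch; the 1e-4 value is just an example of a larger-than-default epsilon, not a tuned recommendation:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model

# Adam-eps: raise epsilon well above the default 1e-8 so the 1/(sqrt(v_t) + eps)
# ratio is damped while v_t is still a poor, high-variance estimate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-4)

# Adam-2k, by contrast, needs a custom training loop: for the first 2000 iterations
# you would update only the second-moment estimate v_t and skip the parameter and
# first-moment updates, which stock torch.optim.Adam does not expose as an option.
```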

Rectified Adam

The main idea behind this optimizer is to reduce the variance without any need for explicit thresholding on the number of iterations or the need for choosing the best ε, which is one more hyperparameter to tune.
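The equations from the paper are shown as an image in the original post; up to notation, the RAdam rectification adds the following on top of Adam's debiased moments m̂_t and v̂_t:

```latex
% RAdam rectification (Liu et al., 2019)
\begin{aligned}
& \rho_\infty = \frac{2}{1-\beta_2} - 1, \qquad
  \rho_t = \rho_\infty - \frac{2\, t\, \beta_2^{\,t}}{1-\beta_2^{\,t}} \\[4pt]
& \text{if } \rho_t > 4: \quad
  r_t = \sqrt{\frac{(\rho_t-4)(\rho_t-2)\,\rho_\infty}{(\rho_\infty-4)(\rho_\infty-2)\,\rho_t}}, \qquad
  \theta_t = \theta_{t-1} - \alpha_t\, r_t\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}} \\[4pt]
& \text{otherwise:} \quad
  \theta_t = \theta_{t-1} - \alpha_t\, \hat{m}_t
  \quad \text{(plain SGD-with-momentum style update)}
\end{aligned}
```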

The above equations are very similar to Adam, with an added rectification term. Also notice that two different update rules are involved: at iteration 1 with β₂ = 0.99 we have ρ_∞ ā‰ˆ 199, and the formula for ρ_t gives ρ_1 = 1, so the optimizer falls back to the SGD-with-momentum style update. This is because in the initial stages, when the variance of the adaptive term is not yet tractable, a plain SGD-style update has proven to be much better behaved than Adam's.

From the above plots, if we look at the violet curve that corresponds to β₂ = 0.99, ρ_t crosses 4 within the first 10 iterations. Once ρ_t crosses the threshold of 4, we switch to the rectified version of Adam, where the rectification term r_t is multiplied directly into the learning rate to reduce the variance during the initial iterations. If we look at the violet line in figure 3 (Rectification), the value of r_t stays below 1 for almost 1,000 iterations and eventually reaches 1. After those 1,000 iterations the optimizer behaves like good old Adam.
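To see the threshold-of-4 behaviour in code, here is a minimal NumPy sketch of one RAdam step. The helper `radam_step` is a direct transcription of the equations above, not any framework's implementation, and the hyperparameter defaults are placeholders:

```python
import numpy as np

def radam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999):
    """One RAdam update: rectify the adaptive term only once rho_t > 4."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)

    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)

    if rho_t > 4:   # variance of the adaptive term is tractable: rectified Adam update
        denom = np.sqrt(v / (1 - beta2 ** t)) + 1e-8   # tiny constant only for numerical safety
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                      / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        theta = theta - lr * r_t * m_hat / denom
    else:           # early iterations: fall back to SGD with momentum
        theta = theta - lr * m_hat
    return theta, m, v
```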

Results:

In the following diagram, the performance of RAdam isn't affected much by the initial learning rate, unlike the others. This is fantastic.

In the following plot, if we compare the test accuracy of the first and last panels, RAdam slightly outperforms Adam with a warm-up of 500/1000 iterations. Who wants to run a search for the effective number of warm-up steps anyway? 😎

To conclude, RAdam is one of the best optimizers for many existing tasks. It could easily outperform the default set of optimizers out there, so perhaps RAdam should become a default optimizer just like Adam, and why not?
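If you want to try it, RAdam has since been merged into PyTorch itself as torch.optim.RAdam (available in reasonably recent releases, roughly 1.10 onward), so it is a drop-in replacement for Adam. The hyperparameters below are the common defaults, not tuned values:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model

# Drop-in replacement for Adam with the rectification built in
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
```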
