Advanced optimization strategies for better convergence and speed (Part 1) 🚀

Maharshi Yeluri
6 min read Ā· Dec 3, 2020

Don't let defaults control your gradients, because gradients are precious in deep learning 😬

Adam is a decent optimization algorithm, first published in 2014, but with advances in architectures and compute it is no longer a perfect choice. Just to emphasize why better optimization strategies matter, consider the billionaire GPT-3 (in the algorithmic world this guy must be a billionaire, if such a thing exists 😅), where a single batch runs into the millions. Now the question is: can Adam handle such a large size? I seriously doubt it (well, you cannot validate my statement unless you're a billionaire willing to burn your cash on Azure 😛🤣). The point of this article is not to blame early optimizers, but modern problems require modern solutions. In this article let's see what those modern problems and modern solutions are 👀

ADAM

Well, let's start by understanding what this guy does when you employ him to control your gradient updates.
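The equations appear as an image in the original post, so here they are written out. This is the standard Adam update from the original paper (gradient g_t, parameters Īø, learning rate α, hyperparameters β₁, β₂, ε), numbered (1)–(5) to match the discussion below:

```latex
% Standard Adam update (Kingma & Ba, 2014)
\begin{aligned}
& m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t
  && \text{(1) first-moment EMA} \\
& v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
  && \text{(2) second-moment EMA} \\
& \hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}
  && \text{(3) bias correction for } m_t \\
& \hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}
  && \text{(4) bias correction for } v_t \\
& \theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
  && \text{(5) parameter update}
\end{aligned}
```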

Equations 1 and 2 are exponential moving averages of the first and second moments of the gradient, which can be thought of as momentum. Momentum avoids taking a completely random direction based only on the currently estimated gradient and makes sure the update aligns with previous gradient directions. Equation 3 is a debiasing step: at t = 1 we initialize m_t and v_t to zero, which scales the estimates down because more weight is given to the previous momentum, which is 0, thus shrinking the overall estimated momentum. This effect only lasts for the first few iterations; to overcome it we simply scale the estimates back up by dividing by (1 āˆ’ β^t), where β ∈ (0, 1), and over time (1 āˆ’ β^t) → 1. Equation 4 does the same for v_t.

In equation 5 the learning rate is adaptively scaled; let's understand what is happening. If the square root of the second-moment estimate (√v̂_t) is high, say > 1, we effectively scale the learning rate down to decrease the magnitude of the update, since we don't want the momentum to take large steps; by doing so we keep the direction of the momentum while avoiding large strides. On the flip side, if the gradient is very small, say close to zero, the learning rate is effectively scaled up so we still take a minimum stride in the direction of the gradient, which helps achieve faster convergence. Simple and straightforward, that's what Adam does.
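To make that concrete, here is a minimal NumPy sketch of a single Adam step. The helper `adam_step` and the toy loop are mine, not any framework's implementation; it is just equations (1)–(5) translated directly, with default hyperparameters as placeholders:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: equations (1)-(5) applied to a parameter array."""
    m = beta1 * m + (1 - beta1) * grad           # (1) first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2      # (2) second-moment EMA
    m_hat = m / (1 - beta1 ** t)                 # (3) debias m_t
    v_hat = v / (1 - beta2 ** t)                 # (4) debias v_t
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # (5) adaptive update
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2 starting from theta = 5
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    grad = 2 * theta                             # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # ends up near the optimum at 0 (within roughly one lr-sized step)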

Various optimization algorithms have been proposed in recent years. Let's start by understanding the need for these optimizers and the problems they are trying to solve in detail.

RADAM (Rectified Adam)

We often see warm-up strategies while training neural machine translation (NMT) models, during BERT pre-training, and so forth. Instead of using the full learning rate α_t from the very first step, a learning-rate warmup strategy sets α_t to smaller values in the first few steps. For example, linear warmup sets α_t = t Ɨ α_0 for t < T_w (α_0: constant learning rate, t: iteration number, T_w: maximum number of warm-up iterations). Warmup has been demonstrated to be beneficial in many deep learning applications. For example, in the NMT experiments the training loss converges at around 10 when warmup is not applied (Adam-vanilla), and it surprisingly decreases after applying warmup (Adam-warmup). Now the question is: why do we need warmup if Adam already has an adaptive learning-rate mechanism?
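As a practical illustration, here is a hedged sketch of linear warmup on top of Adam using PyTorch's LambdaLR scheduler. The model, the base learning rate, and `warmup_steps` are placeholder values for the sketch, not the settings used in the paper:

```python
import torch

model = torch.nn.Linear(10, 1)          # stand-in model
base_lr, warmup_steps = 1e-3, 2000      # placeholder hyperparameters

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
# Scale the learning rate linearly from ~0 up to base_lr over the first warmup_steps updates
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(5000):                # training-loop skeleton with a dummy loss
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()                    # advance the warmup schedule
```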

The above plot is a histogram of gradients (absolute value of the gradients along the X-axis on a log scale, stacked along the Y-axis, which represents iterations). If we look closely at the second diagram, after the first few updates the magnitude of the gradients increases from 4 Ɨ 10⁶ to ~10⁷, making the updates large and trapping the model in bad/suspicious local optima. When we do the warmup, it essentially reduces the impact of these problematic updates and avoids the convergence problem.

The problem must be the adaptive ratio making the learning rate too high because of the low initial value of the square root of the second-moment estimate (√v_t), which can be thought of as a variance term. Because the variance of the term 1/√v_t is high early on, we end up drastically upscaling the learning rate, thereby increasing the magnitude of the updates. What we are doing with warmup is sticking to a small learning rate, which is equivalent to reducing the variance of the term 1/√v_t.

Instead of explicit warmup, we could also apply some simple fixes to vanilla Adam:

  1. Adam-2k: In order to reduce the variance of the adaptive learning rate (the 1/√v_t term), Adam-2k only updates the second-moment estimate in the first two thousand iterations, keeping the first-moment estimate and the weights of the network fixed, thereby giving the variance some time to settle.
  2. Adam-eps: The other tweak we could do is increase the value of epsilon to reduce the effective learning rate without changing the variance term (a minimal sketch of both tweaks follows below).
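Of the two, Adam-eps is trivial to try in practice. A hedged sketch, assuming PyTorch; the 1e-4 value is just an example of a larger-than-default epsilon, not a tuned recommendation:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model

# Adam-eps: raise epsilon well above the default 1e-8 so the 1/(sqrt(v_t) + eps)
# ratio is damped while v_t is still a poor, high-variance estimate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-4)

# Adam-2k, by contrast, needs a custom training loop: for the first 2000 iterations
# you would update only the second-moment estimate v_t and skip the parameter and
# first-moment updates, which stock torch.optim.Adam does not expose as an option.
```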

Rectified Adam

The main idea behind this optimizer is to reduce the variance without any need for explicit thresholding on the number of iterations or the need for choosing the best ε, which is one more hyperparameter to tune.
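The equations from the paper are shown as an image in the original post; up to notation, the RAdam rectification adds the following on top of Adam's debiased moments m̂_t and v̂_t:

```latex
% RAdam rectification (Liu et al., 2019)
\begin{aligned}
& \rho_\infty = \frac{2}{1-\beta_2} - 1, \qquad
  \rho_t = \rho_\infty - \frac{2\, t\, \beta_2^{\,t}}{1-\beta_2^{\,t}} \\[4pt]
& \text{if } \rho_t > 4: \quad
  r_t = \sqrt{\frac{(\rho_t-4)(\rho_t-2)\,\rho_\infty}{(\rho_\infty-4)(\rho_\infty-2)\,\rho_t}}, \qquad
  \theta_t = \theta_{t-1} - \alpha_t\, r_t\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}} \\[4pt]
& \text{otherwise:} \quad
  \theta_t = \theta_{t-1} - \alpha_t\, \hat{m}_t
  \quad \text{(plain SGD-with-momentum style update)}
\end{aligned}
```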

The above equations are very similar to Adam, with an added rectification term. Also notice that two different update rules are involved: at iteration 1 with β₂ = 0.99 we have ρ_∞ ā‰ˆ 199, and the formula for ρ_t gives ρ_1 = 1, so the optimizer falls back to the SGD-with-momentum style update. This is because in the initial stages, when the variance of the adaptive term is not yet tractable, a plain SGD-style update has proven to be much better behaved than Adam's.

From the above plots, if we look at the violet curve that corresponds to β₂ = 0.99, ρ_t crosses 4 within the first 10 iterations. Once ρ_t crosses the threshold of 4, we switch to the rectified version of Adam, where the rectification term r_t is multiplied directly into the learning rate to reduce the variance during the initial iterations. If we look at the violet line in figure 3 (Rectification), the value of r_t stays below 1 for almost 1,000 iterations and eventually reaches 1. After those 1,000 iterations the optimizer behaves like good old Adam.
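To see the threshold-of-4 behaviour in code, here is a minimal NumPy sketch of one RAdam step. The helper `radam_step` is a direct transcription of the equations above, not any framework's implementation, and the hyperparameter defaults are placeholders:

```python
import numpy as np

def radam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999):
    """One RAdam update: rectify the adaptive term only once rho_t > 4."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)

    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)

    if rho_t > 4:   # variance of the adaptive term is tractable: rectified Adam update
        denom = np.sqrt(v / (1 - beta2 ** t)) + 1e-8   # tiny constant only for numerical safety
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                      / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        theta = theta - lr * r_t * m_hat / denom
    else:           # early iterations: fall back to SGD with momentum
        theta = theta - lr * m_hat
    return theta, m, v
```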

Results:

In the following diagram, the performance of RAdam isn't affected much by the initial learning rate, unlike the others. This is fantastic.

In the following plot, if we compare the test accuracy of the first and last panels, RAdam slightly outperforms Adam with a warm-up of 500/1000 iterations. Who wants to run a search for the effective number of warm-up steps anyway? 😎

To conclude, RAdam is one of the best optimizers for many existing tasks. It could easily outperform the default set of optimizers out there, so perhaps RAdam should become a default optimizer just like Adam, and why not?
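If you want to try it, RAdam has since been merged into PyTorch itself as torch.optim.RAdam (available in reasonably recent releases, roughly 1.10 onward), so it is a drop-in replacement for Adam. The hyperparameters below are the common defaults, not tuned values:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model

# Drop-in replacement for Adam with the rectification built in
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
```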
