AdamW and AdaBelief: Optimizers Based on and Better than Adam

Ye Chen
Nov 29, 2020


Adam is one of the most popular optimizers used in deep learning. However, in some cases adaptive gradient methods like Adam do not generalize as well as SGD with momentum, for example in image classification, as suggested by Wilson et al. [1]. In this blog, I am going to introduce two methods that improve on Adam and make it competitive with SGD with momentum: AdamW by Loshchilov et al. [2] and AdaBelief by Zhuang et al. [3]. I will explain how they modify the original Adam method, why these modifications help, and how they perform compared with other methods. Experimental results, figures, and conclusions from the original papers are cited to better illustrate the problem.

AdamW: Decoupled Weight Decay Regularization (2017)

In this paper, the authors compare SGD and Adam and ask which form of regularization, L_2 regularization or weight decay, is better for training deep neural networks. They find that L_2 regularization may be the reason why Adam is outperformed by SGD with momentum: common deep learning libraries implement only L_2 regularization rather than the original weight decay, and the two are not identical. In adaptive gradient methods, L_2 regularization regularizes weights with large historic gradient amplitudes less than decoupled weight decay would. In SGD, by contrast, the two can be made equivalent by reparameterizing the weight decay factor based on the learning rate.

To make Adam competitive with SGD with momentum, they improve the regularization in Adam: instead of the commonly used L_2 regularization, they decouple the weight decay from the gradient-based update. The modified method is AdamW. The following algorithm shows the differences between Adam and AdamW. To compare against SGD, they also provide SGDW, which uses decoupled weight decay as well.

Fig 1: SGD, SGDW, Adam, and AdamW algorithms [2]
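To make the difference concrete, here is a minimal per-parameter sketch in PyTorch. This is my own illustration, not the authors' code, and bias correction is omitted for brevity: with L_2 regularization the penalty term is folded into the gradient and therefore rescaled by the adaptive denominator, while AdamW applies the decay directly to the weights.

```python
import torch

def adam_l2_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2):
    # L_2 regularization: the penalty is added to the gradient, so it is
    # later rescaled by the adaptive denominator sqrt(v) as well
    grad = grad + wd * w
    m.mul_(beta1).add_(grad, alpha=1 - beta1)             # EMA of gradients
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # EMA of squared gradients
    w.sub_(lr * m / (v.sqrt() + eps))

def adamw_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    w.sub_(lr * m / (v.sqrt() + eps))
    # decoupled weight decay: applied directly to the weight,
    # untouched by the adaptive rescaling
    w.sub_(lr * wd * w)
```

For what it's worth, PyTorch's built-in torch.optim.AdamW implements this decoupled form, while the weight_decay argument of torch.optim.Adam adds the L_2 term to the gradient.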

Now we can take a look at the experimental validation results.

Fig 2: The Top-1 test error of a 26 2x64d ResNet on CIFAR-10 measured after 100 epochs [2]

In the above figure from the paper, the authors run a 26 2x64d ResNet with SGD, SGDW, Adam, and AdamW on CIFAR-10 and compare their test errors. In the first row, we can see that for SGD with L_2 regularization, the regularization factor is not decoupled from the learning rate: the best hyperparameter settings lie on a diagonal rather than being aligned with the x- or y-axis, which makes hyperparameter tuning inconvenient. The situation is even worse for Adam with L_2 regularization in the second row. Both SGDW and AdamW, however, perform well and largely decouple weight decay from the learning rate. AdamW here has performance comparable to SGDW with momentum, and both have a more separable hyperparameter space, which simplifies hyperparameter tuning.

In the following figure, the authors train a 26 2x96d ResNet with Adam and AdamW on CIFAR-10 to compare their performance. The top row shows learning curves and the bottom row shows generalization results. We can see that AdamW gives better results in every case.

Fig 3: Learning curves and generalization results obtained by a 26 2x96d ResNet trained with Adam and AdamW on CIFAR-10 [2]

They then put several optimizers together, including SGDW, SGDWR, Adam, AdamWR, and AdamW, to compare their generalization performance. In Figure 4, they show training results on CIFAR-10 and ImageNet32x32. In these experiments, AdamW not only yields better training loss but also generalizes better than Adam, and is competitive with SGDW.

Fig 4: Top-1 test error on CIFAR-10 and Top-5 test error on ImageNet32x32 [2]

In conclusion, this paper clarifies that L_2 regularization and weight decay are not equivalent for Adam, and decouples weight decay from the gradient-based update, yielding AdamW. AdamW generalizes better than Adam and gives a more separable hyperparameter space for tuning. In the next paper, we can see how AdamW compares with another new method named AdaBelief, which is another modification of Adam that also achieves strong performance, competitive with SGD with momentum.

AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients (2020)

This paper introduces AdaBelief, a simple modification of Adam that adds no extra parameters. The paper compares the two algorithms side by side in the following figure:

Fig 5: Adam and AdaBelief [3]

The parts marked in blue are where they differ. The intuition behind the modification is to view m_t as a prediction of g_t, take a large step when the observation g_t is close to the prediction m_t, and take a small step when g_t deviates greatly from m_t. Why adapt the step size in this way? The paper describes three benefits.
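In code, the change amounts to one line in the second-moment estimate. Below is a minimal sketch in PyTorch, my own simplification of the idea rather than the authors' implementation, with bias correction omitted:

```python
import torch

def update_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                adabelief=False):
    m.mul_(beta1).add_(grad, alpha=1 - beta1)      # m_t: EMA of gradients, the "prediction"
    if adabelief:
        diff = grad - m                            # how far g_t deviates from m_t
        v.mul_(beta2).addcmul_(diff, diff, value=1 - beta2)   # EMA of (g_t - m_t)^2
    else:
        v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # EMA of g_t^2
    # a small denominator (the prediction is trusted) gives a large step
    w.sub_(lr * m / (v.sqrt() + eps))
```

When g_t stays close to m_t, (g_t - m_t)^2 is small, so AdaBelief's denominator shrinks and the step grows, which matches the intuition above.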

The first benefit is that AdaBelief uses curvature information. Fig 6, adapted from Toussaint [4], shows how an ideal optimizer accounts for the curvature of the loss function, and Fig 7 compares how different optimizers take steps in various cases. Rather than simply taking a large step when the gradient is large and a small step when it is small, an ideal optimizer should consider how the loss function changes. In Fig 7, AdaBelief is the only optimizer that takes steps the ideal way in every case.

Fig 6: An ideal optimizer considers curvature of the loss function [4]
Fig 7: Table of comparison of optimizers in various cases [3]

Secondly, AdaBelief considers the sign of the gradient in its denominator. The left part of Fig 8 shows how an update should ideally move, and the right part shows that Adam's EMA of (g_t)² depends only on the magnitude of the gradient, while AdaBelief uses both the magnitude and the sign of g_t, which matches the behaviour of an ideal optimizer.

Fig 8: Adam and AdaBelief’s moving direction [3]
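A quick toy check (my own example, not an experiment from the paper) makes this concrete: a steady gradient of +1 and a sign-flipping gradient of ±1 give Adam essentially the same denominator, because the EMA of (g_t)² ignores the sign, while AdaBelief's EMA of (g_t - m_t)² tells the two apart:

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
cases = [("steady +1", np.ones(5000)),
         ("flipping ±1", np.where(np.arange(5000) % 2 == 0, 1.0, -1.0))]

for name, grads in cases:
    m = v = s = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2          # Adam: EMA of g_t^2
        s = beta2 * s + (1 - beta2) * (g - m)**2    # AdaBelief: EMA of (g_t - m_t)^2
    # Adam's denominator is ~1 in both cases; AdaBelief's is tiny for the
    # steady gradient and ~1 for the flipping one
    print(f"{name}: Adam denom ~ {np.sqrt(v):.3f}, AdaBelief denom ~ {np.sqrt(s):.3f}")
```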

Thirdly, in the low-variance case, AdaBelief avoids the near "sign descent" update direction that Adam falls into. Fig 9 shows that Adam's update formula behaves like sign descent in the low-variance situation. For AdaBelief, when the variance of the gradient is the same for all coordinates, the update direction matches the gradient direction; when the variance is not uniform, AdaBelief takes a small step where the variance is large and a large step where the variance is small.

Fig 9: Adam update formula in low-variance situation [3]
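The same kind of toy check (again my own illustration, with bias correction omitted) shows the low-variance behaviour: with nearly constant gradients of very different magnitudes in two coordinates, Adam's per-coordinate steps collapse towards ±lr, like sign descent, while AdaBelief's steps stay roughly proportional to the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
g_mean = np.array([10.0, 0.1])   # two coordinates with very different gradient scales
noise_std = 0.01                 # the same small variance in both coordinates

m, v, s = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(20000):
    g = g_mean + noise_std * rng.standard_normal(2)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2           # Adam: EMA of g_t^2
    s = beta2 * s + (1 - beta2) * (g - m)**2     # AdaBelief: EMA of (g_t - m_t)^2

print("Adam step:     ", lr * m / (np.sqrt(v) + eps))   # both coords ~ lr: magnitudes lost
print("AdaBelief step:", lr * m / (np.sqrt(s) + eps))   # ratio ~ 100:1, follows the gradient
```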

The paper then compares the convergence speed of SGD, Adam, and AdaBelief. In Fig 10, we can see that AdaBelief is always the first to reach the optimal point across different settings. They also provide a video showing the simulation process on their GitHub: https://github.com/juntang-zhuang/Adabelief-Optimizer.

Fig 10: Trajectories of SGD, Adam and AdaBelief [3]
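The repository above also ships a PyTorch implementation. Below is a minimal usage sketch; I am assuming the adabelief-pytorch package published from that repository, and the argument names and recommended defaults (such as eps) may differ between versions, so check the README before copying.

```python
import torch
import torch.nn as nn
from adabelief_pytorch import AdaBelief  # pip install adabelief-pytorch (assumed package name)

model = nn.Linear(10, 2)
# hyperparameters here are illustrative, not the paper's tuned settings
optimizer = AdaBelief(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-16, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```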

They then show generalization performance on several models and datasets in Fig 11, including AdaBelief, SGD, Adam, AdamW, etc. AdaBelief gives the best performance across the different models and datasets. Meanwhile, AdamW also does well, clearly outperforming Adam and staying competitive with SGD.

Fig 11: Comparison of generalization performance [3]

Lastly, they train GANs with AdaBelief to test its stability, since the stability of an optimizer is essential in practice and some recently released optimizers lack such experimental validation. They experiment with Wasserstein GAN (WGAN), WGAN-GP, and SN-GAN, a larger model that uses a ResNet generator and spectral normalization in the discriminator. With both large and small GANs, AdaBelief achieves the lowest FID. Fig 12 shows one of the experimental results, with SN-GAN; you can find more in the original paper.

Fig 12: FID of an SN-GAN with a ResNet generator on CIFAR-10 [3]

In conclusion, AdaBelief performs well in several respects. It has three key properties: fast convergence, as in adaptive gradient methods; generalization performance comparable to the SGD family; and stable training in complex settings such as GANs. In the generalization experiments in Fig 11, AdaBelief also shows higher test accuracy than AdamW.

Conclusion

In summary, the two papers introduced here both modify and improve Adam from different perspectives. The first paper introduces AdamW, which decouples weight decay from the gradient-based update to achieve better generalization and easier hyperparameter tuning; the same strategy can also be applied to SGD to obtain a more separable hyperparameter space. The second paper changes the way Adam adapts its step size without introducing any new parameters and greatly improves its performance, achieving faster convergence, generalization competitive with the SGD family, and stable training in complex settings such as GANs. Both papers are relatively new, and they show how much potential remains even within a very popular gradient descent strategy. They encourage trying new ideas, even slight changes, when training models. They also raise a question: even though better strategies keep appearing, Adam remains one of the most popular optimizers in use. All of these strategies have their own advantages and disadvantages, and it is up to researchers to find the situations where each is most suitable.

[1] Wilson, Ashia C., et al. "The marginal value of adaptive gradient methods in machine learning." Advances in Neural Information Processing Systems. 2017.

[2] Loshchilov, Ilya, and Frank Hutter. “Decoupled weight decay regularization.” arXiv preprint arXiv:1711.05101 (2017).

[3] Zhuang, Juntang, et al. “AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients.” Advances in Neural Information Processing Systems 33 (2020).

[4] Toussaint, Marc. "Lecture notes: Some notes on gradient descent." 2012.
