New variation of Adam?
Eliminate the generalization gap between adaptive methods and SGD;
TL;DR: A Faster And Better Optimizer with Highly Robust Performance;
- Dynamic bounds on the learning rates, inspired by gradient clipping;
- Not very sensitive to the hyperparameters, especially compared with SGD(M);
- Tested only on MNIST, CIFAR and Penn Treebank, no serious (large-scale) datasets;
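
To make the "dynamic bound on learning rates" point concrete, here is a rough sketch of how I read the idea: take a normal Adam-style update, but clip the per-element step size into an interval that starts very wide and tightens over time. This is not the authors' reference code. The function name `bounded_adam_step`, the parameters `final_lr` and `gamma`, and the exact bound schedule are assumptions chosen for illustration.

```python
import math

def bounded_adam_step(params, grads, state, lr=1e-3, final_lr=0.1,
                      beta1=0.9, beta2=0.999, gamma=1e-3, eps=1e-8):
    """One Adam-style update with the per-element step size clipped to [lower, upper]."""
    state["t"] += 1
    t = state["t"]
    # Assumed schedule: bounds start wide and both approach final_lr as t grows.
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    # Bias-corrected base step size, as in Adam.
    step_size = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        # Element-wise adaptive learning rate, then clipped to the dynamic bound.
        elem_lr = step_size / (math.sqrt(state["v"][i]) + eps)
        elem_lr = min(max(elem_lr, lower), upper)
        new_params.append(p - elem_lr * state["m"][i])
    return new_params, (lower, upper)

# Toy usage: minimize f(x) = x^2 starting from x = 5.
state = {"t": 0, "m": [0.0], "v": [0.0]}
x = [5.0]
for _ in range(100):
    x, bounds = bounded_adam_step(x, [2.0 * x[0]], state)
```

Early on the interval is so wide that the update behaves like plain Adam. Late in training both bounds close in on `final_lr`, so the update behaves more like SGD with momentum, which is presumably where the claimed generalization benefit comes from.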
Abstract: Adaptive optimization methods such as AdaGrad, RMSProp and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with SGD or even fail to converge due to unstable and extreme learning rates. Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods.