Abstract: Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is Adam, which integrates momentum ($m_t$) and an adaptive learning rate ($v_t$). However, the volatile nature of online learning data, characterized by frequent distribution shifts and the presence of noise, poses significant challenges to Adam's standard optimization process: (1) Adam may rely on outdated momentum and an outdated average of squared gradients, resulting in slower adaptation to distribution changes, and (2) Adam's performance is adversely affected by data noise. To mitigate these issues, we introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates. If momentum and gradient are in sync, CAdam proceeds with parameter updates according to Adam's original formulation; if not, it temporarily withholds updates and monitors potential shifts in data distribution in subsequent iterations. This method allows CAdam to distinguish between true distributional shifts and mere noise, and to adapt more quickly to new data distributions. Our experiments with both synthetic and real-world datasets demonstrate that CAdam surpasses other well-known optimizers, including the original Adam, in efficiency and noise robustness. Furthermore, in large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam, leading to substantial increases in the system's gross merchandise volume (GMV).
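The per-dimension check described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact consistency test, so this sketch assumes that "in sync" means the updated momentum and the current gradient share a sign in that dimension. The function name `cadam_step`, its hyperparameter defaults, and the choice to keep updating $m_t$ and $v_t$ even when the parameter update is withheld are all illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np


def cadam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One confidence-masked Adam step (illustrative sketch, not the paper's code).

    param, grad, m, v are arrays of the same shape; t is the 1-based step count.
    """
    # Standard Adam moment estimates.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    # Standard Adam bias correction.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # ASSUMPTION: momentum and gradient are "in sync" when they share a sign.
    aligned = (m * grad) > 0

    # Apply the usual Adam update only in aligned dimensions;
    # withhold the update (zero step) everywhere else.
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    param = param - np.where(aligned, update, 0.0)
    return param, m, v
```

Under these assumptions, a dimension whose gradient briefly flips sign because of noise is left unchanged for that step. If the flip persists, the momentum eventually changes sign as well and updates resume in the new direction, which is one way to read the abstract's distinction between noise and a true distribution shift.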

Cautious Optimizers: Improving Training with One Line of Code

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

PROFIT: A Specialized Optimizer for Deep Fine Tuning

CAdam: Confidence-Based Optimization for Online Learning

Narrowing the Focus: Learned Optimizers for Pretrained Models

Old Optimizer, New Norm: An Anthology

Faster Convergence for Transformer Fine-tuning with Line Search Methods

PROFIT: A PROximal FIne Tuning Optimizer for Multi-Task Learning

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction

CaAdam: Improving Adam optimizer using connection aware methods

Practical tradeoffs between memory, compute, and performance in learned optimizers

Adam-mini: Use Fewer Learning Rates To Gain More

Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise

An Efficient Optimization Technique for Training Deep Neural Networks

AdamZ: An Enhanced Optimisation Method for Neural Network Training

Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers

Deconstructing What Makes a Good Optimizer for Language Models

An Isometric Stochastic Optimizer

A comparative study of recently deep learning optimizers