Abstract: Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden forces practitioners to use more or higher-end GPUs or to reduce batch sizes, which limits training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) optimizer memory overhead that remains substantial if competitive performance is to be maintained. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened into a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimizer states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
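The abstract describes the core idea only at a high level: learning rate scaling is approximated from a low-rank auxiliary optimizer state obtained by random projection. The following is a minimal NumPy sketch of one way such an update could look, not the authors' implementation. The function name `apollo_like_step`, the `state` dictionary, the Gaussian projection scaled by `1/sqrt(rank)`, the per-column (channel-wise) norm ratio used as the scaling factor, and all hyperparameter defaults are assumptions made for illustration.

```python
import numpy as np

def apollo_like_step(W, G, state, lr=1e-3, rank=4,
                     betas=(0.9, 0.999), eps=1e-8, seed=0):
    """One hypothetical update of a weight matrix W (m x n) given gradient G (m x n).

    Assumption: Adam-style moments are kept only for the projected gradient
    (rank x n), and the full gradient is rescaled per column.
    """
    m, n = G.shape
    if "P" not in state:
        rng = np.random.default_rng(seed)
        # Random Gaussian projection, so no SVD is needed (assumed scaling).
        state["P"] = rng.standard_normal((rank, m)) / np.sqrt(rank)
        state["M"] = np.zeros((rank, n))  # first moment, low-rank space
        state["V"] = np.zeros((rank, n))  # second moment, low-rank space
        state["t"] = 0

    state["t"] += 1
    b1, b2 = betas

    # Project the gradient into the low-rank space.
    R = state["P"] @ G

    # Adam-style moment updates on the projected gradient only.
    state["M"] = b1 * state["M"] + (1 - b1) * R
    state["V"] = b2 * state["V"] + (1 - b2) * R**2
    m_hat = state["M"] / (1 - b1 ** state["t"])
    v_hat = state["V"] / (1 - b2 ** state["t"])
    R_adapted = m_hat / (np.sqrt(v_hat) + eps)

    # Structured (per-column) scaling: how much the adaptive rule rescaled
    # each column in the low-rank space.
    scale = np.linalg.norm(R_adapted, axis=0) / (np.linalg.norm(R, axis=0) + eps)

    # Apply the coarse scaling to the full-rank gradient (broadcast over rows).
    return W - lr * G * scale
```

In this sketch the persistent moment buffers hold `2 * rank * n` values instead of the `2 * m * n` values AdamW keeps for the same matrix. A rank-1 variant in the spirit of APOLLO-Mini would correspond to `rank=1`, although the abstract does not specify how its scaling factor is computed.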

APOLLO: An Optimized Training Approach for Long-form Numerical Reasoning

A Numerical Reasoning Question Answering System with Fine-grained Retriever and the Ensemble of Multiple Generators for FinQA

Numerical Reasoning for Financial Reports

CBR-Ren: A Case-Based Reasoning Driven Retriever-Generator Model for Hybrid Long-Form Numerical Reasoning

FinQA: A Dataset of Numerical Reasoning over Financial Data

Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning

Operation-Augmented Numerical Reasoning for Question Answering

Exploring Equation As a Better Intermediate Meaning Representation for Numerical Reasoning of Large Language Models

DyRRen: A Dynamic Retriever-Reranker-Generator Model for Numerical Reasoning over Tabular and Textual Data

Enhancing Numerical Reasoning with the Guidance of Reliable Reasoning Processes

ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering

NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual Question Answering

MathDQN: Solving Arithmetic Word Problems via Deep Reinforcement Learning

Reflection of Thought: Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems

APOLLO: SGD-like Memory, AdamW-level Performance

FinLLMs: A Framework for Financial Reasoning Dataset Generation with Large Language Models

Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search

Number Cookbook: Number Understanding of Language Models and How to Improve It

Targeted training for numerical reasoning with large language models

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains