Abstract: The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints such as peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of diverse hardware platforms. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation naturally yields a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity, without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices, from ARM CPUs to NVIDIA GPUs, and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5x and 2.5x faster runtime and 1.2x and 2.0x lower peak memory utilization, respectively. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy than the 350M-parameter OPT across 14 tasks, with up to 1.6x lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong, simple baseline for future NAS methods in autoregressive language modeling.
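
The search idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The `Candidate` type, the parameter-count formula (attention and feed-forward weights only, ignoring biases, layer norms, and embeddings), and the latency values are all assumptions made for illustration; in practice the latency would be measured on the target device.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical architecture description used only for this sketch."""
    n_layer: int
    d_model: int
    d_inner: int
    latency_ms: float  # placeholder value; assumed to be measured on the target device


def decoder_params(c: Candidate) -> int:
    """Approximate decoder parameter count (assumption: weights only,
    4*d_model^2 for attention projections plus 2*d_model*d_inner for the FFN)."""
    attn = 4 * c.d_model * c.d_model
    ffn = 2 * c.d_model * c.d_inner
    return c.n_layer * (attn + ffn)


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both axes.

    More decoder parameters stands in for lower perplexity (the proxy),
    and lower latency is the hardware cost to minimize.
    """
    front = []
    for c in cands:
        dominated = any(
            decoder_params(o) >= decoder_params(c)
            and o.latency_ms <= c.latency_ms
            and (decoder_params(o) > decoder_params(c) or o.latency_ms < c.latency_ms)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_ms)


# Made-up candidates and latencies, for illustration only.
candidates = [
    Candidate(4, 256, 1024, 10.0),
    Candidate(8, 256, 1024, 19.0),
    Candidate(4, 512, 2048, 22.0),
    Candidate(8, 256, 512, 21.0),
]

for c in pareto_front(candidates):
    print(c, decoder_params(c))
```

No model is trained at any point: ranking relies only on the parameter count and the measured hardware cost, which is why such a search can run on a device without a GPU. In this example the last candidate is dropped because the second one has more decoder parameters at lower latency.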