Abstract: We propose expanding a shared Transformer module to produce and initialize Transformers of varying depths, enabling adaptation to diverse resource constraints. Drawing an analogy to genetic expansibility, we term such a module a learngene. To identify the expansion mechanism, we examine the relationship between a layer's position and its corresponding weight values, and find that a linear function approximates this relationship well. Building on this insight, we present Transformer as Linear Expansion of learnGene (TLEG), a novel approach for flexibly producing and initializing Transformers of diverse depths. Specifically, to learn the learngene, we first construct an auxiliary Transformer linearly expanded from the learngene and then train it with soft distillation. Subsequently, we can produce and initialize Transformers of varying depths by linearly expanding the well-trained learngene, thereby supporting diverse downstream scenarios. Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance than many individual models trained from scratch, while reducing training cost by around 2x. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100). When models of varying depths must be produced to fit different resource constraints, TLEG achieves comparable results while storing around 19x fewer parameters to initialize these models and requiring around 5x lower pre-training cost than the pre-training and fine-tuning approach. When transferring a fixed set of parameters to initialize different models, TLEG offers better flexibility and competitive performance while storing around 2.9x fewer parameters for initialization than the pre-training approach.
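
To make the linear expansion idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the learngene is stored as two shared tensors, here called theta_a (a slope) and theta_b (an offset), and that the weight of layer l in a model of a given depth is theta_b + (l / depth) * theta_a. The names, the 0-indexed layer position, and the normalization by depth are all illustrative assumptions; the abstract only states that weights are approximately linear in layer position.

```python
import numpy as np

def expand_learngene(theta_a, theta_b, depth):
    """Build `depth` layer weights as a linear function of layer position.

    Assumed (hypothetical) parameterization, not taken from the paper:
    layer l (0-indexed) receives theta_b + (l / depth) * theta_a.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return [theta_b + (l / depth) * theta_a for l in range(depth)]

# Toy learngene: one 4x4 weight matrix for the slope and one for the offset.
rng = np.random.default_rng(0)
theta_a = rng.normal(size=(4, 4))
theta_b = rng.normal(size=(4, 4))

# The same two tensors initialize models of different depths.
shallow = expand_learngene(theta_a, theta_b, depth=6)
deep = expand_learngene(theta_a, theta_b, depth=12)
```

Under these assumptions, only theta_a and theta_b need to be stored regardless of the target depth, which illustrates why a single learngene can initialize many models of different sizes.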

Increasing transformer token length with a Maximum Entropy Principle Method

Scaling Transformer to 1M tokens and beyond with RMT

Transformers Can Achieve Length Generalization But Not Robustly

Reformer: The Efficient Transformer

Reducing the Transformer Architecture to a Minimum

A General and Efficient Training for Transformer via Token Expansion

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Efficiently Scaling Transformer Inference

Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Transformers Can Do Arithmetic with the Right Embeddings

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Looped Transformers for Length Generalization

LongNet: Scaling Transformers to 1,000,000,000 Tokens

Balancing Cost and Benefit with Tied-Multi Transformers

Make A Long Image Short: Adaptive Token Length for Vision Transformers

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

Transformer As Linear Expansion of Learngene

An Intrinsic Dimension Perspective of Transformers for Sequential Modeling