Weight subcloning: direct initialization of transformers using larger pretrained ones

Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari

2023-12-15

Abstract: Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained model of the same size and specification to increase the convergence and training speed. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach, called weight subcloning, expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning involves an operation on the pretrained model to obtain the equivalent initialized scaled-down model. It consists of two key steps: first, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model. Then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training, which gains significant improvements in training speed compared to random initialization. For instance, we achieve 4x faster training for vision transformers in image classification and language models designed for next token prediction.

Machine Learning, Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper addresses the large amount of data and compute needed to train large Transformer models from scratch for a target task. The usual remedy is transfer learning: the model is initialized with the weights of a pretrained model of the same size and specification, which speeds up convergence and training. But what if no pretrained model of the required size is available? The paper introduces a simple and effective technique, weight subcloning, which transfers the knowledge of a larger pretrained model to a smaller variant by initializing the smaller model's weights directly from the larger one.

Weight subcloning consists of two steps:

1. **Neuron importance ranking**: rank neurons by importance and use the ranking to reduce the embedding dimension of each layer of the pretrained model.
2. **Block removal**: remove blocks from the Transformer so that the number of layers matches the scaled-down network.

The resulting network is ready for training and trains significantly faster than a randomly initialized one. For example, the authors report 4x faster training for vision transformers in image classification and for language models trained on next-token prediction.
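The two steps can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes each block is a single square weight matrix over the embedding dimension, that a dimension's importance is the summed absolute weight in its column across all blocks, and that the retained blocks are evenly spaced. The names `rank_embedding_dims`, `pick_blocks`, and `subclone` are hypothetical, and the paper's actual importance score and block-selection rule may differ.

```python
from typing import List

Matrix = List[List[float]]  # one square (dim x dim) weight matrix per block


def rank_embedding_dims(blocks: List[Matrix]) -> List[int]:
    """Return embedding-dimension indices, most important first.

    ASSUMPTION: the importance of dimension j is the summed |weight| in
    column j across all blocks. The paper's actual score may differ.
    """
    dim = len(blocks[0])
    scores = [0.0] * dim
    for w in blocks:
        for row in w:
            for j, value in enumerate(row):
                scores[j] += abs(value)
    return sorted(range(dim), key=lambda j: scores[j], reverse=True)


def pick_blocks(num_blocks: int, target_depth: int) -> List[int]:
    """Choose which blocks to keep.

    ASSUMPTION: keep evenly spaced blocks, including the first and the last.
    """
    if target_depth == 1:
        return [0]
    return [(i * (num_blocks - 1)) // (target_depth - 1) for i in range(target_depth)]


def subclone(blocks: List[Matrix], target_dim: int, target_depth: int) -> List[Matrix]:
    """Build a narrower, shallower initialization from pretrained blocks."""
    # Use the same kept dimensions in every block so that the shared
    # embedding space stays consistent from one layer to the next.
    keep = sorted(rank_embedding_dims(blocks)[:target_dim])
    return [
        [[blocks[b][i][j] for j in keep] for i in keep]
        for b in pick_blocks(len(blocks), target_depth)
    ]
```

Calling `subclone(blocks, target_dim, target_depth)` on a list of pretrained block matrices returns `target_depth` matrices of size `target_dim` x `target_dim`, which would then serve as the starting point for ordinary training. Sharing one set of kept indices across all blocks is a design choice of this sketch: it keeps the input and output dimensions of consecutive blocks aligned.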

Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization

Transfer training from smaller language model

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Learning to Grow Pretrained Models for Efficient Transformer Training

Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

Reusing Pretrained Models by Multi-linear Operators for Efficient Training

Mimetic Initialization of Self-Attention Layers

Task-Attentive Transformer Architecture for Continual Learning of Vision-and-Language Tasks Using Knowledge Distillation

On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models

Isomorphic Model-Based Initialization for Convolutional Neural Networks

An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Prune Once for All: Sparse Pre-Trained Language Models

Dynamic Clone Transformer for Efficient Convolutional Neural Netwoks

Effective Theory of Transformers at Initialization

WAVE: Weight Template for Adaptive Initialization of Variable-sized Models

Simpler is Better: off-the-shelf Continual Learning Through Pretrained Backbones

Deep Fusion: Efficient Network Training via Pre-trained Initializations