Abstract: We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability by learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property to ensure training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need of careful learning rate tuning such as warmup, yielding a faster convergence and better generalization. As a result, on the ImageNet 1K dataset, LipsFormer-Swin-Tiny based on Swin Transformer training for 300 epochs can obtain 82.7% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin, training for 300 epochs achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters. The code will be released at https://github.com/IDEA-Research/LipsFormer.

What problem does this paper attempt to address?

The problem this paper attempts to address is the instability of the Transformer model during the training process. Specifically, the authors point out that although the Transformer has achieved great success in fields such as Natural Language Processing (NLP) and Computer Vision (CV), its training process is often very unstable, especially in the early stages of training. This instability not only affects the convergence speed of the model but can also lead to poor model performance.

To tackle this issue, the authors propose a new Transformer architecture called LipsFormer, which ensures training stability and model robustness by introducing Lipschitz continuity. Lipschitz continuity is a mathematical property that limits the rate of change of a function, thereby ensuring that the output of the network does not undergo drastic changes under small perturbations of the input or weights. The authors argue that Lipschitz continuity is a more essential attribute for ensuring the stability of Transformer training.

Specifically, LipsFormer improves several key components of the Transformer:

1. **Layer Normalization (LayerNorm)**: Replaces LayerNorm with CenterNorm to avoid training instability caused by the standard deviation approaching zero.
2. **Self-Attention Mechanism**: Replaces Dot-Product Attention with Scaled Cosine Similarity Attention to ensure the Lipschitz continuity of the self-attention module.
3. **Residual Shortcut**: Introduces Weighted Residual Shortcut, which controls the Lipschitz constant of the residual path through a learnable scaling factor.
4. **Weight Initialization**: Adopts Spectral Initialization instead of the traditional Xavier initialization to ensure the 1-Lipschitz continuity of convolutional and feedforward connections.

With these improvements, LipsFormer can achieve stable training without using traditional techniques such as Learning Rate Warmup.
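The four component replacements described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`center_norm`, `spectral_init`, `scaled_cosine_attention`, `weighted_residual`) are hypothetical, and several details are assumptions that this summary does not state: the `d / (d - 1)` factor in CenterNorm, the L2 normalization of the value rows, the `tau` and `nu` scaling parameters, and the choice of a Xavier-style uniform draw before spectral rescaling.

```python
import numpy as np

def center_norm(x, gamma, beta):
    """CenterNorm sketch: subtract the per-token mean, never divide by the std."""
    d = x.shape[-1]
    centered = x - x.mean(axis=-1, keepdims=True)
    # Assumption: the d / (d - 1) rescaling factor is not stated in the summary.
    return gamma * (d / (d - 1)) * centered + beta

def spectral_init(shape, rng):
    """Xavier-style uniform draw, rescaled so the largest singular value is 1."""
    fan_out, fan_in = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    w = rng.uniform(-limit, limit, size=shape)
    return w / np.linalg.norm(w, ord=2)  # ord=2 is the spectral norm

def scaled_cosine_attention(q, k, v, tau=1.0, nu=1.0, eps=1e-6):
    """Attention on L2-normalized rows, so every logit lies in [-tau, tau]."""
    def l2_normalize(m):
        # eps keeps the denominator away from zero for all-zero rows.
        return m / np.sqrt((m * m).sum(axis=-1, keepdims=True) + eps)
    qn, kn, vn = l2_normalize(q), l2_normalize(k), l2_normalize(v)
    logits = tau * qn @ kn.T
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return nu * weights @ vn

def weighted_residual(x, branch, alpha):
    """Residual shortcut whose branch is scaled by a (learnable) factor alpha."""
    return x + alpha * branch(x)

# Usage: one toy block applied to 4 tokens of dimension 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
w_proj = spectral_init((8, 8), rng)
normed = center_norm(tokens, gamma=np.ones(8), beta=np.zeros(8))
block_out = weighted_residual(
    normed,
    lambda t: scaled_cosine_attention(t @ w_proj.T, t @ w_proj.T, t),
    alpha=0.1,
)
print(block_out.shape)
```

The sketch shows why each piece bounds how fast the output can change: the centering step has no division that can blow up, the cosine logits are bounded by `tau`, the projection has spectral norm 1 at initialization, and a small `alpha` limits the contribution of the residual branch.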
Experimental results on the ImageNet-1K dataset show that LipsFormer trains stably without learning rate warmup: LipsFormer-Swin-Tiny obtains 82.7% accuracy after 300 epochs, and LipsFormer-CSwin-Tiny achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters.

LipsFormer: Introducing Lipschitz Continuity to Vision Transformers

ViT-LSLA: Vision Transformer with Light Self-Limited-Attention

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Unified Normalization for Accelerating and Stabilizing Transformers

Improve Vision Transformers Training by Suppressing Over-smoothing

SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization

Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight

DeepViT: Towards Deeper Vision Transformer

Demystify Transformers & Convolutions in Modern Image Deep Networks

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Understanding the Difficulty of Training Transformers

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

Stabilizing Transformer Training by Preventing Attention Entropy Collapse

Lite Vision Transformer with Enhanced Self-Attention

Vicinity Vision Transformer

On the Connection between Local Attention and Dynamic Depth-wise Convolution

BViT: Broad Attention based Vision Transformer

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit