LipsFormer: Introducing Lipschitz Continuity to Vision Transformers

Xianbiao Qi, Jianan Wang, Yihao Chen, Yukai Shi, Lei Zhang
2023-04-20
Abstract: We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability by learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property to ensure training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need of careful learning rate tuning such as warmup, yielding a faster convergence and better generalization. As a result, on the ImageNet 1K dataset, LipsFormer-Swin-Tiny based on Swin Transformer training for 300 epochs can obtain 82.7% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin, training for 300 epochs achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters. The code will be released at https://github.com/IDEA-Research/LipsFormer.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to address is the instability of the Transformer model during the training process. Specifically, the authors point out that although the Transformer has achieved great success in fields such as Natural Language Processing (NLP) and Computer Vision (CV), its training process is often very unstable, especially in the early stages of training. This instability not only affects the convergence speed of the model but can also lead to poor model performance.

To tackle this issue, the authors propose a new Transformer architecture called LipsFormer, which ensures training stability and model robustness by introducing Lipschitz continuity. Lipschitz continuity is a mathematical property that can limit the rate of change of a function, thereby ensuring that the output of the network does not undergo drastic changes under small perturbations of the input or weights. The authors believe that Lipschitz continuity is a more essential attribute for ensuring the stability of Transformer training.

Specifically, LipsFormer improves several key components of the Transformer:

1. **Layer Normalization (LayerNorm)**: Replaces LayerNorm with CenterNorm to avoid training instability caused by the standard deviation approaching zero.
2. **Self-Attention Mechanism**: Replaces Dot-Product Attention with Scaled Cosine Similarity Attention to ensure the Lipschitz continuity of the self-attention module.
3. **Residual Shortcut**: Introduces Weighted Residual Shortcut, which controls the Lipschitz constant of the residual path through a learnable scaling factor.
4. **Weight Initialization**: Adopts Spectral Initialization instead of the traditional Xavier initialization to ensure the 1-Lipschitz continuity of convolutional and feedforward connections.

With these improvements, LipsFormer can achieve stable training without using traditional techniques such as Learning Rate Warmup.
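The four component changes described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: all function names, the `eps` value, the default `tau` and `nu`, and the `alpha = 0.1` residual scale are assumptions chosen for illustration, and the paper's exact formulas may include additional scaling factors. The sketch only shows the general idea behind each replacement: centering without dividing by the standard deviation, attention over L2-normalized queries, keys, and values, a Xavier-style matrix rescaled to unit spectral norm, and a residual branch weighted by a small factor.

```python
import numpy as np

def center_norm(x, gamma, beta):
    """Subtract the per-token mean only; there is no division by the standard deviation."""
    return gamma * (x - x.mean(axis=-1, keepdims=True)) + beta

def l2_normalize(x, eps=1e-6):
    """Row-wise L2 normalization; eps (assumed value) keeps the denominator away from zero."""
    return x / np.sqrt((x * x).sum(axis=-1, keepdims=True) + eps)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_cosine_attention(x, w_q, w_k, w_v, tau=1.0, nu=1.0):
    """Single-head attention on L2-normalized q, k, v; tau and nu are scalar scales (assumed defaults)."""
    q = l2_normalize(x @ w_q)
    k = l2_normalize(x @ w_k)
    v = l2_normalize(x @ w_v)
    p = softmax(tau * (q @ k.T))  # cosine similarities lie in [-1, 1] before scaling
    return nu * (p @ v)

def spectral_init(rng, fan_in, fan_out):
    """Xavier-uniform draw divided by its largest singular value, giving spectral norm 1."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    w = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    return w / np.linalg.norm(w, ord=2)

def weighted_residual(x, branch_out, alpha):
    """Residual shortcut whose branch is scaled by a (learnable, here fixed) factor alpha."""
    return x + alpha * branch_out

# Usage on random data: one attention sub-block without LayerNorm or dot-product attention.
rng = np.random.default_rng(0)
n_tokens, dim = 4, 8
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (spectral_init(rng, dim, dim) for _ in range(3))
h = center_norm(x, gamma=np.ones(dim), beta=np.zeros(dim))
attn = scaled_cosine_attention(h, w_q, w_k, w_v)
y = weighted_residual(x, attn, alpha=0.1 * np.ones(dim))
```

In this sketch every attention output row is a convex combination of value vectors whose norm is at most 1, so the branch output stays bounded regardless of the input scale, which is the kind of behavior the Lipschitz argument relies on.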
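Experimental results on the ImageNet-1K dataset show that LipsFormer achieves higher classification accuracy than existing state-of-the-art Vision Transformer models: LipsFormer-Swin-Tiny reaches 82.7% after 300 epochs of training without any learning rate warmup, and LipsFormer-CSwin-Tiny reaches a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters.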