Abstract: The rapid advancements in Large Vision Models (LVMs), such as Vision Transformers (ViTs) and diffusion models, have led to an increasing demand for computational resources, resulting in substantial financial and environmental costs. This growing challenge highlights the necessity of developing efficient training methods for LVMs. Progressive learning, a training strategy in which model capacity gradually increases during training, has shown potential in addressing these challenges. In this paper, we present an advanced automated progressive learning (AutoProg) framework for efficient training of LVMs. We begin by focusing on the pre-training of LVMs, using ViTs as a case study, and propose AutoProg-One, an AutoProg scheme featuring momentum growth (MoGrow) and a one-shot growth schedule search. Beyond pre-training, we extend our approach to tackle transfer learning and fine-tuning of LVMs, and we expand the scope of AutoProg to cover a wider range of LVMs, including diffusion models. First, we introduce AutoProg-Zero, which enhances the AutoProg framework with a novel zero-shot unfreezing schedule search, eliminating the need for one-shot supernet training. Second, we introduce a novel Unique Stage Identifier (SID) scheme to bridge the gap during network growth. These innovations, integrated with the core principles of AutoProg, offer a comprehensive solution for efficient training across various LVM scenarios. Extensive experiments show that AutoProg accelerates ViT pre-training by up to 1.85x on ImageNet and accelerates fine-tuning of diffusion models by up to 2.86x, with comparable or even higher performance. This work provides a robust and scalable approach to efficient training of LVMs, with potential applications in a wide range of vision tasks. Code: https://github.com/changlin31/AutoProg-Zero
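To make the idea of progressive learning concrete, the following is a minimal sketch of how a growth schedule might translate into compute savings. It assumes, purely for illustration, that per-epoch cost is proportional to the fraction of the full model that is active; the `relative_cost` helper and the three-stage schedule are hypothetical and are not taken from the paper.

```python
def relative_cost(stages):
    """Estimate training cost relative to training the full model throughout.

    stages: list of (epochs, active_fraction) pairs, where active_fraction is
    the share of the full model trained in that stage (assumption: per-epoch
    cost scales linearly with this fraction).
    """
    total_epochs = sum(epochs for epochs, _ in stages)
    progressive = sum(epochs * frac for epochs, frac in stages)
    return progressive / total_epochs


# Hypothetical three-stage schedule (illustrative values, not from the paper).
schedule = [(100, 0.5), (100, 0.75), (100, 1.0)]
cost = relative_cost(schedule)
speedup = 1.0 / cost
print(f"relative cost {cost:.2f}, speedup {speedup:.2f}x")
```

Under this simplified cost model, a schedule that spends more epochs on small sub-networks is cheaper but may hurt final accuracy, which is the trade-off an automated schedule search is meant to balance.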

What problem does this paper attempt to address?

The paper addresses the financial and environmental cost of training large vision models (LVMs). As LVMs such as Vision Transformers (ViTs) and diffusion models have grown, training them demands ever more computing resources, which makes efficient training methods an important research direction. The paper proposes an advanced automated progressive learning framework (AutoProg) that trains LVMs efficiently by gradually increasing model capacity during training.

The main contributions and solution strategies are:

1. **Efficient pre-training**:
   - The AutoProg-One scheme speeds up ViT pre-training by combining momentum growth (MoGrow) with a one-shot growth schedule search.
   - The MoGrow operator maintains a momentum network to smooth the performance degradation that occurs when the model grows.
2. **Extension to fine-tuning and transfer learning**:
   - The AutoProg framework is extended to transfer learning and fine-tuning of LVMs, especially diffusion models.
   - AutoProg-Zero introduces a zero-shot unfreezing schedule search, which eliminates the need for one-shot supernet training.
   - A Unique Stage Identifier (SID) scheme is proposed to bridge the optimization gap during network growth.
3. **Extensive experimental verification**:
   - Across multiple datasets and LVM architectures, AutoProg accelerates training while maintaining or even improving model performance.
   - For ViTs, AutoProg-One accelerates pre-training on ImageNet by up to 1.85x.
   - For fine-tuning of diffusion models, AutoProg-Zero achieves speedups of 2.86x on Stable Diffusion and 2.56x on DiT.

Together, these contributions give a scalable method that substantially reduces training time with comparable or better performance, offering a new option for efficient training of large vision models.
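The momentum network idea behind MoGrow can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes the momentum network is an exponential moving average (EMA) of the trained weights, and that newly added layers are initialized by copying the momentum network's last layer. The helper names `ema_update` and `grow_depth` are hypothetical, and plain Python lists stand in for real layer parameters.

```python
from copy import deepcopy


def ema_update(momentum_params, online_params, decay=0.99):
    """In-place EMA update: m <- decay * m + (1 - decay) * w for every weight."""
    for m_layer, w_layer in zip(momentum_params, online_params):
        for i, w in enumerate(w_layer):
            m_layer[i] = decay * m_layer[i] + (1.0 - decay) * w


def grow_depth(online_params, momentum_params, new_depth):
    """Return a deeper copy of the online network.

    Assumption: each added layer is initialized from the momentum network's
    last layer rather than from random weights.
    """
    grown = deepcopy(online_params)
    while len(grown) < new_depth:
        grown.append(list(momentum_params[-1]))  # fresh copy per new layer
    return grown


# Toy network: 2 layers with 2 weights each.
online = [[1.0, 2.0], [3.0, 4.0]]
momentum = deepcopy(online)

online[1] = [5.0, 6.0]  # pretend an optimizer step changed the last layer
ema_update(momentum, online, decay=0.5)

grown = grow_depth(online, momentum, new_depth=4)
print(grown)
```

Because the momentum copy changes more slowly than the online weights, initializing new layers from it is one plausible way to soften the performance drop at each growth step.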

Efficient Training of Large Vision Models via Advanced Automated Progressive Learning

Automated Progressive Learning for Efficient Training of Vision Transformers

Auto-scaling Vision Transformers without Training

GhostViT: Expediting Vision Transformers Via Cheap Operations

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

Large-batch Optimization for Dense Visual Predictions

Effective Vision Transformer Training: A Data-Centric Perspective

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

Super Vision Transformer

Efficient Low-rank Backpropagation for Vision Transformer Adaptation

Training a Vision Transformer from scratch in less than 24 hours with 1 GPU

Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery

Data-efficient Large Vision Models through Sequential Autoregression

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning

LaVin-DiT: Large Vision Diffusion Transformer

An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

ViT-MVT: A Unified Vision Transformer Network for Multiple Vision Tasks

A Closer Look at Self-Supervised Lightweight Vision Transformers

HSViT: Horizontally Scalable Vision Transformer