Abstract: Following their success in natural language processing (NLP), there has been a shift towards transformer models in computer vision. While transformers perform well and offer promising multi-tasking performance, due to their high compute requirements, many resource-constrained applications still rely on convolutional or hybrid models that combine the benefits of convolution and attention layers and achieve the best results in the sub-100M parameter range. Simultaneously, task adaptation techniques that allow for the use of one shared transformer backbone for multiple downstream tasks, resulting in great storage savings at negligible cost in performance, have not yet been adopted for hybrid transformers. In this work, we investigate how to achieve the best task-adaptation performance and introduce PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers. We further combine PETAH adaptation with pruning to achieve highly performant and storage-friendly models for multi-tasking. In our extensive evaluation on classification and other vision tasks, we demonstrate that our PETAH-adapted hybrid models outperform established task-adaptation techniques for ViTs while requiring fewer parameters and being more efficient on mobile hardware.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to efficiently perform task adaptation on the hybrid Transformer architecture in a resource-constrained environment to achieve the optimal balance between parameter efficiency and performance. Specifically:

1. **Existing problems**:
   - Although Transformer models perform excellently in the fields of natural language processing (NLP) and computer vision, their computational requirements are high, and many resource-constrained applications still rely on convolutional neural networks or hybrid models that combine convolutional layers and attention mechanisms.
   - Current task adaptation techniques mainly focus on pure Transformer models and are not specifically optimized for the hybrid Transformer architecture.
2. **Research objectives**:
   - Explore how to achieve the best task adaptation performance in the hybrid Transformer architecture.
   - Introduce the PETAH (Parameter Efficient Task Adaptation for Hybrid Transformers) framework, which simultaneously adjusts the fully-connected layers and convolutional layers through the low-rank adaptation method, thereby achieving a better balance between parameter efficiency and performance.
   - Combine pruning techniques to further optimize the storage and computational efficiency of the model, making it more suitable for resource-constrained environments such as multi-task processing and mobile devices.
3. **Specific problems**:
   - How can the performance of the hybrid Transformer model on different downstream tasks be improved without significantly increasing the number of parameters?
   - Can the convolutional layers in the hybrid model be effectively adjusted by the low-rank adaptation method to improve the flexibility and performance of task adaptation?
   - In a resource-constrained environment, how can the efficiency and storage-friendliness of the model be ensured?

By addressing these problems, the paper aims to provide a new and efficient task adaptation method for the hybrid Transformer architecture, so that it can be applied more effectively to a range of computer vision tasks when resources are limited.
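
To make the parameter-efficiency argument concrete, the following is a minimal sketch of the arithmetic behind low-rank adaptation and a shared backbone. It is not the PETAH implementation. It assumes a generic rank-r update of the form B·A, and for convolutions it assumes the kernel is viewed as a flattened matrix of shape (c_out, c_in·k·k), which is one common convention and may differ from the paper's formulation. All function names, layer sizes, and the rank are hypothetical values chosen for illustration.

```python
def full_linear_params(d_in, d_out):
    """Weights in a dense d_out x d_in layer (bias ignored)."""
    return d_in * d_out


def lowrank_linear_params(d_in, d_out, rank):
    """Trainable weights of a rank-`rank` update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_in + d_out)


def full_conv_params(c_in, c_out, kernel):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return c_out * c_in * kernel * kernel


def lowrank_conv_params(c_in, c_out, kernel, rank):
    """Assumption: the conv kernel is flattened to a
    (c_out, c_in * kernel * kernel) matrix and updated with rank `rank`."""
    return rank * (c_in * kernel * kernel + c_out)


def multi_task_storage(backbone_params, adapter_params, num_tasks):
    """Return (separate, shared): parameters stored when every task keeps
    its own fine-tuned copy vs. one frozen backbone plus per-task adapters."""
    separate = num_tasks * backbone_params
    shared = backbone_params + num_tasks * adapter_params
    return separate, shared


if __name__ == "__main__":
    # Hypothetical layer sizes, not taken from the paper.
    print(full_linear_params(384, 384), lowrank_linear_params(384, 384, 8))
    print(full_conv_params(64, 128, 3), lowrank_conv_params(64, 128, 3, 8))
    print(multi_task_storage(10_000_000, 100_000, 5))
```

With these illustrative sizes, the rank-8 update trains a small fraction of the weights of each layer, and sharing one backbone across five tasks stores roughly one backbone's worth of parameters instead of five. Actual savings depend on the architecture, the chosen rank, and which layers are adapted.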

PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers in a resource-limited Context

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Time-, Memory- and Parameter-Efficient Visual Adaptation

FacT: Factor-Tuning for Lightweight Adaptation on Vision Transformer

Conv-Adapter: Exploring Parameter Efficient Transfer Learning for ConvNets

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation

Vision Transformer Adapters for Generalizable Multitask Learning

When Parameter-efficient Tuning Meets General-purpose Vision-language Models

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

PetS: A Unified Framework for Parameter-Efficient Transformers Serving

Ayaka: A Versatile Transformer Accelerator with Low-Rank Estimation and Heterogeneous Dataflow

Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions

Efficient Low-rank Backpropagation for Vision Transformer Adaptation

Multiple-Exit Tuning: Towards Inference-Efficient Adaptation for Vision Transformer

VMT-Adapter: Parameter-Efficient Transfer Learning for Multi-Task Dense Scene Understanding

VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

Hierarchical Side-Tuning for Vision Transformers