Abstract: We propose InCA, a lightweight method for transfer learning that cross-attends to any activation layer of a pre-trained model. During training, InCA uses a single forward pass to extract multiple activations, which are passed to external cross-attention adapters that are trained anew and then combined or selected for downstream tasks. We show that, even when selecting a single top-scoring adapter, InCA achieves performance comparable to full fine-tuning, at a cost comparable to fine-tuning just the last layer. For example, with a cross-attention probe 1.3% the size of a pre-trained ViT-L/16 model, we achieve performance within 0.2% of the full fine-tuning paragon at a computational training cost of 51% of the baseline, on average across 11 downstream classification tasks. Unlike other forms of efficient adaptation, InCA does not require backpropagating through the pre-trained model, thus leaving its execution unaltered at both training and inference. The versatility of InCA is best illustrated in fine-grained tasks, which may require accessing information absent in the last layer but accessible in intermediate layer activations. Since the backbone is fixed, InCA allows parallel ensembling as well as parallel execution of multiple tasks. InCA achieves state-of-the-art performance in the ImageNet-to-Sketch multi-task benchmark.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the efficient, flexible, and modular adaptation of large-scale models to downstream tasks. Specifically, it proposes a framework named InCA (Introspective Cross-Attention), which performs transfer learning by training lightweight cross-attention modules attached to the intermediate activation layers of a large-scale base model.

#### Main problems include:

1. **High computational and storage costs of full-parameter fine-tuning**:
   - For large-scale models (such as ViT-G/14, which contains more than 1.8 billion parameters), full-parameter fine-tuning requires computational resources and memory that are often prohibitive in practice.
   - A fully fine-tuned model serves only the task it was tuned for, so it cannot share computation across tasks or execute multiple tasks in parallel.
2. **Limitations of existing efficient adaptation methods**:
   - Existing parameter-efficient adaptation methods (such as LoRA and Visual Prompt Tuning) reduce the number of trained parameters, but still fall short in optimization and computational efficiency, especially for large-scale models.
   - These methods usually update their parameters by backpropagating through the entire network, which incurs a large computational overhead.
3. **How to effectively utilize the internal representations of pre-trained models**:
   - Large-scale pre-trained models already contain rich representations, but extracting and using them effectively is a challenge, especially because different downstream tasks may need different representations.
### InCA's solutions

To solve the above problems, InCA proposes the following innovations:

- **Lightweight cross-attention adapters**: By introducing lightweight cross-attention modules, InCA can efficiently adapt to new downstream tasks without changing the structure of the base model.
- **Parallel training of adapters**: InCA can train multiple adapters in parallel. Each adapter depends only on one intermediate activation layer of the base model, which avoids backpropagating through the base model and significantly reduces computational and memory costs.
- **Modularity and flexibility**: InCA's adapters are modular and can be combined or used alone, which suits scenarios such as multi-task inference and incremental learning.

### Experimental results

The experimental results show that InCA achieves performance comparable to or even better than full-parameter fine-tuning on multiple visual recognition tasks while significantly reducing the demand for computational resources. For example, on the ViT-L/16 architecture, a single adapter accounts for only 1.3% of the total model parameters, yet comes within 0.2% of full fine-tuning accuracy on average across 11 challenging downstream classification tasks. InCA also shows its advantages on larger-scale models (such as ViT-G/14): it can train more than 20 adapters in parallel on a single V100 GPU while reducing GPU memory occupancy by 76%.

### Summary

By proposing the InCA framework, the paper addresses the efficient adaptation of large-scale models to downstream tasks. It improves computational and storage efficiency and makes adaptation more flexible and modular, offering a new direction for transfer learning with large-scale models.
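As a rough illustration of the cross-attention adapters described above, the following NumPy sketch shows a forward pass only: a small set of learnable queries attends over the token activations of one intermediate layer, and the pooled result feeds a linear classifier. The class name `CrossAttentionProbe`, the helper `fake_frozen_activations`, the layer names, and all dimensions are assumptions made for illustration. The random arrays merely stand in for activations cached from a single forward pass of a frozen backbone, and the paper's actual adapter architecture may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class CrossAttentionProbe:
    """Hypothetical single-head cross-attention adapter (forward pass only)."""

    def __init__(self, dim, num_classes, num_queries=1, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        self.dim = dim
        # Parameters that would be trained; the backbone is never touched.
        self.queries = rng.normal(0.0, scale, (num_queries, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        self.w_out = rng.normal(0.0, scale, (num_queries * dim, num_classes))

    def __call__(self, activations):
        # activations: (batch, tokens, dim), cached from a frozen layer.
        keys = activations @ self.w_k                      # (B, T, D)
        values = activations @ self.w_v                    # (B, T, D)
        scores = self.queries @ keys.transpose(0, 2, 1)    # (B, Q, T)
        attn = softmax(scores / np.sqrt(self.dim), axis=-1)
        pooled = attn @ values                             # (B, Q, D)
        flat = pooled.reshape(pooled.shape[0], -1)         # (B, Q * D)
        return flat @ self.w_out                           # (B, classes)


def fake_frozen_activations(batch, tokens, dim, layers, seed=1):
    # Assumption: random arrays standing in for the activations that one
    # forward pass of a frozen pre-trained model would produce.
    rng = np.random.default_rng(seed)
    return {name: rng.normal(size=(batch, tokens, dim)) for name in layers}


# One "forward pass" yields activations for several layers at once.
acts = fake_frozen_activations(batch=4, tokens=16, dim=32,
                               layers=["block_6", "block_12"])

# One independent probe per layer; each sees only its own activations.
probes = {name: CrossAttentionProbe(dim=32, num_classes=10, seed=i)
          for i, name in enumerate(acts)}
logits = {name: probes[name](a) for name, a in acts.items()}
```

Because each probe reads only cached activations, no gradient would need to flow into the backbone during training, which is why the probes can be trained independently and later selected or combined.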

Your representations are in the network: composable and parallel adaptation for large scale models
