Abstract: Pre-training & fine-tuning can enhance the transferring efficiency and performance in visual tasks. Recent delta-tuning methods provide more options for visual classification tasks. Despite their success, existing visual delta-tuning art fails to exceed the upper limit of full fine-tuning on challenging tasks like object detection and segmentation. To find a competitive alternative to full fine-tuning, we propose the Multi-cognitive Visual Adapter (Mona) tuning, a novel adapter-based tuning method. First, we introduce multiple vision-friendly filters into the adapter to enhance its ability to process visual signals, while previous methods mainly rely on language-friendly linear filters. Second, we add the scaled normalization layer in the adapter to regulate the distribution of input features for visual filters. To fully demonstrate the practicality and generality of Mona, we conduct experiments on multiple representative visual tasks, including instance segmentation on COCO, semantic segmentation on ADE20K, object detection on Pascal VOC, oriented object detection on DOTA/STAR, and image classification on three common datasets. Exciting results illustrate that Mona surpasses full fine-tuning on all these tasks, and is the only delta-tuning method outperforming full fine-tuning on the above various tasks. For example, Mona achieves 1% performance gain on the COCO dataset compared to full fine-tuning. Comprehensive results suggest that Mona-tuning is more suitable for retaining and utilizing the capabilities of pre-trained models than full fine-tuning. The code will be released at https://github.com/Leiyi-Hu/mona.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to exceed the performance upper limit of full fine-tuning on visual recognition tasks while reducing the number of updated parameters**. Specifically, although existing delta-tuning methods perform well on simple tasks, they cannot surpass full fine-tuning on complex visual tasks such as object detection and segmentation. The authors therefore propose Multi-cognitive Visual Adapter (Mona) tuning, which aims to strengthen visual signal processing by optimizing the input distribution and introducing multi-cognitive convolutional filters, so as to exceed full fine-tuning on several representative visual tasks.

### Main Problem Summary

1. **Limitations of Existing Delta-Tuning Methods**:
   - Delta-tuning methods perform well on simple tasks, but they cannot surpass full fine-tuning on complex visual tasks such as object detection and segmentation.
   - Existing visual adapter designs rely mainly on linear filters, which are better suited to language signals than to visual signals.
2. **Bottlenecks of Full Fine-Tuning**:
   - Full fine-tuning performs very well, but it must update all parameters, which consumes substantial computing resources and incurs high storage costs.
   - On some datasets (such as Pascal VOC), full fine-tuning may lead to over-fitting.
3. **Goals of the Mona Tuning Method**:
   - Enhance visual signal processing by introducing multi-cognitive convolutional filters and optimizing the input distribution.
   - Exceed the performance of full fine-tuning on several representative visual tasks (such as instance segmentation, semantic segmentation, and object detection).
   - Reduce the number of newly introduced parameters and lower storage and computing costs.

### Solutions

The authors propose the Mona tuning method, which mainly includes the following innovations:

1. **Multi-cognitive Convolutional Filters**:
   - Introduce multiple depthwise convolutions to process visual signals at different scales.
   - Use 3x3, 5x5 and 7x7 convolution kernels and aggregate the resulting features through 1x1 convolutions to enhance the understanding of visual signals.
2. **Input Optimization**:
   - Add a scaled layer normalization in front of the adapter to adjust the input distribution so that it better fits the data distribution of new tasks.
   - Control the proportion of normalized and original input through two learnable weights \( s_1 \) and \( s_2 \):

     \[ x_{\text{norm}} = s_1 \cdot |x_0|_{\text{LN}} + s_2 \cdot x_0 \]

     where \( |\cdot|_{\text{LN}} \) denotes the LayerNorm operation and \( x_0 \) is the original input.
3. **Experimental Verification**:
   - The authors conduct extensive experiments on several representative visual tasks, including instance segmentation (COCO), semantic segmentation (ADE20K), object detection (Pascal VOC), oriented object detection (DOTA/STAR) and image classification.
   - The experimental results show that Mona tuning outperforms full fine-tuning on all these tasks. In particular, on the COCO dataset, Mona achieves a 1% performance gain over full fine-tuning.

### Conclusion

By introducing multi-cognitive convolutional filters and optimizing the input distribution, Mona tuning exceeds the performance of full fine-tuning on multiple visual tasks while keeping the number of newly introduced parameters small, which lowers storage and computing costs. This suggests that Mona tuning is an efficient and practical alternative to full fine-tuning and may become a preferred method for transfer learning on visual tasks.
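As a rough illustration of the two components described under Solutions, the sketch below implements the scaled normalization formula and a multi-scale depthwise filter in plain Python. It is not the authors' implementation (their code is at the repository linked in the abstract). Several things here are assumptions made only for illustration: all function names are hypothetical, fixed box-average kernels stand in for learned depthwise weights, the three branches are combined by a simple average, and LayerNorm is applied without affine parameters.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no affine terms)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scaled_layer_norm(x0, s1, s2):
    """x_norm = s1 * LN(x0) + s2 * x0; s1 and s2 would be learned in practice."""
    return [s1 * n + s2 * v for n, v in zip(layer_norm(x0), x0)]


def box_kernel(k):
    """Assumption: a fixed k x k averaging kernel stands in for learned weights."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def depthwise_conv2d(channel, kernel):
    """Filter ONE channel (H x W) with ONE square kernel, zero-padded to keep the size."""
    k = len(kernel)
    pad = k // 2
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # outside the map counts as zero
                        acc += kernel[di][dj] * channel[y][x]
            out[i][j] = acc
    return out


def multi_scale_filter(feature_map, pointwise, sizes=(3, 5, 7)):
    """feature_map: C channels of H x W; pointwise: C_out x C weights of the 1x1 conv."""
    h, w = len(feature_map[0]), len(feature_map[0][0])
    mixed = []
    for channel in feature_map:
        # One depthwise branch per kernel size, each applied to this channel alone.
        branches = [depthwise_conv2d(channel, box_kernel(k)) for k in sizes]
        # Assumption: branches are combined by averaging.
        mixed.append([[sum(b[i][j] for b in branches) / len(branches)
                       for j in range(w)] for i in range(h)])
    # 1x1 convolution: a per-pixel linear mix across channels.
    return [[[sum(wgt * mixed[c][i][j] for c, wgt in enumerate(row))
              for j in range(w)] for i in range(h)] for row in pointwise]


if __name__ == "__main__":
    print(scaled_layer_norm([1.0, 2.0, 3.0, 4.0], s1=1.0, s2=0.5))
    ones = [[[1.0] * 8 for _ in range(8)] for _ in range(2)]
    out = multi_scale_filter(ones, pointwise=[[0.5, 0.5]])
    print(len(out), len(out[0]), len(out[0][0]))
```

Setting `s1 = 0` and `s2 = 1` returns the input unchanged, and `s1 = 1`, `s2 = 0` gives plain LayerNorm, so the two weights interpolate between the original and the normalized distribution. In a deep learning framework, the depthwise branches would typically be grouped convolutions with one group per channel rather than explicit loops.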

5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks

Multifaceted Analysis of Fine-Tuning in Deep Model for Visual Recognition

Partial Fine-Tuning: A Successor to Full Fine-Tuning for Vision Transformers

Minimal Interaction Edge Tuning: A New Paradigm for Visual Adaptation

Mini but Mighty: Finetuning ViTs with Mini Adapters

Tuning Vision-Language Models with Multiple Prototypes Clustering

AMF: Adaptable Weighting Fusion with Multiple Fine-tuning for Image Classification

PROFIT: A Specialized Optimizer for Deep Fine Tuning

DR-Tune: Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration

Atten-Adapter: A Unified Attention-Based Adapter for Efficient Tuning

Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models

MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

Tuning-Free Visual Customization via View Iterative Self-Attention Control

Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning

Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning

Split & Merge: Unlocking the Potential of Visual Adapters via Sparse Training

What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Consolidator: Mergeable Adapter with Grouped Connections for Visual Adaptation

One Step Learning, One Step Review

Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation