Abstract: In this paper, we consider the problem of prototype-based vision-language reasoning. We observe that existing methods encounter three major challenges: 1) escalating resource demands and prolonged training times, 2) excessive learnable parameters, and 3) fine-tuning based on only a single modality. These challenges hinder their capability to adapt Vision-Language Models (VLMs) to downstream tasks. Motivated by this critical observation, we propose a novel method called NODE-Adapter, which utilizes Neural Ordinary Differential Equations for better vision-language reasoning. To fully leverage both visual and textual modalities and estimate class prototypes more effectively and accurately, we divide our method into two stages: cross-modal prototype construction and cross-modal prototype optimization using neural ordinary differential equations. Specifically, we exploit a VLM to encode hand-crafted prompts into textual features and few-shot support images into visual features. Then, we estimate the textual prototype and visual prototype by averaging the textual features and visual features, respectively, and adaptively combine the textual prototype and visual prototype to construct the cross-modal prototype. To alleviate the prototype bias, we then model the prototype optimization process as an initial value problem with Neural ODEs to estimate the continuous gradient flow. Our extensive experimental results, which cover few-shot classification, domain generalization, and visual reasoning on human-object interaction, demonstrate that the proposed method significantly outperforms existing state-of-the-art approaches.

What problem does this paper attempt to address?

This paper attempts to address three main challenges that Vision-Language Models (VLMs) encounter when adapting to downstream tasks:

1. **Increased resource requirements and extended training time**: Existing methods require more computational resources when dealing with large-scale datasets, and the training time increases significantly.
2. **Excessive learnable parameters**: Many methods introduce a large number of learnable parameters, leading to increased model complexity and greater training difficulty.
3. **Fine-tuning based on a single modality only**: Existing fine-tuning methods usually rely on a single modality (such as vision or text), which limits the model's effective utilization of multi-modal information.

To solve these problems, the authors propose a new method named NODE-Adapter. This method optimizes cross-modal prototypes by using Neural Ordinary Differential Equations (Neural ODEs), thereby enhancing the ability of vision-language reasoning. Specifically, NODE-Adapter divides the method into two stages:

1. **Cross-modal prototype construction**: First, use a pre-trained VLM (such as CLIP) to encode hand-crafted prompts into text features and a small number of sample images into visual features. Then, estimate the text prototype and the visual prototype by averaging these features and adaptively combine these two prototypes to construct cross-modal prototypes.
2. **Cross-modal prototype optimization**: To reduce prototype bias, the authors model the prototype optimization process as an initial value problem and use Neural ODEs to estimate the continuous gradient flow, thereby optimizing the prototype. This method can more effectively capture the dynamic characteristics of the prototype and reduce the bias caused by limited training data.
In this way, NODE-Adapter can significantly outperform existing state-of-the-art methods on tasks such as few-shot classification, domain generalization, and visual reasoning on human-object interaction.

### Mathematical formula summary

- **Text prototype calculation**:
  \[ t_{j,i}=E_t(\{\pi_i; \text{class } j\})\in\mathbb{R}^D \]
  \[ P_t = [\bar{t}_1, \bar{t}_2,\cdots, \bar{t}_N]\in\mathbb{R}^{N\times D} \]
  where \( \bar{t}_j \) is the average of the text features \( t_{j,i} \) of class \( j \) over the hand-crafted prompts \( \pi_i \).
- **Visual prototype calculation**:
  \[ P_v = [\bar{v}_1, \bar{v}_2,\cdots, \bar{v}_N]\in\mathbb{R}^{N\times D} \]
  where \( \bar{v}_j \) is the average of the visual features of the support images of class \( j \).
- **Initial cross-modal prototype construction**:
  \[ p_j=\lambda_j\cdot\bar{v}_j+(1 - \lambda_j)\cdot\bar{t}_j \]
  where
  \[ \lambda_j=\frac{1}{1+\exp(-\bar{v}_j^\top u)} \]
- **Neural ODEs model**:
  \[ \frac{dp(t)}{dt}=f_\theta(p(t),t,S) \]
  where \( f_\theta \) is a neural network parameterized by \( \theta \), and \( S \) is the support set.
- **Forward propagation**:
  \[ p(t_m)=p(t_0)+\int_{t_0}^{t_m}f_\theta(p(t),t,S)\,dt=\text{ODESolve}(p(t_0),f_\theta,t_0,t_m,S) \]
- **Backward propagation**: \( \frac{\partial L}{\partial\theta} \) …
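
The prototype construction and forward-propagation formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: the feature arrays stand in for VLM encoder outputs, the gate vector `u` is simply passed in as a given array, `toy_gradient_flow` is a hypothetical stand-in for the learned network \( f_\theta \), and a fixed-step Euler loop stands in for `ODESolve`.

```python
import numpy as np


def cross_modal_prototypes(text_feats, visual_feats, u):
    """Build initial cross-modal prototypes p_j.

    text_feats:   (N, M, D) text features, M prompts per class (assumed shape)
    visual_feats: (N, K, D) visual features, K support images per class (assumed shape)
    u:            (D,) gate vector (assumed to be given)
    """
    t_bar = text_feats.mean(axis=1)    # text prototypes, (N, D)
    v_bar = visual_feats.mean(axis=1)  # visual prototypes, (N, D)
    # lambda_j = sigmoid(v_bar_j^T u), one mixing weight per class
    lam = 1.0 / (1.0 + np.exp(-(v_bar @ u)))
    return lam[:, None] * v_bar + (1.0 - lam[:, None]) * t_bar


def euler_odesolve(p0, f, t0, tm, support, steps=20):
    """Integrate dp/dt = f(p, t, support) from t0 to tm with fixed-step Euler."""
    p = np.array(p0, dtype=float)
    h = (tm - t0) / steps
    t = t0
    for _ in range(steps):
        p = p + h * f(p, t, support)
        t += h
    return p


def toy_gradient_flow(p, t, support):
    """Hypothetical stand-in for f_theta: pull each prototype toward
    the mean of its class's support features."""
    return support.mean(axis=1) - p


rng = np.random.default_rng(0)
text_feats = rng.normal(size=(3, 4, 8))    # 3 classes, 4 prompts, D = 8
visual_feats = rng.normal(size=(3, 5, 8))  # 3 classes, 5 support images
u = np.zeros(8)                            # zero gate gives lambda_j = 0.5

p_init = cross_modal_prototypes(text_feats, visual_feats, u)
p_final = euler_odesolve(p_init, toy_gradient_flow, 0.0, 1.0, visual_feats)
print(p_init.shape, p_final.shape)
```

In the method summarized above, \( f_\theta \) would be trained and its parameters updated through the backward pass; the toy flow here only shows how an initial prototype is moved along a continuous trajectory by the solver.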

NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning

Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models

Tackling Vision Language Tasks Through Learning Inner Monologues

Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models

Unsupervised Prototype Adapter for Vision-Language Models

GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter

NMN-VD: A Neural Module Network for Visual Dialog

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Multi-Modal Adapter for Vision-Language Models

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning

Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Deep Neural Networks for Visual Reasoning

Visual Coreference Resolution in Visual Dialog using Neural Module Networks

Smart Vision-Language Reasoners

InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding