Abstract: In fine-grained scene image classification, most previous works place heavy emphasis on global visual features when performing multi-modal feature fusion. In other words, models are deliberately designed based on prior intuitions about the importance of different modalities. In this paper, we present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter), which lets the model learn the importance of different modalities adaptively in different cases, without fixing a prior in the model architecture. More specifically, we eliminate the modal differences in distribution and then use a modality-agnostic Transformer encoder for semantic-level feature fusion. Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks while using the same modalities as previous methods. Moreover, new modalities can be easily added when using MAA, which further boosts performance. Code is available at https://github.com/quniLcs/MAA.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the issue of multimodal feature fusion in fine-grained scene image classification tasks. Specifically, most previous works overly rely on global visual features when performing multimodal feature fusion, neglecting the importance of other modalities such as text and local visual features. This approach is based on a prior intuition that global visual features are always the most distinctive and representative. However, this is not the case in practice, as the importance of different modalities varies in different situations.

To tackle this problem, the authors propose a new multimodal feature fusion method called the Modality-Agnostic Adapter (MAA). This method eliminates the distribution differences between different modalities and uses a modality-agnostic Transformer encoder for semantic-level feature fusion. This allows the model to adaptively learn the importance of different modalities in different situations without needing to preset specific modality preferences in the model architecture.

### Main Contributions

1. **Modality-Agnostic Feature Fusion**: The MAA method treats all modalities equally, eliminating the need to design new fusion methods for each modality combination, thereby simplifying model design.
2. **Performance Improvement**: Experimental results show that MAA achieves state-of-the-art performance in fine-grained scene image classification benchmarks.
3. **Ease of Extension**: New modalities can be easily added to MAA, further enhancing model performance.

### Experimental Validation

- **Datasets**: Con-Text and Crowd Activity datasets.
- **Experimental Setup**: Using ViT for global visual embeddings, KnowBert for text embeddings, and simple local visual embeddings.
- **Results**: MAA outperforms existing state-of-the-art models without using additional information; performance is further improved with the addition of local visual embeddings.

### Conclusion

The paper proposes a new multimodal feature fusion method, MAA, which achieves adaptive learning of the importance of different modalities through a modality-agnostic Transformer encoder. Experimental results validate the effectiveness of this method and demonstrate its superior performance in fine-grained scene image classification tasks.
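
### Illustrative Sketch

The fusion idea summarized above (bring each modality's embedding onto a comparable scale, then let one shared attention layer weigh all modalities) can be illustrated with a small standard-library Python sketch. This is not the authors' implementation. The standardization step is only an assumed stand-in for eliminating distribution differences, the weights are random and untrained, a single attention layer replaces the full Transformer encoder, and every name here (`standardize`, `self_attention`, `fuse`, the toy embeddings) is hypothetical.

```python
import math
import random


def standardize(vec):
    """Zero-mean, unit-variance scaling of one embedding (an assumed
    stand-in for removing per-modality distribution differences)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    std = math.sqrt(var + 1e-6)
    return [(x - mean) / std for x in vec]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Untrained random weights, scaled by 1/sqrt(cols)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention. The same weights are applied to every
    token, so no modality is privileged by the architecture."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    fused, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        fused.append([sum(a * v[i] for a, v in zip(attn, values))
                      for i in range(len(values[0]))])
    return fused, weights


def fuse(modalities, dim=8, seed=0):
    """Map each modality to one token of width `dim`, then fuse them."""
    rng = random.Random(seed)
    tokens = []
    for emb in modalities.values():
        proj = random_matrix(dim, len(emb), rng)  # per-modality projection
        tokens.append(standardize(matvec(proj, standardize(emb))))
    w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
    fused, weights = self_attention(tokens, w_q, w_k, w_v)
    # Mean-pool the fused tokens into one vector for a downstream classifier.
    pooled = [sum(tok[i] for tok in fused) / len(fused) for i in range(dim)]
    return pooled, weights


# Toy embeddings with deliberately different scales and lengths.
modalities = {
    "global_visual": [0.2, -1.3, 0.7, 2.1, 0.0, 0.4],
    "text": [120.0, 80.0, 95.0, 101.0],
    "local_visual": [0.01, 0.03, -0.02, 0.05, 0.00],
}
pooled, weights = fuse(modalities)
```

Each row of `weights` is a distribution over the input modalities, which is where a trained model would express how much each modality matters for a given input. Adding a modality only means adding another entry to the dictionary; the attention layer itself does not change.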

Fine-Grained Scene Image Classification with Modality-Agnostic Adapter

Learning Modality-Specific and -Agnostic Representations for Asynchronous Multimodal Language Sequences

Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

Self-adaptive attention fusion for multimodal aspect-based sentiment analysis

Improving Fine-grained Image Classification with Multimodal Information

Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation

Alignment and Fusion Using Distinct Sensor Data for Multimodal Aerial Scene Classification

MA-ViT: Modality-Agnostic Vision Transformers for Face Anti-Spoofing

Multimodal Representation Learning by Alternating Unimodal Adaptation

Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion

Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

MSAF: Multimodal Split Attention Fusion

MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation

SFAF-MA: Spatial Feature Aggregation and Fusion With Modality Adaptation for RGB-Thermal Semantic Segmentation

Equivariant Multi-Modality Image Fusion

MSFNet: modality smoothing fusion network for multimodal aspect-based sentiment analysis

Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes

Deep Multimodal Data Fusion

MDA: An Interpretable and Scalable Multi-Modal Fusion under Missing Modalities and Intrinsic Noise Conditions