Fine-Grained Scene Image Classification with Modality-Agnostic Adapter

Yiqun Wang, Zhao Zhou, Xiangcheng Du, Xingjiao Wu, Yingbin Zheng, Cheng Jin
2024-07-03
Abstract: In fine-grained scene image classification, most previous works place heavy emphasis on global visual features when performing multi-modal feature fusion. In other words, models are deliberately designed around prior intuitions about the importance of different modalities. In this paper, we present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter), which aims to make the model adaptively learn the importance of different modalities in different cases, without fixing a prior setting in the model architecture. More specifically, we eliminate the modal differences in distribution and then use a modality-agnostic Transformer encoder for semantic-level feature fusion. Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks using the same modalities as previous methods. Moreover, new modalities can easily be added when using MAA to further boost performance. Code is available at https://github.com/quniLcs/MAA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address the issue of multimodal feature fusion in fine-grained scene image classification tasks. Specifically, most previous works overly rely on global visual features when performing multimodal feature fusion, neglecting the importance of other modalities such as text and local visual features. This approach is based on a prior intuition that global visual features are always the most distinctive and representative. However, this is not the case in practice, as the importance of different modalities varies in different situations.

To tackle this problem, the authors propose a new multimodal feature fusion method called the Modality-Agnostic Adapter (MAA). This method eliminates the distribution differences between different modalities and uses a modality-agnostic Transformer encoder for semantic-level feature fusion. This allows the model to adaptively learn the importance of different modalities in different situations without needing to preset specific modality preferences in the model architecture.

### Main Contributions

1. **Modality-Agnostic Feature Fusion**: The MAA method treats all modalities equally, eliminating the need to design new fusion methods for each modality combination, thereby simplifying model design.
2. **Performance Improvement**: Experimental results show that MAA achieves state-of-the-art performance in fine-grained scene image classification benchmarks.
3. **Ease of Extension**: New modalities can be easily added to MAA, further enhancing model performance.

### Experimental Validation

- **Datasets**: Con-Text and Crowd Activity datasets.
- **Experimental Setup**: Using ViT for global visual embeddings, KnowBert for text embeddings, and simple local visual embeddings.
- **Results**: MAA outperforms existing state-of-the-art models without using additional information; performance is further improved with the addition of local visual embeddings.
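The fusion idea summarized above (remove per-modality distribution differences, then let one shared attention layer weigh all modalities) can be illustrated with a small sketch. This is not the authors' implementation: it assumes every modality embedding has already been projected to the same dimension, it uses per-embedding standardization as a stand-in for whatever mechanism MAA uses to align distributions, and it replaces the multi-layer Transformer encoder with a single self-attention head. All function names, the random weights, and the toy embeddings are hypothetical.

```python
import math
import random


def standardize(vec):
    """Rescale one embedding to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    std = math.sqrt(var + 1e-6)  # small epsilon avoids division by zero
    return [(x - mean) / std for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention; the same weights serve every modality."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    fused, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)  # learned importance of each modality
        weights.append(attn)
        fused.append([sum(a * v[i] for a, v in zip(attn, values))
                      for i in range(len(values[0]))])
    return fused, weights


def fuse(modalities, w_q, w_k, w_v):
    """Standardize each modality token, attend across them, then mean-pool."""
    tokens = [standardize(m) for m in modalities]
    fused, weights = self_attention(tokens, w_q, w_k, w_v)
    dim = len(fused[0])
    pooled = [sum(tok[i] for tok in fused) / len(fused) for i in range(dim)]
    return pooled, weights


# Toy usage: three modalities whose raw statistics differ widely.
rng = random.Random(0)
dim = 8


def random_matrix():
    return [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]


w_q, w_k, w_v = random_matrix(), random_matrix(), random_matrix()
global_visual = [rng.gauss(5.0, 10.0) for _ in range(dim)]
text = [rng.gauss(0.0, 0.1) for _ in range(dim)]
local_visual = [rng.gauss(-2.0, 3.0) for _ in range(dim)]

pooled, weights = fuse([global_visual, text, local_visual], w_q, w_k, w_v)
print(len(pooled), [round(w, 3) for w in weights[0]])
```

Because `fuse` never refers to a modality by name or position, adding another modality only means appending one more embedding to the input list, which mirrors the ease-of-extension claim in the contributions.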
### Conclusion

The paper proposes a new multimodal feature fusion method, MAA, which achieves adaptive learning of the importance of different modalities through a modality-agnostic Transformer encoder. Experimental results validate the effectiveness of this method and demonstrate its superior performance in fine-grained scene image classification tasks.