The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning

Longju Bai, Angana Borah, Oana Ignat, Rada Mihalcea
2024-11-19
Abstract: Large Multimodal Models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of most data and models. Conversely, multi-agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) We introduce MosAIC, a Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs with distinct cultural personas; (2) We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, CVQA; (3) We propose a culture-adaptable metric for evaluating cultural information within image captions; and (4) We show that the multi-agent interaction outperforms single-agent models across different metrics, and offer valuable insights for future research. Our dataset and models can be accessed at https://github.com/MichiganNLP/MosAIC.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the limited performance of existing Large Multimodal Models (LMMs) in cross-cultural contexts. Most data and models are predominantly Western-centric, which restricts their effectiveness at captioning images from other cultural backgrounds. To overcome this limitation, the authors propose a multi-agent interaction framework for cross-cultural image captioning. Specifically, this research:

1. **Introduces MosAIC**: A multi-agent framework that enhances cross-cultural image captioning by using LMMs with distinct cultural personas.
2. **Provides a culturally enriched dataset**: It contains 2,832 images with culturally enriched English captions from China, India, and Romania. The images are drawn from three datasets: GeoDE, GD-VCR, and CVQA.
3. **Proposes a culture-adaptable evaluation metric**: It measures the cultural information contained in image captions.
4. **Demonstrates that multi-agent interaction outperforms single-agent models**: Across different evaluation metrics, the multi-agent interaction setting outperforms single-agent models, and the results offer insights for future research.

Through these contributions, the research aims to improve the performance of multimodal models on cross-cultural image captioning, so that captions better capture and express visual content from different cultural backgrounds.