S3 Agent: Unlocking the Power of VLLM for Zero-Shot Multi-modal Sarcasm Detection

Peng Wang, Yongheng Zhang, Hao Fei, Qiguang Chen, Yukai Wang, Jiasheng Si, Wenpeng Lu, Min Li, Libo Qin
DOI: https://doi.org/10.1145/3690642
2024-01-01
Abstract: Multi-modal sarcasm detection involves determining whether a given multi-modal input conveys sarcastic intent by analyzing the underlying sentiment. Recently, vision large language models have shown remarkable success on a variety of multi-modal tasks. Inspired by this, we systematically investigate the impact of vision large language models on the zero-shot multi-modal sarcasm detection task. Furthermore, to capture different perspectives of sarcastic expressions, we propose a multi-view agent framework, S3 Agent, designed to enhance zero-shot multi-modal sarcasm detection by leveraging three critical perspectives: superficial expression, semantic information, and sentiment expression. Our experiments on the MMSD2.0 dataset, involving six models and four prompting strategies, demonstrate that our approach achieves state-of-the-art performance. Our method achieves an average improvement of 13.2% in accuracy. Moreover, we evaluate our method on the text-only sarcasm detection task, where it also surpasses baseline approaches.