Benchmarking VLMs' Reasoning About Persuasive Atypical Images

Sina Malakouti, Aysan Aghazadeh, Ashmit Khandelwal, Adriana Kovashka
2024-10-10
Abstract: Vision language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1 (e) shows a beer with a feather-like texture. This requires advanced reasoning to deduce that this atypical representation signifies the beer's lightness. We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs' understanding of atypicality in persuasive images. We evaluate how well VLMs use atypicality to infer an ad's message and test their reasoning abilities by employing semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements. Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses the limited ability of vision-language models (VLMs) to understand atypical images with rhetorical and persuasive features, such as advertisements. Specifically, the researchers focus on:

1. **Whether current VLMs can reason about atypicality and understand advertisements**: Although VLMs have shown strong zero-shot generalization across multiple tasks, their performance on complex, persuasive advertisement images has not been studied in depth.
2. **The impact of atypicality on understanding advertisement images**: Advertisements often use unusual combinations of objects to convey a specific message, for example through texture replacement or object nesting. Interpreting the advertiser's intent from these techniques requires advanced reasoning.

To this end, the authors introduce three new tasks to evaluate VLMs' understanding of atypicality:

- **Multi-label Atypicality Classification (MAC)**: Predict the multiple atypicality categories present in the image.
- **Atypicality Statement Retrieval (ASR)**: Find, among multiple candidate statements, the correct statement describing the atypical relationship between two objects.
- **Atypical Object Recognition (AOR)**: Given an atypical relationship, generate the correct primary and secondary objects to complete the atypicality statement.

In addition, the authors generate semantically challenging negative examples (such as wrong actions or reasons) to test the model's reasoning ability more strictly. The experimental results show that although VLMs struggle to infer atypicality directly, appropriate prompting strategies can extract valuable information about atypical aspects, which can then be used to enhance image descriptions and the final action-reason retrieval performance.
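
To make the MAC and ASR task formats concrete, here is a minimal sketch of how predictions for each could be scored. The summary does not state which metrics the paper uses, so per-image F1 for MAC and top-1 accuracy for ASR are assumptions chosen for illustration, and the function names, the `RetrievalItem` type, and the label strings are all hypothetical.

```python
from dataclasses import dataclass


def multilabel_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 overlap between predicted and gold atypicality categories for one image.

    Assumed metric for MAC; the paper's actual metric is not given in the summary.
    """
    if not predicted and not gold:
        return 1.0  # nothing to predict and nothing predicted
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


@dataclass
class RetrievalItem:
    """One ASR-style example: a correct statement mixed with hard negatives."""
    candidates: list[str]   # candidate atypicality statements
    correct_index: int      # position of the correct statement
    scores: list[float]     # model score per candidate, higher means better match


def retrieval_accuracy(items: list[RetrievalItem]) -> float:
    """Fraction of items whose top-scored candidate is the correct statement.

    Assumes every item has at least one score.
    """
    if not items:
        return 0.0
    hits = 0
    for item in items:
        best = max(range(len(item.scores)), key=item.scores.__getitem__)
        if best == item.correct_index:
            hits += 1
    return hits / len(items)


if __name__ == "__main__":
    # Hypothetical category labels, loosely based on the techniques named above.
    print(multilabel_f1({"texture_replacement"},
                        {"texture_replacement", "object_inside_object"}))
    example = RetrievalItem(
        candidates=["beer has the texture of a feather",
                    "feather has the texture of a beer"],
        correct_index=0,
        scores=[0.7, 0.6],
    )
    print(retrieval_accuracy([example]))
```

The second candidate in the usage example swaps the two objects, which is one plausible form a semantically challenging negative could take. A model that only matches object names, without reasoning about their relationship, would score both candidates about equally.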
In conclusion, this work reveals the limitations of VLMs' reasoning ability in dealing with complex and atypical visual media, and provides new benchmarks and directions for future research.