CEFM: CLIP Encoded Fusion Model for multimodal humor recognition on memes

Hou Shuo, Zhang Yijia, Wang Mengyi, Lin Hongfei, Lu Mingyu
DOI: https://doi.org/10.1007/s11042-024-20419-0
IF: 2.577
2024-11-13
Multimedia Tools and Applications
Abstract: With the rapid growth of social media, users can tweet about different events and topics to convey their feelings and emotions. Among such posts, memes have been gaining popularity over the years. However, current multimodal models are insufficient for detecting whether a meme is humorous. To address this, we make the following contributions in this paper. First, because public datasets for multimodal humor detection on memes are lacking, we construct a multi-lingual humor detection dataset called HUMEMES, which contains over 5000 thousand memes. Second, we propose a multimodal fusion model called CEFM that uses the CLIP encoder for better text and image representation. We use proposal and attribute information to enhance the representation of both modalities. Our model systematically analyzes the local and global perspectives of the input meme and relates them to the background context. This method better integrates multimodal information and achieves results that exceed the baseline methods. The full code and dataset are available at https://github.com/gilgamesh-nlp/CEFM.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering