Abstract: Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the usage of words (text), gestures (vision) and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, in a multimodal context it is an understudied area. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.

What problem does this paper attempt to address?

This paper aims to understand and model humor as it is expressed in multimodal language. Specifically, the authors aim to fill the research gap in multimodal humor detection by introducing a multimodal dataset named UR-FUNNY. The dataset covers three modalities (text, vision, and acoustics) and is used to train and evaluate machine-learning models that detect whether a passage will trigger immediate laughter from the audience, that is, whether its last sentence constitutes a punchline.

### Main Problems

1. **Understanding and Modeling of Multimodal Humor**: Humor is usually conveyed through multiple modalities (such as words, expressions, and voice), and these modalities interact in complex ways. Combining them effectively to understand and model humor is a challenge.
2. **Context-Dependence**: Humor often depends on the gradual buildup of a story and triggers laughter through a sudden turn (i.e., the punchline). Understanding humor therefore requires analyzing the context that precedes the punchline.
3. **Acquisition and Processing of Multimodal Data**: Modeling multimodal humor effectively requires a diverse dataset that covers different speakers and topics and accurately aligns the text, visual, and acoustic modalities.

### Solutions

- **UR-FUNNY Dataset**: The dataset is extracted from TED talk videos and contains 8,257 humorous instances and 8,257 non-humorous instances. Each instance includes data in the text, visual, and acoustic modalities and is annotated with punchline and context information.
- **Multimodal Fusion Model**: The authors propose the Context-Memory Fusion Network (C-MFN), an extension of the Memory Fusion Network (MFN), for multimodal humor detection. C-MFN better captures context information and the interactions between modalities by introducing unimodal context networks and a multimodal context network.

### Experimental Results

The experimental results show that the model using all three modalities (text, vision, and acoustics) performs best, indicating the importance of multimodal information for humor detection. In addition, the punchline alone is more informative than the context alone, but combining the two further improves the accuracy of the model.

### Formula Representation

The problem definition and model structure use the following notation:

- **Problem Definition**:
  - Each data sample can be represented as a triple \((l, P, C)\), where \(l\) is a binary label (humorous or non-humorous), \(P\) is the punchline, and \(C\) is the context.
  - Both the punchline and the context contain multiple modalities: \(P=\{P_m; m\in M\}\), \(C = \{C_m; m\in M\}\), where \(M=\{t, v, a\}\) denotes the text, visual, and acoustic modalities respectively.
- **Model Structure**:
  - **Unimodal Context Network**: An LSTM encodes the information of each modality, and the output is \(\mathbf{H}=\{h_{m,n}; m\in M, 1\leq n < N_C\}\).
  - **Multimodal Context Network**: A self-attention mechanism fuses the unimodal context information, and the output is \(\hat{\mathbf{H}}\).
  - **Memory Fusion Network (MFN)**: The MFN memory is initialized from \(\mathbf{H}\) and \(\hat{\mathbf{H}}\), and the multi-view gated memory is then updated through the Delta-memory Attention Network.

Together, the dataset and the model show how multimodal data and deep-learning models can be used to understand and model humorous expression.
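As a rough illustration of the problem definition, the sketch below shows one way a single \((l, P, C)\) sample could be held in memory. It is not the dataset's actual file format or loader: the class name `HumorSample`, its field names, and the nested-list feature layout are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

# The modality set M = {t, v, a}: text, visual, acoustic.
MODALITIES = ("t", "v", "a")

# Hypothetical feature layout: a sentence is a list of per-word feature vectors.
Sentence = List[List[float]]


@dataclass
class HumorSample:
    """One (l, P, C) triple; names are illustrative, not taken from UR-FUNNY."""

    label: int                          # l: 1 = humorous, 0 = non-humorous
    punchline: Dict[str, Sentence]      # P = {P_m; m in M}
    context: Dict[str, List[Sentence]]  # C = {C_m; m in M}

    def __post_init__(self) -> None:
        if self.label not in (0, 1):
            raise ValueError("label must be 0 or 1")
        for name, part in (("punchline", self.punchline), ("context", self.context)):
            if set(part) != set(MODALITIES):
                raise ValueError(f"{name} must contain exactly the modalities {MODALITIES}")
        # All modalities must cover the same number of context sentences.
        if len({len(self.context[m]) for m in MODALITIES}) != 1:
            raise ValueError("context modalities disagree on the number of sentences")

    @property
    def num_context_sentences(self) -> int:
        return len(self.context["t"])


# Toy sample: two context sentences and a punchline, with 2-dimensional features.
toy_sentence = [[0.1, 0.2], [0.3, 0.4]]
sample = HumorSample(
    label=1,
    punchline={m: toy_sentence for m in MODALITIES},
    context={m: [toy_sentence, toy_sentence] for m in MODALITIES},
)
print(sample.num_context_sentences)  # prints 2
```

Keeping the punchline and the context as separate per-modality dictionaries mirrors the notation \(P=\{P_m\}\) and \(C=\{C_m\}\), and makes it easy to feed each modality's context to its own encoder before fusion.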

UR-FUNNY: A Multimodal Language Dataset for Understanding Humor

Towards Multimodal Prediction of Spontaneous Humour: A Novel Dataset and First Results

Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models

We Are Humor Beings: Understanding and Predicting Visual Humor

Is AI fun? HumorDB: a curated dataset and benchmark to investigate graphical humor

Talk Funny! A Large-Scale Humor Response Dataset with Chain-of-Humor Interpretation

FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild

#HashtagWars: Learning a Sense of Humor

A Two-Model Approach for Humour Style Recognition

"So You Think You're Funny?": Rating the Humour Quotient in Standup Comedy

HumourHindiNet: Humour detection in Hindi web series using word embedding and convolutional neural network

CEFM: CLIP Encoded Fusion Model for multimodal humor recognition on memes

Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning

Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba

DPP: A Dual-Phase Processing Method for Cross-Cultural Humor Detection

DeHumor: Visual Analytics for Decomposing Humor

Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor

OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?

QuMIN: quantum multi-modal data fusion for humor detection

From Generalized Laughter to Personalized Chuckles: Unleashing the Power of Data Fusion in Subjective Humor Detection

Multimodal Cross-Lingual Features and Weight Fusion for Cross-Cultural Humor Detection