Md Kamrul Hasan, Wasifur Rahman, Amir Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe Morency, Mohammed (Ehsan) Hoque
Abstract: Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the usage of words (text), gestures (vision) and prosodic cues (acoustic). Understanding humor from these three modalities falls within boundaries of multimodal language; a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, in a multimodal context it is an understudied area. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding multimodal language used in expressing humor. The dataset and accompanying studies present a framework in multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.
What problem does this paper attempt to address?
This paper addresses how to understand and model humor as it is expressed in multimodal language. Specifically, the authors aim to fill the research gap in multimodal humor detection by introducing a multimodal dataset named UR-FUNNY. The dataset covers three modalities (text, vision, and acoustics) and is used to train and evaluate machine-learning models that predict whether a passage will trigger immediate laughter from the audience, that is, whether its last sentence is a punchline.
### Main Problems
1. **Understanding and Modeling of Multimodal Humor**: Humor is usually conveyed through several modalities at once (such as words, facial expressions, and voice), and these modalities interact in complex ways. Combining them effectively to understand and model humor is a challenge.
2. **Context Dependence**: Humor often depends on the gradual buildup of a story that ends in a sudden turn (the punchline), which triggers the laughter. Understanding humor therefore requires analyzing the context that precedes the punchline.
3. **Acquisition and Processing of Multimodal Data**: Modeling multimodal humor effectively requires a diverse dataset that covers many speakers and topics and in which the text, visual, and acoustic modalities are accurately aligned.
### Solutions
- **UR-FUNNY Dataset**: The dataset is extracted from TED talk videos and contains 8,257 humorous instances and 8,257 non-humorous instances. Each instance includes text, visual, and acoustic data and is annotated with its punchline and context.
- **Multimodal Fusion Model**: The authors propose the Contextual Memory Fusion Network (C-MFN), an extension of the Memory Fusion Network (MFN), for multimodal humor detection. C-MFN adds unimodal context networks and a multimodal context network to better capture context information and the interactions between modalities.
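
As a rough illustration of what one dataset instance carries, the sketch below models a sample as a label plus a punchline and a context, each stored per modality. The class name `HumorSample`, the field names, and the nested-list feature layout are assumptions made for this example and are not the dataset's actual file format; the word-alignment check likewise assumes that all three modalities are aligned at the word level.

```python
from dataclasses import dataclass
from typing import Dict, List

MODALITIES = ("t", "v", "a")  # text, visual, acoustic

# One sentence in one modality: one feature vector per word (assumed layout).
Sentence = List[List[float]]


@dataclass
class HumorSample:
    """Hypothetical container for one instance: label, punchline, context."""

    label: int                          # 1 = humorous, 0 = non-humorous
    punchline: Dict[str, Sentence]      # punchline features per modality
    context: Dict[str, List[Sentence]]  # preceding sentences per modality

    def __post_init__(self) -> None:
        if self.label not in (0, 1):
            raise ValueError("label must be 0 or 1")
        for m in MODALITIES:
            if m not in self.punchline or m not in self.context:
                raise ValueError(f"missing modality {m!r}")
        # Assumption: modalities are word-aligned, so word counts must match.
        if len({len(self.punchline[m]) for m in MODALITIES}) != 1:
            raise ValueError("punchline modalities are not word-aligned")
        if len({len(self.context[m]) for m in MODALITIES}) != 1:
            raise ValueError("context sentence counts differ across modalities")

    @property
    def num_context_sentences(self) -> int:
        return len(self.context["t"])


# Toy instance: a two-word punchline preceded by two context sentences.
sample = HumorSample(
    label=1,
    punchline={"t": [[0.1], [0.2]], "v": [[0.3], [0.4]], "a": [[0.5], [0.6]]},
    context={m: [[[0.0]], [[0.0], [0.0]]] for m in MODALITIES},
)
```

Validating alignment at construction time means a misaligned instance fails early rather than inside a model's forward pass.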
### Experimental Results
The experiments show that the model using all three modalities (text, vision, and acoustics) together performs best, which indicates that multimodal information matters for humor detection. The punchline alone is more informative than the context alone, but combining the two improves accuracy further.
### Formula Representation
The problem definition and model structure use the following notation:
- **Problem Definition**:
  - Each data sample is a triple \((l, P, C)\), where \(l\) is a binary label (humorous or non-humorous), \(P\) is the punchline, and \(C\) is the context.
  - Both the punchline and the context contain multiple modalities: \(P=\{P_m; m\in M\}\), \(C = \{C_m; m\in M\}\), where \(M=\{t, v, a\}\) is the set of text, visual, and acoustic modalities.
- **Model Structure**:
  - **Unimodal Context Network**: An LSTM encodes the context in each modality, and the output is \(\mathbf{H}=\{h_{m,n}; m\in M, 1\leq n < N_C\}\).
  - **Multimodal Context Network**: A self-attention mechanism fuses the unimodal context information, and the output is \(\hat{\mathbf{H}}\).
  - **Memory Fusion Network (MFN)**: \(\mathbf{H}\) and \(\hat{\mathbf{H}}\) are used to initialize the MFN, which then updates its multi-view gated memory through the Delta-memory Attention Network.
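
To make the fusion step in the Multimodal Context Network more concrete, here is a minimal sketch of scaled dot-product self-attention over a sequence of context vectors. It is a generic illustration and not the paper's network: it assumes each context sentence has already been reduced to a single vector (for example, by concatenating that sentence's per-modality encodings), and it uses identity projections where a real model would learn query, key, and value weights. The function names are chosen for this example.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def self_attention(states: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity projections.

    Each output vector is a weighted average of all input vectors, with
    weights given by the softmax of the scaled dot products between the
    query vector and every input vector.
    """
    dim = len(states[0])
    scale = math.sqrt(dim)
    fused = []
    for query in states:
        weights = softmax([dot(query, key) / scale for key in states])
        fused.append([
            sum(w * value[i] for w, value in zip(weights, states))
            for i in range(dim)
        ])
    return fused


# Three toy context-sentence vectors; each output mixes all three inputs.
context_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_states = self_attention(context_states)
```

Because every output is a convex combination of the inputs, each fused vector can draw on information from any context sentence, which is the property that lets attention relate the buildup of a story to later sentences.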
With this dataset and model, the paper demonstrates how multimodal data and deep-learning models can be used to understand and model humorous expression.