Abstract: Smiles and laughs detection systems have attracted a lot of attention in the past decade contributing to the improvement of human-agent interaction systems. But very few considered these expressions as distinct, although no prior work clearly proves them to belong to the same category or not. In this work, we present a deep learning-based multimodal smile and laugh classification system, considering them as two different entities. We compare the use of audio and vision-based models as well as a fusion approach. We show that, as expected, the fusion leads to a better generalization on unseen data. We also present an in-depth analysis of the behavior of these models on the smiles and laughs intensity levels. The analyses on the intensity levels show that the relationship between smiles and laughs might not be as simple as a binary one or even grouping them in a single category, and so, a more complex approach should be taken when dealing with them. We also tackle the problem of limited resources by showing that transfer learning allows the models to improve the detection of confusing intensity levels.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper attempts to address several key issues in smile and laughter detection systems:

1. **Distinguishing between Smile and Laughter**: Although both smile and laughter are crucial in human communication, previous studies often treat them as the same category of expression or do not clearly demonstrate whether they belong to different categories. This paper proposes a deep learning-based multimodal classification system that treats smile and laughter as two independent entities for classification.
2. **Multimodal Fusion**: The paper compares the effectiveness of using audio and visual models separately versus combining these two modalities. The study finds that the fusion method has better generalization ability on unseen data.
3. **Impact of Intensity Levels**: The authors conduct an in-depth analysis of smiles and laughter at different intensity levels, discovering that these intensity levels are not equivalent in the recognition process of deep learning systems. This indicates that the relationship between smile and laughter may be more complex than a simple binary or single classification.
4. **Limited Resources**: Collecting and annotating smile and laughter data in natural scenes is very challenging, resulting in very limited high-quality data. This paper uses transfer learning techniques to leverage knowledge from models trained on speech data to improve the performance and generalization ability of the smile and laughter detection system.

### Main Contributions

1. **Proposing the First Deep Learning Classification System that Treats Smile and Laughter as Different Entities**: This is the first attempt to distinguish and classify smile and laughter separately.
2. **In-depth Analysis of Model Behavior**: The study shows that deep learning systems implicitly consider the intensity levels of smile and laughter without explicit intensity level labels.
3. **Application of Transfer Learning**: By transferring knowledge from lip-reading tasks and audio word classification tasks, the model's performance is improved, addressing the issue of limited data resources.

### Conclusion

1. **Benefits of Transfer Learning**: Transfer learning significantly improves performance and generalization in most cases and should be prioritized over training models from scratch.
2. **Relationship between Smile and Laughter**: Observations and analyses of intensity levels lead to the conclusion that the relationship between smile and laughter is not a simple binary or single classification but a more complex structure.

### Future Work

1. **Dataset Improvement**: The current dataset contains some interfering factors (such as overlapping voices of interlocutors), which need further cleaning to improve detection accuracy.
2. **Subjectivity of Annotations**: Due to the limited number of annotators, annotations may be subjective, potentially leading to labeling errors. Future work can increase the number of annotators to reduce the impact of subjectivity.
3. **More Complex Model Fusion**: Currently, a simple fully connected layer fusion mechanism is used. Future work can explore more complex fusion methods to better utilize multimodal information.
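
The "simple fully connected layer fusion mechanism" mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding sizes, the two-class label set `CLASSES`, the random initialisation, and the helper names `init_fusion_layer` and `fuse_and_classify` are all assumptions made for illustration. The idea shown is only the general one: concatenate an audio embedding and a visual embedding, pass the result through a single fully connected layer, and apply a softmax to get class probabilities.

```python
import math
import random

# Assumed label set for illustration; the paper's exact classes are not given here.
CLASSES = ("smile", "laugh")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_fusion_layer(in_dim, out_dim, seed=0):
    """Create random weights and zero biases for one fully connected layer.

    In a trained system these values would be learned; random values are
    used here only so the sketch runs end to end.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def fuse_and_classify(audio_emb, visual_emb, weights, bias):
    """Concatenate both embeddings, apply the layer, return class probabilities."""
    fused = list(audio_emb) + list(visual_emb)
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("weight rows must match the fused embedding length")
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return dict(zip(CLASSES, softmax(logits)))


# Hypothetical embeddings, as if produced by separate audio and visual encoders.
audio_emb = [0.2, -0.1, 0.4]
visual_emb = [0.7, 0.0, -0.3, 0.5]
weights, bias = init_fusion_layer(len(audio_emb) + len(visual_emb), len(CLASSES))
print(fuse_and_classify(audio_emb, visual_emb, weights, bias))
```

In a transfer learning setting such as the one the summary describes, the two encoders would typically start from weights pretrained on another task (lip reading or audio word classification, according to the summary), while a small fusion layer like this one is trained on the smile and laughter data.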

A New Perspective on Smiling and Laughter Detection: Intensity Levels Matter
