A New Perspective on Smiling and Laughter Detection: Intensity Levels Matter

Hugo Bohy, Kevin El Haddad, Thierry Dutoit
DOI: https://doi.org/10.1109/ACII55700.2022.9953896
2024-03-04
Abstract: Smiles and laughs detection systems have attracted a lot of attention in the past decade contributing to the improvement of human-agent interaction systems. But very few considered these expressions as distinct, although no prior work clearly proves them to belong to the same category or not. In this work, we present a deep learning-based multimodal smile and laugh classification system, considering them as two different entities. We compare the use of audio and vision-based models as well as a fusion approach. We show that, as expected, the fusion leads to a better generalization on unseen data. We also present an in-depth analysis of the behavior of these models on the smiles and laughs intensity levels. The analyses on the intensity levels show that the relationship between smiles and laughs might not be as simple as a binary one or even grouping them in a single category, and so, a more complex approach should be taken when dealing with them. We also tackle the problem of limited resources by showing that transfer learning allows the models to improve the detection of confusing intensity levels.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses several key issues in smile and laughter detection systems:

1. **Distinguishing between Smiles and Laughter**: Although both smiles and laughter are crucial in human communication, previous studies often treat them as a single category of expression, and none clearly demonstrates whether they belong to the same category. This paper proposes a deep learning-based multimodal classification system that treats smiles and laughter as two independent entities.
2. **Multimodal Fusion**: The paper compares audio-only and vision-only models against a model that combines the two modalities. The study finds that the fusion method generalizes better to unseen data.
3. **Impact of Intensity Levels**: The authors analyze smiles and laughter at different intensity levels in depth and find that deep learning systems do not treat these intensity levels equivalently during recognition. This indicates that the relationship between smiles and laughter may be more complex than a simple binary split or a single shared category.
4. **Limited Resources**: Collecting and annotating smile and laughter data in natural scenes is very challenging, so high-quality data is scarce. This paper uses transfer learning to leverage knowledge from models trained on speech data and improve the performance and generalization of the smile and laughter detection system.

### Main Contributions

1. **Proposing the First Deep Learning Classification System that Treats Smiles and Laughter as Different Entities**: This is the first attempt to distinguish and classify smiles and laughter separately.
2. **In-depth Analysis of Model Behavior**: The study shows that deep learning systems implicitly take the intensity levels of smiles and laughter into account, even without explicit intensity level labels.
3. **Application of Transfer Learning**: Transferring knowledge from lip-reading and audio word classification tasks improves the model's performance and addresses the issue of limited data resources.

### Conclusion

1. **Benefits of Transfer Learning**: Transfer learning significantly improves performance and generalization in most cases and should be prioritized over training models from scratch.
2. **Relationship between Smiles and Laughter**: Observations and analyses of intensity levels lead to the conclusion that the relationship between smiles and laughter is neither a simple binary split nor a single shared category, but a more complex structure.

### Future Work

1. **Dataset Improvement**: The current dataset contains some interfering factors (such as overlapping voices of interlocutors) and needs further cleaning to improve detection accuracy.
2. **Subjectivity of Annotations**: Because the number of annotators is limited, annotations may be subjective, potentially leading to labeling errors. Future work can increase the number of annotators to reduce the impact of subjectivity.
3. **More Complex Model Fusion**: Currently, a simple fully connected layer is used as the fusion mechanism. Future work can explore more complex fusion methods to better utilize multimodal information.
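
To make the fully connected fusion mentioned above more concrete, here is a minimal sketch of the general idea: an audio embedding and a visual embedding are concatenated and passed through a single fully connected layer followed by a softmax. It is not the authors' implementation. The class name `LateFusionClassifier`, the embedding sizes (128 and 256), the three-class output, and the randomly initialized weights are all hypothetical choices made for illustration; in a real system the embeddings would come from trained (possibly pretrained and fine-tuned) audio and vision encoders, and the layer would be trained on labeled data.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


class LateFusionClassifier:
    """Hypothetical fusion head: concatenate embeddings, apply one dense layer."""

    def __init__(self, audio_dim, video_dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = audio_dim + video_dim
        # Untrained, randomly initialized parameters (illustration only).
        self.weights = rng.normal(0.0, 0.01, size=(in_dim, n_classes))
        self.bias = np.zeros(n_classes)

    def predict_proba(self, audio_emb, video_emb):
        # Fuse the two modalities by concatenating along the feature axis.
        fused = np.concatenate([audio_emb, video_emb], axis=-1)
        # A single fully connected layer maps the fused vector to class scores.
        return softmax(fused @ self.weights + self.bias)


# Stand-in embeddings for a batch of 4 clips (assumed sizes, random values).
rng = np.random.default_rng(1)
audio_emb = rng.normal(size=(4, 128))
video_emb = rng.normal(size=(4, 256))

# Assumed class set of size 3; the actual label set may differ.
model = LateFusionClassifier(audio_dim=128, video_dim=256, n_classes=3)
probs = model.predict_proba(audio_emb, video_emb)
print(probs.shape)
```

Concatenation followed by a dense layer is about the simplest way to combine modalities, which is why richer fusion schemes are a natural direction for future work.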