Multi-modal Feature Distillation Emotion Recognition Method for Social Media

Xue Zhang, Mingjiang Wang, Xiao Zeng
DOI: https://doi.org/10.1109/qrs62785.2024.00051
2024-01-01
Abstract: With the rise of social media as a primary channel for information exchange, the spread of online public sentiment has become a key factor in escalating social conflicts and triggering public concern, with profound effects on social stability and public values. On some instant-interaction platforms, social media texts are often short and noisy, which makes traditional sentiment recognition methods ill-suited to scenarios where textual content is scarce and its semantics and emotion are uncertain. To address these challenges, this study proposes the TVE-MGF model (Textual-Visual Context Enhancement and Multi-Granularity Semantic Fusion Model), which uses ViLBERT for in-depth multimodal feature extraction from both visual context and semantic content. By enhancing the expression of textual and visual semantics, the model achieves multi-granularity fusion of social media text and its associated images, improving the precision and comprehensiveness of sentiment analysis. Furthermore, feature distillation is applied to optimize the TVE-MGF model, with the aim of more effectively extracting and using implicit knowledge from visual and textual data to construct higher-level semantic feature representations. This step strengthens the model’s capacity for generalization and for extracting critical knowledge, significantly improving performance on complex multimodal sentiment data. Finally, the method was validated experimentally on two multimodal datasets, MVSA-single and MVSA-multiple, on which the TVE-MGF model achieved F1 scores of $78.10\%$ and $79.03\%$, respectively, demonstrating its effectiveness in improving sentiment recognition in social media, particularly for texts with high semantic and emotional uncertainty.
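The abstract does not say how the feature distillation objective is defined. As a rough illustration of the general technique only, the sketch below combines a classification loss with a feature-matching term that pulls a student's features toward a teacher's. The function names, the choice of mean squared error, and the weight `alpha` are all assumptions made for this example and are not taken from the paper.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, label):
    """Cross-entropy of a softmax over logits against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def distillation_loss(student_feat, teacher_feat, student_logits, label, alpha=0.5):
    """Weighted sum of the task loss and the feature-matching loss.

    alpha is a hypothetical mixing weight: 0 ignores the teacher,
    1 uses only the feature-matching term.
    """
    task = cross_entropy(student_logits, label)
    match = mse(student_feat, teacher_feat)
    return (1 - alpha) * task + alpha * match
```

In a setting like the one described, `teacher_feat` and `student_feat` would be fused text-image representations and `label` a sentiment class, but that mapping is likewise an assumption.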