Integrating Content-Semantics-World Knowledge to Detect Stress from Videos

Yang Ding, Yi Dai, Xin Wang, Ling Feng, Lei Cao, Huijun Zhang
DOI: https://doi.org/10.1145/3664647.3680584
2024-01-01
Abstract: Stress has rapidly emerged as a significant public health concern in contemporary society, necessitating prompt identification and effective intervention strategies. Video-based stress detection offers a non-invasive, low-cost, and mass-reaching approach for identifying stress. In this paper, we propose a three-level content-semantics-world knowledge framework that addresses three issues in video-based stress detection. (1) How can video semantics be abstracted and encoded, together with frame contents, into a visual representation? (2) How can general-purpose Large Multimodal Models (LMMs) be leveraged to augment a task-specific visual representation? (3) To what extent can general-purpose LMMs contribute to video-based stress detection? We design a Slow-Emotion-Fast-Action scheme that encodes both the fast temporal changes of body actions revealed in video frames and the subtle details of emotions in each video segment into the visual representation. We augment the task-specific visual representation with linguistic descriptions of facial expressions obtained by prompting general-purpose LMMs. A knowledge retriever is designed to evaluate and select the most suitable LMM output. Experimental results on two video-based stress detection datasets show that (1) our proposed three-level framework achieves a 90.89% F1-score on the UVSD dataset and an 80.79% F1-score on the RSL dataset, outperforming the state of the art; (2) leveraging LMMs improves the F1-score by 2.25% on UVSD and 3.55% on RSL, compared to using the traditional Facial Action Coding System; (3) relying purely on general-purpose LMMs is insufficient, yielding an 88.73% F1-score on UVSD and a 77.48% F1-score on RSL, which demonstrates the need to combine task-specific dedicated solutions with the world knowledge provided by LMMs.