Depressformer: Leveraging Video Swin Transformer and fine-grained local features for depression scale estimation

Lang He, Zheng Li, Prayag Tiwari, Cui Cao, Jize Xue, Feng Zhu, Di Wu
DOI: https://doi.org/10.1016/j.bspc.2024.106490
IF: 5.1
2024-06-02
Biomedical Signal Processing and Control
Abstract: Background and Objective: By 2030, depression is projected to become the predominant mental disorder. With the rising prominence of depression, a large number of affective computing studies have appeared, the majority of which use audiovisual methods to estimate depression scales. Existing studies often overlook the potential patterns in sequential data and do not adopt the fine-grained features of the Transformer to model behavioral features for video-based depression recognition (VDR). Methods: To address the above-mentioned gaps, we present an end-to-end sequential framework called Depressformer for VDR. The framework consists of three components: the Video Swin Transformer (VST) for deep feature extraction, a depression-specific fine-grained local feature extraction (DFLFE) module, and a depression channel attention fusion (DCAF) module that fuses the latent local and global features. Using the VST as the backbone network makes it possible to discern pivotal features more effectively. The DFLFE module enriches this process by focusing on the nuanced local features indicative of depression. The DCAF module is introduced to enhance the modeling of the combined features pertinent to VDR. Results: Our method was extensively validated on the AVEC2013/2014 depression databases. It yields a root mean square error (RMSE) of 7.47 and a mean absolute error (MAE) of 5.49 on the first dataset (AVEC2013); on the second (AVEC2014), the corresponding values are 7.22 and 5.56. On the D-vlog dataset, the F1-score is 0.59. Conclusions: The experimental evaluations suggest that the Depressformer architecture delivers superior performance, with stability and adaptability across various tasks, making it capable of effectively identifying the severity of depression. Code will be released at: https://github.com/helang818/Depressformer/ .
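The two regression metrics reported in the abstract, RMSE and MAE, can be sketched as follows. This is a minimal illustration of the standard definitions, not the authors' evaluation code; the function names and the sample scores are made up for the example and are not taken from the paper or its datasets.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length score lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length score lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Hypothetical depression-scale scores (illustrative values only).
true_scores = [10, 25, 3, 40]
pred_scores = [12, 20, 6, 36]

print(f"RMSE: {rmse(true_scores, pred_scores):.2f}")
print(f"MAE:  {mae(true_scores, pred_scores):.2f}")
```

Because RMSE squares each error before averaging, it penalizes large individual misestimates more heavily than MAE, which is why the reported RMSE values are larger than the MAE values on both databases.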
engineering, biomedical