Multimodal fusion diagnosis of depression and anxiety based on CNN-LSTM model
Wanqing Xie, Chen Wang, Zhixiong Lin, Xudong Luo, Wenqian Chen, Manzhu Xu, Lizhong Liang, Xiaofeng Liu, Yanzhong Wang, Hui Luo, Mingmei Cheng
DOI: https://doi.org/10.1016/j.compmedimag.2022.102128
IF: 7.422
2022-12-01
Computerized Medical Imaging and Graphics
Abstract:
Background: In recent years, a growing number of people have suffered from depression and anxiety. The symptoms are hard to spot and can be very dangerous. Currently, the Self-Reported Anxiety Scale (SAS) and Self-Reported Depression Scale (SDS) are commonly used for initial screening for depression and anxiety disorders. However, the information contained in these two scales is limited, while subjects' symptoms are varied and complex, which leads to inconsistency between the questionnaire results and the clinician's diagnosis. To fully mine the scale data, we propose a method that extracts features from facial expressions and movements captured in video recorded while subjects fill in the scales. We then combine the facial expression, movement, and scale information in a multimodal framework to improve the accuracy and robustness of the diagnosis of depression and anxiety.
Methods: We collect the subjects' scale results and the videos recorded while they fill in the scales. Given the two scales, SAS and SDS, we construct a model with two branches, one processing the multimodal data of SAS and the other that of SDS. In each branch, we first build a convolutional neural network (CNN) to extract the facial expression features from each frame. Second, we establish a long short-term memory (LSTM) network to further embed the facial expression features and build connections between frames, so that movement features in the video can be generated. Third, we transform the scale scores into one-hot format and feed them into the corresponding branch of the network to further mine the information in the multimodal data. Finally, we fuse the embeddings of the two branches to generate inference results for depression and anxiety.
Results and conclusions: Building on the SAS and SDS scores, our multimodal model further mines the video information and reaches an accuracy of 0.946 in diagnosing depression and anxiety. This study demonstrates the feasibility of using our CNN-LSTM-based multimodal model for initial screening and diagnosis of depression and anxiety disorders with high diagnostic performance.
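The scale-encoding step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each scale has 20 items answered on a 1 to 4 scale (the usual layout of the SAS and SDS questionnaires), and it assumes that a branch combines the one-hot scale vector with its video embedding by simple concatenation; the abstract does not state the fusion operator. The names `one_hot_scores` and `fuse_branch`, the 8-dimensional placeholder embedding, and the example answers are all hypothetical.

```python
from typing import List

NUM_ITEMS = 20   # assumption: 20 items per scale
NUM_LEVELS = 4   # assumption: each item is answered on a 1-4 scale


def one_hot_scores(scores: List[int], num_levels: int = NUM_LEVELS) -> List[int]:
    """Turn per-item answers (1..num_levels) into one concatenated one-hot vector."""
    encoded: List[int] = []
    for score in scores:
        if not 1 <= score <= num_levels:
            raise ValueError(f"score {score} is outside 1..{num_levels}")
        row = [0] * num_levels
        row[score - 1] = 1          # position 0 encodes answer 1, and so on
        encoded.extend(row)
    return encoded


def fuse_branch(video_embedding: List[float], scores: List[int]) -> List[float]:
    """Concatenate a video embedding with the one-hot scale vector (assumed fusion)."""
    return list(video_embedding) + [float(v) for v in one_hot_scores(scores)]


# Made-up inputs: 20 answers and a placeholder standing in for the LSTM output.
sas_answers = [1, 2, 4, 3] * 5
sas_video_embedding = [0.0] * 8
sas_branch = fuse_branch(sas_video_embedding, sas_answers)
print(len(sas_branch))  # 8 + 20 * 4 = 88
```

In a full model, the SAS and SDS branch vectors built this way would be fused again before the final classifier; the sketch stops at a single branch.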
Engineering, Biomedical; Radiology, Nuclear Medicine & Medical Imaging