A Multimodal Approach for Detection and Assessment of Depression Using Text, Audio and Video

Wei Zhang, Kaining Mao, Jie Chen
DOI: https://doi.org/10.1007/s43657-023-00152-8
2024-05-03
Phenomics
Abstract: Depression is one of the most common mental disorders, and its prevalence increases each year. Traditional diagnostic methods rely primarily on professional judgment, which is prone to individual bias. It is therefore crucial to design an effective and robust method for automated depression detection. Current artificial intelligence approaches are limited in their ability to extract features from long sentences, and current models are less robust when the input dimensionality is large. To address these concerns, a multimodal fusion model comprising text, audio, and video was developed for both depression detection and assessment tasks. In the text modality, pre-trained sentence embeddings were used to extract semantic representations, and a bidirectional long short-term memory (BiLSTM) network was used to predict depression. In the audio modality, principal component analysis (PCA) was used to reduce the dimensionality of the input feature space, and a support vector machine (SVM) was used to predict depression. In the video modality, extreme gradient boosting (XGBoost) was employed for both feature selection and depression detection. The final predictions were obtained by combining the outputs of the different modalities with an ensemble voting algorithm. Experiments on the Distress Analysis Interview Corpus Wizard-of-Oz (DAIC-WOZ) dataset yielded a weighted F1 score of 0.85, a root mean square error (RMSE) of 5.57, and a mean absolute error (MAE) of 4.48. The proposed model outperforms the baseline in both the depression detection and assessment tasks, and was shown to perform better than other existing state-of-the-art depression detection methods. Supplementary information: The online version contains supplementary material available at 10.1007/s43657-023-00152-8.
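The fusion and evaluation steps named in the abstract can be illustrated with a minimal sketch. It assumes hard majority voting over binary per-modality labels, which the abstract does not confirm as the exact voting scheme used; the function names and all example values are hypothetical and are not taken from the paper or from DAIC-WOZ.

```python
from collections import Counter
from math import sqrt


def majority_vote(votes):
    """Return the most common label among the per-modality votes."""
    return Counter(votes).most_common(1)[0][0]


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = sum(t == label for t in y_true)
        total += f1 * support
    return total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted scores."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical binary predictions (1 = depressed) from each modality.
text_pred = [1, 0, 1, 0]
audio_pred = [1, 1, 0, 0]
video_pred = [0, 0, 1, 0]

# Fuse the three modalities participant by participant.
fused = [majority_vote(v) for v in zip(text_pred, audio_pred, video_pred)]

# Hypothetical ground-truth labels for the detection task.
labels = [1, 0, 1, 1]
detection_f1 = weighted_f1(labels, fused)

# Hypothetical severity scores for the assessment task.
true_scores = [10, 4, 15]
pred_scores = [12, 4, 11]
assessment_rmse = rmse(true_scores, pred_scores)
assessment_mae = mae(true_scores, pred_scores)
```

With three binary voters a tie cannot occur, so the vote is always well defined; a different number of modalities or more classes would need an explicit tie-breaking rule.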