SOS-1K: A Fine-grained Suicide Risk Classification Dataset for Chinese Social Media Analysis

Hongzhi Qi, Hanfei Liu, Jianqiang Li, Qing Zhao, Wei Zhai, Dan Luo, Tian Yu He, Shuo Liu, Bing Xiang Yang, Guanghui Fu
2024-04-19
Abstract: On social media, users frequently express personal emotions, a subset of which may indicate potential suicidal tendencies. The implicit and varied forms of expression in internet language complicate accurate and rapid identification of suicidal intent on social media, thus creating challenges for timely intervention efforts. The development of deep learning models for suicide risk detection is a promising solution, but there is a notable lack of relevant datasets, especially in the Chinese context. To address this gap, this study presents a Chinese social media dataset designed for fine-grained suicide risk classification, focusing on indicators such as expressions of suicide intent, methods of suicide, and urgency of timing. Seven pre-trained models were evaluated on two tasks: binary classification of high versus low suicide risk, and fine-grained suicide risk classification on a scale of 0 to 10. In our experiments, deep learning models show good performance in distinguishing between high and low suicide risk, with the best model achieving an F1 score of 88.39%. However, the results for fine-grained suicide risk classification were still unsatisfactory, with a weighted F1 score of 50.89%. To address the issues of data imbalance and limited dataset size, we investigated both traditional and advanced, large language model-based data augmentation techniques, demonstrating that data augmentation can enhance model performance by up to 4.65 percentage points in F1 score. Notably, the Chinese MentalBERT model, which was pre-trained on psychological domain data, shows superior performance in both tasks. This study provides valuable insights for automatic identification of suicidal individuals, facilitating timely psychological intervention on social media platforms. The source code and data are publicly available.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses suicide risk classification on social media platforms. Specifically:

1. **Lack of Datasets**: There is a shortage of suicide risk classification datasets for Chinese social media, especially for fine-grained classification.
2. **Insufficient Classification Accuracy**: Existing deep learning methods perform well in coarse-grained classification but are less effective in fine-grained classification.
3. **Class Imbalance**: The number of samples differs significantly across suicide risk levels in the dataset, which makes model training difficult.

To address these issues, the researchers introduce a new dataset, SOS-1K, which contains suicide-related posts collected from Chinese social media platforms and labeled with 11 risk levels (0 to 10). The paper also evaluates seven pre-trained models on two tasks: fine-grained suicide risk classification and binary high/low risk classification. Data augmentation techniques (synonym replacement, back-translation, and data generation based on large language models) are used to mitigate the class imbalance problem and improve model performance.

Experimental results show that the Chinese MentalBERT model achieved a weighted F1 score of 55.54% in the fine-grained classification task and an F1 score of 88.39% in the binary high/low risk task. The 55.54% figure equals the 50.89% reported in the abstract plus the maximum augmentation gain of 4.65 percentage points, so it presumably reflects training with data augmentation. These findings highlight the advantages of domain-specific pre-trained models.
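
The fine-grained results above are reported as weighted F1, which averages per-class F1 scores in proportion to each class's sample count. The following is a minimal sketch of that metric in plain Python, shown next to macro F1 for contrast. It is not the authors' evaluation code, and the labels and predictions are made-up toy values chosen only to illustrate an imbalanced label distribution.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as positive and all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, c) * n / total
               for c, n in support.items())


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)


# Toy, hypothetical data: risk level 0 dominates; levels 5 and 10 are rare.
y_true = [0, 0, 0, 0, 0, 0, 5, 5, 10, 10]
y_pred = [0, 0, 0, 0, 0, 5, 5, 0, 10, 5]

print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
print(f"macro F1:    {macro_f1(y_true, y_pred):.4f}")
```

On this toy data the weighted score is higher than the macro score because the majority class is predicted well and dominates the weighting. With imbalanced risk levels, a weighted F1 can therefore hide weak performance on rare classes, which is one reason augmenting those classes can matter.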