Enhancing Depressive Post Detection in Bangla: A Comparative Study of TF-IDF, BERT and FastText Embeddings

Saad Ahmed Sazan, Mahdi H. Miraz, A B M Muntasir Rahman
2024-07-12
Abstract: Due to the massive adoption of social media, detecting users' depression through social media analytics bears significant importance, particularly for underrepresented languages such as Bangla. This study introduces a well-grounded approach to identifying depressive social media posts in Bangla by employing advanced natural language processing techniques. The dataset used in this work, annotated by domain experts, includes both depressive and non-depressive posts, ensuring high-quality data for model training and evaluation. To address the prevalent issue of class imbalance, we utilised random oversampling for the minority class, thereby enhancing the model's ability to accurately detect depressive posts. We explored various numerical representation techniques, including Term Frequency-Inverse Document Frequency (TF-IDF), Bidirectional Encoder Representations from Transformers (BERT) embedding and FastText embedding, by integrating them with a deep learning-based Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) model. The results obtained through extensive experimentation indicate that the BERT approach outperformed the others, achieving an F1-score of 84%. This indicates that BERT, in combination with the CNN-BiLSTM architecture, effectively recognises the nuances of Bangla texts relevant to depressive content. Comparative analysis with the existing state-of-the-art methods demonstrates that our approach with BERT embedding performs better than others in terms of evaluation metrics and the reliability of dataset annotations. Our research contributes significantly to the development of reliable tools for detecting depressive posts in the Bangla language. By highlighting the efficacy of different embedding techniques and deep learning models, this study paves the way for improved mental health monitoring through social media platforms.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the challenge of detecting depression-related posts in Bangla. Specifically, the authors focus on identifying users' depressive emotions through social media analysis, especially for under-represented languages like Bangla. The research aims to improve the recognition of depressive Bangla posts by applying advanced natural language processing techniques.

The paper uses a dataset annotated by domain experts, containing both depressive and non-depressive posts, to ensure high-quality data for model training and evaluation. To address class imbalance, the authors apply random oversampling to increase the number of minority-class samples, thereby improving the model's ability to detect depressive posts accurately.

The research explores multiple numerical representation techniques, including term frequency-inverse document frequency (TF-IDF), BERT embeddings and FastText embeddings, and combines each with a deep-learning-based convolutional neural network-bidirectional long short-term memory (CNN-BiLSTM) model. The experimental results show that the BERT approach outperforms the other methods, achieving an F1-score of 84%, which indicates that BERT combined with the CNN-BiLSTM architecture can effectively identify depression-related content in Bangla texts.

### Main contributions of the paper:

1. **Effectively handling class imbalance**: Random oversampling compensates for the shortage of minority-class samples in the dataset, so that the model performs better when predicting the minority class.
2. **Improving the performance of text representation techniques**: The research shows that although TF-IDF can effectively capture key features, BERT embeddings provide a more comprehensive understanding of the text, especially in capturing the subtle semantics of depressive Bangla posts.
3. **Proposing a novel custom-made CNN-BiLSTM model**: This model combines a convolutional neural network (CNN) and a bidirectional long short-term memory network (BiLSTM), and is able to capture both local patterns and long-term dependencies, thereby achieving high-precision prediction of depressive posts.

### Formula explanations:

- **TF-IDF formula**:
  - \( \text{TF}(t, d)=\frac{\text{Number of times } t \text{ appears in document } d}{\text{Total number of terms in document } d} \)
  - \( \text{IDF}(t)=\log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) \)
  - \( \text{TF-IDF}(t, d)=\text{TF}(t, d)\times \text{IDF}(t) \)

Through these methods and techniques, the paper makes an important contribution to the development of reliable tools for detecting depressive posts on Bangla social media, which helps to improve mental health monitoring.
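
The TF-IDF definitions given in the formula explanations can be checked on a toy corpus. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the tiny English corpus, and the choice to return 0 for terms that appear in no document are all assumptions made for illustration. Library implementations often use smoothed variants of IDF, so their numbers may differ from this textbook form.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the number of terms in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log(total documents / documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: an unseen term gets weight 0 instead of raising an error
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus (English tokens for readability; not the paper's data).
corpus = [
    ["i", "feel", "sad", "today"],
    ["i", "feel", "happy", "today"],
    ["i", "like", "the", "weather"],
]

if __name__ == "__main__":
    for term in ("sad", "feel", "i"):
        print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In this corpus "sad" occurs in one document, "feel" in two, and "i" in all three, so "sad" receives the highest weight in the first document and "i" receives a weight of zero. This is the behaviour the formulas are designed to produce: terms that are frequent in a document but rare across the collection are weighted most heavily.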
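
Random oversampling, as described in the summary, can be illustrated with a short sketch. This is an assumed, simplified version written with the Python standard library only: the function name `random_oversample`, the fixed seed, and the example posts are hypothetical, and the summary does not state which tool or settings the authors actually used.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())

    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        pool = [t for t, y in zip(texts, labels) if y == cls]
        # Sample with replacement; k is 0 for the majority class, giving an empty list.
        extra = rng.choices(pool, k=target - n)
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels


if __name__ == "__main__":
    # Hypothetical placeholder posts: label 1 = depressive (minority), 0 = non-depressive.
    texts = ["post a", "post b", "post c", "post d", "post e"]
    labels = [0, 0, 0, 0, 1]
    balanced_texts, balanced_labels = random_oversample(texts, labels)
    print(Counter(balanced_labels))
```

Oversampling is normally applied only to the training split, so that duplicated posts do not leak into the evaluation data; the summary does not say how the authors handled this.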