Author Profiling in Code-Mixed WhatsApp Messages Using Stacked Convolution Networks and Contextualized Embedding Based Text Augmentation

Devi, V. Sharmila
DOI: https://doi.org/10.1007/s11063-022-10898-3
IF: 2.565
2022-07-16
Neural Processing Letters
Abstract:The increasing use of social media to communicate is an emerging trend compared to traditional phone calls and SMS. WhatsApp is one of the popular social messaging applications used in India. Identification of demographic features of authors in social media is known as author profiling. Author profiling is helpful for many applications such as forensics, security and marketing. Author profiling helps to identify fake profiles in social media. By analysing their WhatsApp messages in code-mixed Tamil, this paper focuses on identifying the socio-demographic appearance of author traits or features such as gender, age-group, marital and education status. Even though many studies have been conducted on Author Profiling in English and other resources-rich languages, the research on the Indian language is still nascent. This study is the first Author Profiling task for code-mixed Tamil on WhatsApp. As a part of this study, we have created the benchmark WhatsApp dataset in code-mixed Tamil language to develop the author profiling system. We propose a stacked Convolutional Network (CNN) combined with k-max pooling and Bidirectional Long Short Term Memory (BiLSTM) to enhance the classification performance of CNN. Multiple experiments have been conducted to demonstrate the effectiveness of the proposed model, including the comparison against existing models with diverse parameter settings. We have also incorporated Focal loss and context embedding based data augmentation to handle the data imbalance. The proposed model outperforms state-of-the-art deep learning models with a better performance.
computer science, artificial intelligence
What problem does this paper attempt to address?