Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis

Monika Arora, Vineet Kansal
DOI: https://doi.org/10.1007/s13278-019-0557-y
2019-03-18
Social Network Analysis and Mining
Abstract: On social media platforms such as Twitter and Facebook, people express their views, arguments, and emotions of many events in daily life. Twitter is an international microblogging service featuring short messages called “tweets” from different languages. These texts often consist of noise in the form of incorrect grammar, abbreviations, freestyle, and typographical errors. Sentiment analysis (SA) aims to predict the actual emotions from the raw text expressed by the people through the field of natural language processing (NLP). The main aim of our work is to process the raw sentence from the Twitter dataset and find the actual polarity of the message. This paper proposes a text normalization with deep convolutional character level embedding (Conv-char-Emb) neural network model for SA of unstructured data. This model can tackle the problems: (1) processing the noisy sentence for sentiment detection (2) handling small memory space in word level embedded learning (3) accurate sentiment analysis of the unstructured data. The initial preprocessing stage for performing text normalization includes the following steps: tokenization, out of vocabulary (OOV) detection and its replacement, lemmatization and stemming. A character-based embedding in convolutional neural network (CNN) is an effective and efficient technique for SA that uses less learnable parameters in feature representation. Thus, the proposed method performs both the normalization and classification of sentiments for unstructured sentences. The experimental results are evaluated in the Twitter dataset by a different point polarity (positive, negative and neutral). As a result, our model performs well in normalization and sentiment analysis of the raw Twitter data enriched with hidden information.
What problem does this paper attempt to address?
The paper addresses how to handle noisy text in Twitter data and still carry out accurate sentiment analysis. The authors propose a deep convolutional neural network model based on character-level embedding (Conv-char-Emb) that targets three problems:

1. **Handling noisy sentences for sentiment detection**: Tweets often contain grammar mistakes, abbreviations, free-style writing, and spelling errors, all of which reduce the accuracy of sentiment analysis. The noise therefore has to be dealt with before classification.
2. **Coping with limited memory in word-level embedding learning**: Word-level embedding methods require a large vocabulary and a correspondingly large amount of memory, which is especially noticeable with multi-language text. Character-level embedding reduces the memory required and improves the efficiency of the model.
3. **Conducting accurate sentiment analysis on unstructured data**: Unstructured data, such as social media text, is hard to process because it lacks a unified format and structure. Character-level embedding combined with a deep convolutional neural network extracts features more effectively and supports sentiment classification.

To achieve these goals, the paper proposes a method with the following steps:

- **Pre-processing stage**:
  - **Tokenization**: Split the input text into words or phrases.
  - **Out-of-vocabulary (OOV) detection and replacement**: Use multiple dictionaries (such as Microsoft Dictionary, SMS Dictionary, and Soundex Dictionary) to correct spelling mistakes and non-standard words.
  - **Lemmatization**: Reduce the different forms of a word to its base form.
  - **Stemming**: Reduce words further to their root forms.
- **Character-level embedding and deep convolutional neural network (CNN)**: Convert the text into vector representations with character-level embedding, then perform feature extraction and sentiment classification with a deep convolutional neural network.

With these methods, the paper aims to improve the accuracy and efficiency of extracting sentiment from noisy text, especially multi-language and unstructured data. The experimental results show that the model performs well on Twitter datasets and can carry out both text normalization and sentiment analysis effectively.
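
The pre-processing steps and the character-level input can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the tiny `VOCAB` and `SMS_DICT` tables, the crude suffix-stripping `stem` function (which stands in for both lemmatization and stemming), the alphabet, and all function names are hypothetical. The real pipeline uses full dictionaries, and the resulting character IDs would feed an embedding layer and a CNN, which are not shown here.

```python
import re

# Hypothetical stand-ins for the dictionaries mentioned in the summary.
VOCAB = {"you", "are", "watching", "great", "movies", "today"}
SMS_DICT = {"u": "you", "r": "are", "gr8": "great", "2day": "today"}


def tokenize(text):
    """Lowercase the tweet and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def replace_oov(tokens):
    """Replace out-of-vocabulary tokens that have a dictionary entry."""
    return [t if t in VOCAB else SMS_DICT.get(t, t) for t in tokens]


def stem(token):
    """Crude suffix stripping (assumption: not a real stemmer or lemmatizer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least four characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Tokenize, replace OOV tokens, then stem."""
    return [stem(t) for t in replace_oov(tokenize(text))]


# Character alphabet; ID 0 is shared by padding and unknown characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 '"
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(ALPHABET)}


def encode_chars(tokens, max_len=32):
    """Map normalized tokens to a fixed-length sequence of character IDs."""
    text = " ".join(tokens)[:max_len]
    ids = [CHAR_TO_ID.get(c, 0) for c in text]
    return ids + [0] * (max_len - len(ids))


tokens = normalize("U r watching gr8 movies 2day!!")
char_ids = encode_chars(tokens)
```

The character table has only a few dozen entries regardless of how many distinct words appear in the tweets, which is the intuition behind the memory saving attributed to character-level embedding.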