Sentiment Classification: Review of Text Vectorization Methods: Bag of Words, Tf-Idf, Word2vec and Doc2vec

M. Umar, Haisal Dauda Abubakar
DOI: https://doi.org/10.56471/slujst.v4i.266
2022-08-20
Abstract: In sentiment analysis there are three (3) approaches, namely machine learning, lexicon-based and rule-based approaches. This study investigates machine learning approaches, which involve text vectorization or word embedding, an essential step in natural language processing tasks since most machine learning algorithms work with numerical input. Text vectorization is the representation or mapping of the words or documents of a corpus to vectors of real numbers. There are several approaches to document/text representation in the literature; this study focuses on four (4) commonly used ones, viz. bag of words, TF-IDF, word2vec and doc2vec, and tries to identify the reasons behind their common use, as a review and recommendation for researchers in a hurry. The review shows that TF-IDF feature vector representations generally outperform the other two (2) vectorization methods, word2vec and doc2vec, specifically in book review sentiment classification. TF-IDF is therefore recommended for future studies on book review data sets.
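To make the idea of text vectorization concrete, the following is a minimal sketch of bag of words and TF-IDF in plain Python. It is an illustration only, not the authors' implementation: the function names (tokenize, bag_of_words, tf_idf), the whitespace tokenizer, the toy reviews, and the particular TF-IDF variant (term frequency as count divided by document length, inverse document frequency as log(N / df)) are all assumptions, since the abstract does not specify any of them.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


def bag_of_words(docs):
    """Map each document to a vector of raw term counts."""
    tokenized = [tokenize(d) for d in docs]
    # Sorted vocabulary gives every term a fixed column index.
    vocab = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight term counts by how rare each term is across documents.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    vectors = []
    for row in counts:
        total = sum(row)
        vectors.append(
            [(c / total) * w if total else 0.0 for c, w in zip(row, idf)]
        )
    return vocab, vectors


# Hypothetical toy "book reviews", for illustration only.
docs = ["good book", "bad book", "good good plot"]
vocab, bow = bag_of_words(docs)
_, weighted = tf_idf(docs)
```

Under this variant, a term that occurs in every document receives an inverse document frequency of log(1) = 0, so it contributes nothing to the vector. That is the sense in which TF-IDF down-weights words that do not help distinguish one document from another.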
Computer Science