Research of Text Classification Based on TF-IDF and CNN-LSTM

Hai Zhou
DOI: https://doi.org/10.1088/1742-6596/2171/1/012021
2022-01-01
Journal of Physics: Conference Series
Abstract: With the rapid development of deep learning, many deep learning models have been widely used in Natural Language Processing (NLP). The Long Short-Term Memory (LSTM) network and the Convolutional Neural Network (CNN) can achieve high classification accuracy in text classification tasks. However, the high input dimension of text features and the large number of parameters that must be trained make these deep learning models slow to train. This paper uses Term Frequency-Inverse Document Frequency (TF-IDF) to remove features with lower weights and keep the key features in the text, obtains the corresponding word vectors from a Word2Vec model, and feeds them into a CNN-LSTM model. We compared the model with CNN, LSTM, and LSTM-attention methods and found that it significantly reduces model parameters and training time on both short and long text data sets. The model loses almost no accuracy on long texts but loses some accuracy on short texts. This paper also proposes fusing the original text features to compensate for the accuracy loss caused by TF-IDF feature extraction.
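The TF-IDF filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the paper selects which features to drop, so the `keep_ratio` parameter, the function names, and the toy documents below are all assumptions made for illustration. The sketch uses the plain textbook weighting (term frequency times the log of inverse document frequency) and covers only the filtering stage, not the Word2Vec lookup or the CNN-LSTM model.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {token: TF-IDF weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        })
    return weights


def keep_top_tokens(doc, weights, keep_ratio=0.5):
    """Keep the highest-weighted tokens, preserving the original word order.

    keep_ratio is a hypothetical knob, not a value taken from the paper.
    """
    n_keep = max(1, int(len(weights) * keep_ratio))
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(weights, key=lambda tok: (-weights[tok], tok))
    kept = set(ranked[:n_keep])
    return [tok for tok in doc if tok in kept]


# Toy corpus, invented for this example.
docs = [
    "the model reduces training time on the long text".split(),
    "the model loses accuracy on the short text".split(),
    "the cat sat on the mat".split(),
]
weights = tfidf_weights(docs)
filtered = [keep_top_tokens(d, w) for d, w in zip(docs, weights)]
```

Tokens that occur in every document, such as "the" and "on" here, receive a weight of zero and are dropped first, so the sequence passed to the embedding layer is shorter. This is consistent with the reduced training time the abstract reports, and it also suggests why short texts, which have fewer tokens to spare, lose more accuracy.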