A Long-Text Classification Method of Chinese News Based on BERT and CNN

Xinying Chen, Peimin Cong, Shuo Lv
DOI: https://doi.org/10.1109/access.2022.3162614
IF: 3.9
2022-01-01
IEEE Access
Abstract: Text classification is an important research area in natural language processing (NLP) that has received considerable scholarly attention in recent years. However, real Chinese online news is characterized by long text, a large amount of information, and complex structure, all of which reduce the accuracy of Chinese long-text classification. To improve the accuracy of long-text classification of Chinese news, we propose a BERT-based local feature convolutional network (LFCN) model comprising four novel modules. First, to address the limit that Bidirectional Encoder Representations from Transformers (BERT) places on the maximum input sequence length, we propose a method named Dynamic LEAD-n (DLn), based on the traditional LEAD digest algorithm, to extract short texts from the long text. In the Text-Text Encoder (TTE) module, we use the BERT pretrained language model to produce sentence-level feature vector representations of a news text and to capture global features, using the attention mechanism to identify correlated words in the text. After that, we propose a CNN-based local feature convolution (LFC) module to capture local features in the text, such as key phrases. Finally, the feature vectors generated by the different operations over several different periods are fused and used to predict the category of a news text. Experimental results show that the new method further improves the accuracy of long-text classification of Chinese news.
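The abstract does not spell out how DLn selects text, so the following is only a minimal sketch of the traditional LEAD idea it builds on: keep the leading sentences of a document until a length budget is reached. The function names `split_sentences` and `lead_by_budget`, the punctuation set, and the 510-character budget (BERT's 512-token limit minus two special tokens, assuming roughly one token per Chinese character) are all assumptions for illustration, not the authors' method.

```python
import re

# Assumption: a sentence ends at one of these Chinese or ASCII marks.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def lead_by_budget(text, max_chars=510):
    """Hypothetical LEAD-style extractor: keep whole leading sentences
    while the running character count stays within max_chars."""
    kept, used = [], 0
    for sentence in split_sentences(text):
        if used + len(sentence) > max_chars:
            break
        kept.append(sentence)
        used += len(sentence)
    if not kept:
        # The first sentence alone exceeds the budget: hard-truncate it.
        return text[:max_chars]
    return "".join(kept)


if __name__ == "__main__":
    news = "今天下雨。明天晴天！后天呢？"
    print(lead_by_budget(news, max_chars=10))
```

Cutting at sentence boundaries rather than at a fixed character offset keeps each retained sentence intact, which is the property a LEAD-style digest relies on when only the start of a long article fits into the encoder.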
computer science, information systems; telecommunications; engineering, electrical & electronic