Abstract: Text classification is an important and classic application in natural language processing (NLP). Recent studies have shown that graph neural networks (GNNs) perform well on tasks with rich structural relationships and serve as effective transductive learning approaches. Text representation learning methods based on large-scale pretraining can learn implicit but rich semantic information from text. However, few studies have comprehensively utilized both contextual semantic information and structural information for Chinese text classification. Moreover, existing GNN methods for text classification do not consider whether their graph construction methods are suitable for long or short texts. In this work, we propose Chinese-BERTology-wwm-GCN, a framework that combines the Chinese bidirectional encoder representations from transformers (BERT) series of models with whole word masking (Chinese-BERTology-wwm) and the graph convolutional network (GCN) for Chinese text classification. When building the text graph, we use documents and words as nodes to construct a single heterogeneous graph for the entire corpus. Specifically, we use term frequency-inverse document frequency (TF-IDF) values as the word-document edge weights. For long text corpora, we propose an improved pointwise mutual information (PMI*) measure that takes the co-occurrence distance between words into account and use it as the weight of word-word edges. For short text corpora, the co-occurrence information between words is often limited, so we instead use cosine similarity as the word-word edge weight. During the training stage, we combine the cross-entropy and hinge losses to jointly train Chinese-BERTology-wwm and the GCN. Experiments show that our proposed framework significantly outperforms the baselines on three Chinese benchmark datasets and achieves good performance even with limited labeled training data.
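The graph construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`tfidf_edges`, `pmi_edges`, `cosine_edges`), the sliding-window size, and the similarity threshold are all hypothetical choices made for illustration. In particular, `pmi_edges` computes standard window-based PMI, not the distance-aware PMI* proposed in the paper, whose exact form the abstract does not give. Documents are assumed to be already segmented into word tokens.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_edges(docs):
    """Word-document edges: {(doc_index, word): tf * log(N / df)}."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    edges = {}
    for i, doc in enumerate(docs):
        for w, c in Counter(doc).items():
            edges[(i, w)] = (c / len(doc)) * math.log(n_docs / df[w])
    return edges


def pmi_edges(docs, window=3):
    """Word-word edges from standard PMI over sliding windows.

    Assumption: plain PMI, not the paper's PMI*. Only positive PMI
    values are kept; keys are alphabetically sorted word pairs.
    """
    windows = []
    for doc in docs:
        if len(doc) <= window:
            windows.append(doc)
        else:
            windows.extend(doc[i:i + window]
                           for i in range(len(doc) - window + 1))
    n_win = len(windows)
    word_count, pair_count = Counter(), Counter()
    for win in windows:
        uniq = sorted(set(win))
        word_count.update(uniq)
        pair_count.update(combinations(uniq, 2))
    edges = {}
    for (a, b), c in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ) with window-level counts
        pmi = math.log(c * n_win / (word_count[a] * word_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges


def cosine_edges(vectors, threshold=0.5):
    """Word-word edges from cosine similarity of word vectors.

    Assumption: `vectors` maps word -> list of floats (for example,
    pretrained embeddings); the threshold is a hypothetical cutoff.
    """
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        va, vb = vectors[a], vectors[b]
        dot = sum(x * y for x, y in zip(va, vb))
        norm = (math.sqrt(sum(x * x for x in va))
                * math.sqrt(sum(y * y for y in vb)))
        if norm > 0 and dot / norm >= threshold:
            edges[(a, b)] = dot / norm
    return edges


if __name__ == "__main__":
    # Toy corpus; tokens stand in for segmented Chinese words.
    corpus = [["stock", "market", "up"],
              ["stock", "market", "down"],
              ["match", "team", "win"]]
    print(tfidf_edges(corpus))
    print(pmi_edges(corpus, window=3))
```

In this sketch a long-text corpus would use `pmi_edges` for word-word weights and a short-text corpus would use `cosine_edges`, mirroring the split described above; both would share `tfidf_edges` for word-document weights.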

Chinese text classification by combining Chinese-BERTology-wwm and GCN
