Research on parallel corpus classification based on pre-trained model.

Yongliang Huang, Shulin Yang, Meiqi Zhou, Jiao Peng, Xiang Li
DOI: https://doi.org/10.1145/3565291.3565328
2022-01-01
Abstract: At present, the data in most text classification tasks are in a single language, but in a Chinese-English parallel corpus the information in both languages can be fully utilized. A classification model combining the text features of the pre-trained models ERNIE and BERT is proposed. ERNIE is used to process the Chinese corpus, and BERT is used to process the English corpus. TextCNN is used to fuse the text feature vectors, so that the classification performance on the parallel corpus can be improved. Comparative experiments were performed on the data set. The results show that this method achieves better classification performance on the parallel corpus.
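The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two feature sequences are combined, so this sketch assumes they share a feature dimension and are concatenated along the token axis before a TextCNN-style layer (filters of several widths, ReLU, max-over-time pooling, linear classifier). The ERNIE and BERT outputs are replaced by random stand-in vectors, and all names, sizes, and weights here are hypothetical.

```python
import random

def conv_max_pool(seq, kernel, bias):
    """Slide one filter (k x d weights) over seq (n x d), apply ReLU, max-pool over time."""
    k = len(kernel)
    best = 0.0  # starting at 0 makes the result equal to max(ReLU(activation))
    for start in range(len(seq) - k + 1):
        act = bias
        for offset in range(k):
            act += sum(x * w for x, w in zip(seq[start + offset], kernel[offset]))
        best = max(best, act)
    return best

def fuse_with_textcnn(zh_feats, en_feats, filters):
    """Assumed fusion: concatenate both token sequences, then pool every filter."""
    seq = zh_feats + en_feats
    return [conv_max_pool(seq, kernel, bias) for kernel, bias in filters]

def classify(pooled, weights, biases):
    """Linear layer mapping the pooled feature vector to class logits."""
    return [sum(p * w for p, w in zip(pooled, row)) + b
            for row, b in zip(weights, biases)]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim, num_classes = 8, 3                  # toy sizes, not taken from the paper
zh_feats = random_matrix(rng, 6, dim)    # stand-in for ERNIE token features (6 tokens)
en_feats = random_matrix(rng, 5, dim)    # stand-in for BERT token features (5 tokens)

# Two untrained filters for each kernel width 2, 3, and 4.
filters = [(random_matrix(rng, k, dim), 0.0) for k in (2, 3, 4) for _ in range(2)]

pooled = fuse_with_textcnn(zh_feats, en_feats, filters)
weights = random_matrix(rng, num_classes, len(pooled))
logits = classify(pooled, weights, [0.0] * num_classes)
predicted = max(range(num_classes), key=lambda c: logits[c])
```

In a real system the stand-in matrices would be replaced by the encoders' token-level outputs and the filter and classifier weights would be learned. Concatenating along the token axis is only one possible reading of "fuse"; concatenating pooled vectors from two separate TextCNN branches would be another.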