Show Criminals’ True Color: Chinese Variant Toxic Text Restoration Based on Pointer-Generator Network

Li Wen, Pengfei Xue, Yi Shen, Wanmeng Ding, Min Zhang
DOI: https://doi.org/10.1007/978-981-97-5606-3_12
2024-01-01
Abstract: Chinese toxic text detection is an important task for reducing harmful activities in social networks. However, criminals increasingly use variant characters that change the form of the text, enabling the dissemination of illegal information. Existing models identify these variant toxic texts poorly because the variants significantly disrupt semantic information and interfere with machine comprehension. The abuse of variant texts therefore poses a new challenge for governance. To bridge this gap, we propose the Chinese Variant Text Restoration (CVTR) task, which aims to restore variant texts to their original form. We frame the task as an end-to-end translation task, mirroring how humans understand variant toxic text: reading, reconstructing, and comprehending. In this paper, we propose a novel approach for restoring Chinese Variant Toxic Texts (CVTT) using the pointer-generator network (PGN). First, we analyze the characteristics of CVTT and summarize eight commonly used variant methods. Second, we introduce the pointer-generator network to help the sequence-to-sequence model restore the original semantics. We evaluate our method with ROUGE and BLEU, assessing its effectiveness at the semantic level and the character level, respectively. Our work sheds light on how criminals spread harmful information.
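To make the role of the pointer-generator network concrete, the sketch below shows the output mixture used in the standard pointer-generator formulation: the final probability of a token is p_gen times its vocabulary probability plus (1 - p_gen) times the attention mass on source positions holding that token. The abstract does not give the paper's exact architecture, so this is an illustration under assumptions rather than the authors' implementation. The function name, the toy probabilities, and the example variant "v信" (standing in for "微信") are all hypothetical.

```python
from collections import defaultdict


def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix the generation and copy distributions for one decoding step.

    p_gen         -- probability of generating from the vocabulary (0..1)
    vocab_dist    -- dict mapping token -> decoder softmax probability
    attention     -- attention weights over source positions (sum to 1)
    source_tokens -- source tokens, aligned with `attention`
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must have equal length")

    final = defaultdict(float)
    # Generation path: scale every vocabulary probability by p_gen.
    for token, prob in vocab_dist.items():
        final[token] += p_gen * prob
    # Copy path: add the remaining mass to tokens that appear in the source,
    # including tokens outside the vocabulary (here the variant character "v").
    for weight, token in zip(attention, source_tokens):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Hypothetical step: the decoder has emitted "微" and now predicts the
# second character while attending mostly to "信" in the variant input.
dist = final_distribution(
    p_gen=0.4,
    vocab_dist={"微": 0.1, "信": 0.5, "心": 0.4},
    attention=[0.1, 0.9],
    source_tokens=["v", "信"],
)
best = max(dist, key=dist.get)
print(best)
```

In this toy step the copy path reinforces "信" because it appears unchanged in the source, while the generation path remains available for characters that the variant has replaced.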