An Improved Branch Entropy Based Method for Chinese New Words Detection.

Yan Guo, Yuying Zhu, Mingyang Hu, Shuiyuan Ding
DOI: https://doi.org/10.1109/IJCNN54540.2023.10191052
2023-01-01
Abstract: To improve the performance of Chinese new word detection, this paper enhances the traditional method in two respects. First, a new and more effective metric for branch entropy is chosen through an extensive empirical study of different metrics; in addition, punctuation marks are classified into two categories and assigned different weights according to their contribution to differentiating new words. This first step generates a set of candidate new words. Second, a multi-criteria Transformer-CRF segmentation probability model is introduced to filter out the noisy candidates. The new method is tested on the SIGHAN 2005 and 2008 Bakeoff data sets. The results show that, for the top 1000 to top 100 resulting words, the new metric yields an improvement of 2.8% to 9.05% over the commonly used minimum entropy metric, while punctuation processing yields an improvement of 3.26% to 3.77%. Experiments on the legal corpus show that the Transformer-CRF model further improves the top 1000 MAP results by 2.34% compared with the unsupervised method.
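To make the baseline concrete, the following is a minimal Python sketch of branch entropy scoring with the minimum entropy metric that the abstract names as the commonly used reference point. It does not reproduce the paper's improved metric or its punctuation weighting, neither of which is specified in the abstract. The function names (`entropy`, `branch_entropies`, `min_entropy_score`), the `^` and `$` boundary markers, the `max_len` parameter, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbor characters."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def branch_entropies(corpus, max_len=4):
    """Map each substring of up to max_len characters to (left, right) branch entropy."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for sentence in corpus:
        # "^" and "$" mark sentence boundaries (an illustrative assumption).
        padded = "^" + sentence + "$"
        n = len(padded)
        for i in range(1, n - 1):
            for j in range(i + 1, min(i + max_len, n - 1) + 1):
                cand = padded[i:j]
                left[cand][padded[i - 1]] += 1   # character just before the candidate
                right[cand][padded[j]] += 1      # character just after the candidate
    return {c: (entropy(left[c]), entropy(right[c])) for c in left}


def min_entropy_score(left_h, right_h):
    """Baseline metric: a candidate is only as free as its more constrained side."""
    return min(left_h, right_h)


# Hypothetical toy corpus; a real run would use a large unsegmented corpus.
corpus = ["我喜欢元宇宙", "他研究元宇宙吗", "元宇宙很火"]
scores = {c: min_entropy_score(*h) for c, h in branch_entropies(corpus).items()}
```

In this toy corpus, "元宇宙" appears with varied characters on both sides and so receives a high score, while the fragment "元宇" is always followed by "宙" and scores zero. Candidates would then be ranked by score, with the top-ranked ones passed on to a filtering stage.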