New Word Detection Based on Branch Entropy-Segmentation Probability Model

ZHU Yuying, GUO Yan, WAN Yizhao, TIAN Kai
DOI: https://doi.org/10.11896/jsjkx.220700074
2023-01-01
Computer Science
Abstract: As a basic task of Chinese natural language processing, new word detection is crucial for improving the performance of various downstream tasks. This paper proposes a new word detection method based on branch entropy and segmentation probability. The method first generates a candidate word set from the text based on branch entropy, and then calculates the segmentation probability of each candidate in order to filter out noisy candidates. Two models are proposed, depending on whether an annotated corpus related to the text to be processed is available. When no related segmented corpus exists, a multi-criteria Transformer-CRF model is trained on general segmented benchmark data sets. When a field-specific segmented corpus exists, a key-value memory neural network is introduced to fully extract the wordhood information. Experimental results show that, on legal texts, the multi-criteria Transformer-CRF model achieves a MAP of 54.00% over the top 900 detected words, which is 2.15% higher than that of the unsupervised method. With a segmented legal corpus, the key-value memory neural network outperforms the former model by a further 3.43%.
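The candidate generation step can be illustrated with a minimal sketch of branch entropy. This is not the authors' implementation: the function names, the character n-gram limit `max_len`, the `threshold` value, and the rule of keeping a candidate when the smaller of its left and right entropies exceeds the threshold are all assumptions made for illustration. The segmentation probability filter, which relies on the trained models, is not shown.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(text, max_len=4):
    """Return {ngram: (left_entropy, right_entropy)} for character n-grams.

    Branch entropy is the Shannon entropy of the characters that appear
    immediately to the left (or right) of an n-gram in the text.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    n = len(text)
    for i in range(n):
        for k in range(1, max_len + 1):
            j = i + k
            if j > n:
                break
            gram = text[i:j]
            if i > 0:
                left[gram][text[i - 1]] += 1
            if j < n:
                right[gram][text[j]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return sum(c / total * math.log(total / c) for c in counter.values())

    grams = set(left) | set(right)
    return {g: (entropy(left[g]), entropy(right[g])) for g in grams}


def candidate_words(text, max_len=4, threshold=0.5):
    """Keep n-grams (length >= 2) whose neighbours vary on both sides.

    The min(left, right) >= threshold rule is an illustrative assumption.
    """
    scores = branch_entropy(text, max_len)
    return {
        g for g, (h_left, h_right) in scores.items()
        if len(g) >= 2 and min(h_left, h_right) >= threshold
    }


if __name__ == "__main__":
    sample = "xabyzabwqabv"  # toy string: "ab" occurs with varied neighbours
    print(candidate_words(sample))
```

A string that recurs with many different neighbours on both sides behaves like a free-standing unit, so it scores high entropy on both sides; a fragment of a longer word is almost always followed or preceded by the same character and scores near zero.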