A Context-Enhanced Transformer with Abbr-Recover Policy for Chinese Abbreviation Prediction

Kaiyan Cao, Deqing Yang, Jingping Liu, Jiaqing Liang, Yanghua Xiao, Feng Wei, Baohua Wu, Quan Lu
DOI: https://doi.org/10.1145/3511808.3557074
2022-01-01
Abstract: Chinese abbreviation prediction is important for natural language processing tasks such as query understanding and entity linking, since people tend to use a concise abbreviation rather than the full form (name) to mention an entity. Existing models make their predictions through sequence labeling, i.e., binary classification of each character (token) of the full form. However, they leverage only the semantics of the entity itself, overlooking both the label dependencies between tokens and the rich information in entity-related texts. In this paper, we propose a Context-Enhanced Transformer with Abbr-Recover policy, namely CETAR, for Chinese abbreviation prediction. CETAR predicts the abbreviation sequence mainly through an iterative decoding process, in which each round consists of an abbreviation operation and a recovery operation. Our extensive experiments on both general-field and domain-specific datasets show that CETAR outperforms the state-of-the-art baselines, including sequence labeling models and sequence generation models. Moreover, we have constructed a Chinese abbreviation dataset from the well-known travel website Fliggy and shared it at https://github.com/tolerancecky/abbr-0731. An online A/B test on the Fliggy search system shows that the predicted abbreviations yield a 2.03% improvement in conversion rate.
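To make the sequence-labeling formulation mentioned in the abstract concrete, the following is a minimal sketch of how a keep/drop label per character maps a full form to an abbreviation, and how such labels could be derived from a known pair. It is not the CETAR model and not the authors' code: the function names `apply_labels` and `labels_from_abbreviation`, the greedy left-to-right alignment, and the example pair 北京大学 → 北大 are illustrative assumptions rather than material taken from the paper or its dataset.

```python
from typing import List


def apply_labels(full_form: str, labels: List[int]) -> str:
    """Build an abbreviation by keeping the characters labeled 1.

    Hypothetical helper: label 1 means "keep this character",
    label 0 means "drop it".
    """
    if len(full_form) != len(labels):
        raise ValueError("one label is required per character")
    return "".join(ch for ch, keep in zip(full_form, labels) if keep == 1)


def labels_from_abbreviation(full_form: str, abbreviation: str) -> List[int]:
    """Derive per-character binary labels from a (full form, abbreviation) pair.

    Assumption: the abbreviation is a subsequence of the full form, and
    a greedy left-to-right match is an acceptable alignment.
    """
    labels = [0] * len(full_form)
    pos = 0  # index of the next abbreviation character to match
    for i, ch in enumerate(full_form):
        if pos < len(abbreviation) and ch == abbreviation[pos]:
            labels[i] = 1
            pos += 1
    if pos != len(abbreviation):
        raise ValueError("abbreviation is not a subsequence of the full form")
    return labels


# Illustrative pair (assumed, not from the paper's dataset).
labels = labels_from_abbreviation("北京大学", "北大")
abbr = apply_labels("北京大学", labels)
```

Each label in this sketch is assigned independently of the others, which illustrates the limitation the abstract points out: a per-character classifier has no built-in notion of label dependencies between tokens.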