Chinese Packaging Product Entity Recognition Based on Bidirectional GRU-CRF
Yibin LI, Huanhuan ZHANG
DOI: https://doi.org/10.14135/j.cnki.1006-3080.20180407001
2019-01-01
Received: 2018-04-11
About the author: Yibin LI (born 1992), male, from Shanghai, master's student; research interest: knowledge graphs. E-mail: lyberri@126.com
Corresponding author: Huanhuan ZHANG, E-mail: hzhang@ecust.edu.cn
Abstract: As the packaging industry grows, the naming conventions of packaging products have become increasingly diverse, so named entity recognition (NER) of these products is necessary for packaging information extraction. Chinese product names tend to have a complex composition and a long length, which makes them difficult to recognize in textual corpora. Based on a survey of current NER algorithms, we introduce an NER method for Chinese packaging products that uses a bidirectional GRU-CRF model. The GRU (gated recurrent unit) is an improved structure for the hidden-layer nodes of a recurrent neural network (RNN). In this model, a bidirectional GRU network stores and represents the contextual semantic information of each word, while the CRF models the transition probabilities within the output word label sequence. From a packaging vertical website, we gather textual documents such as news reports, announcements and regulations, and obtain word vectors as a pre-trained distributed representation of the domain glossary. After product entities in the text data are labeled automatically, word sequences are fed to the model as vectors. Finally, the best label sequence is generated to mark the product entities in each sentence. We evaluate our model on a Chinese packaging corpus against classical models and state-of-the-art RNN models. According to the results, our model achieves a precision of 82.45% and a recall of 80.38%. In a further series of contrast experiments on input vectors of different lengths, in both word-level and character-level representations, we find that the word-level representation fits the corpus and model used here better.
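Meanwhile, this method requires less manual feature engineering than traditional machine learning models such as CRF and HMM. It can therefore be concluded that the bidirectional GRU-CRF method with word-level distributed representation introduced in this paper is better suited to the Chinese packaging product recognition task.

Keywords: NER; bidirectional GRU; CRF; packaging product; deep learning

Intelligentization is an important direction for the new generation of information technology, following digitization and networking. With the continuous development of information technology, the packaging industry has also entered a highly intelligent era. In this environment, large numbers of product descriptions, user manuals and similar materials are presented as electronic documents. To provide users with better and more user-friendly services, it is necessary to take users' individual needs into account and to find valuable business information within this large and heterogeneous body of information. In the packaging industry, however, product names have a complex composition and a long length, for example "防静电透明 PVC 板棒" (anti-static transparent PVC sheet and rod) and "双通道连卷背心袋机" (dual-channel roll-type vest bag machine). This structure makes product entity recognition more complex and difficult than general entity recognition. To fully exploit the value of packaging industry information, and to lay a solid data foundation for downstream applications such as packaging industry knowledge graph construction and intelligent question answering, packaging product entity recognition is an indispensable step.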
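
The decoding step summarized in the abstract, where the CRF layer combines per-word scores with label-transition scores to produce the best label sequence, can be illustrated with a minimal Viterbi sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the BIO tag set `TAGS`, the function name `viterbi_decode`, and the toy score values. In the actual model, the emission scores would come from the bidirectional GRU and the transition scores would be learned during training.

```python
from typing import List

# Hypothetical BIO tag set for product entities (an assumption, not from the paper).
TAGS = ["O", "B-PRO", "I-PRO"]


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   : score of tag j at position t (e.g. produced by a BiGRU)
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters)
    """
    n_steps = len(emissions)
    n_tags = len(transitions)
    if n_steps == 0:
        return []

    # score[j] holds the best score of any path that ends in tag j so far.
    score = list(emissions[0])
    backptr: List[List[int]] = []

    for t in range(1, n_steps):
        new_score = []
        pointers = []
        for j in range(n_tags):
            # Best previous tag for arriving at tag j.
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backptr.append(pointers)

    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy scores for a four-word sentence (illustrative values only).
emissions = [
    [2.0, 0.5, 0.1],
    [0.2, 1.0, 1.2],  # locally, I-PRO scores slightly above B-PRO
    [0.1, 0.3, 2.0],
    [1.5, 0.2, 0.4],
]
transitions = [
    [0.5, 0.5, -10.0],  # from O: moving straight to I-PRO is heavily penalized
    [0.0, -1.0, 1.0],   # from B-PRO
    [0.5, -1.0, 0.5],   # from I-PRO
]

labels = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
print(labels)  # → ['O', 'B-PRO', 'I-PRO', 'O']
```

In this toy case, picking the highest-scoring tag for each word independently would give O, I-PRO, I-PRO, O, which is not a valid BIO sequence. The transition scores steer the decoder to B-PRO for the second word, which is the kind of label-sequence constraint the CRF layer is meant to capture.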