Hierarchical multi-instance multi-label learning for Chinese patent text classification

Yunduo Liu, Fang Xu, Yushan Zhao, Zichen Ma, Tengke Wang, Shunxiang Zhang, Yuhao Tian
DOI: https://doi.org/10.1080/09540091.2023.2295818
2024-01-04
Connection Science
Abstract: To further enhance the accuracy of Chinese patent classification, this paper proposes a model that builds on the structure of the patent document, takes the patent claims as its subject, and uses multi-instance multi-label learning as its main method. First, the patent claims are divided into multiple independent texts, using the claim sequence number as the splitting token. For each patent, the claims are regarded as multiple instances, and the corresponding IPC codes serve as its multiple labels. Next, the concept of secondary_label is introduced following the composition rules of IPC, and the relationships between instances and multiple secondary_labels are mined through fully connected layers. To capture more comprehensive semantic information from the instances, BiGRU and self-attention are employed to enhance semantics and reduce information loss during training. Finally, max-pooling operations are used to obtain the predicted categories of each patent from the captured relationships between instances and labels at different hierarchical levels. Experimental results on the '2017 Chinese patent dataset' demonstrate that the multi-instance multi-label approach can effectively mine deeper relationships between patents and labels in classification tasks. As a result, the model significantly improves the accuracy of patent text classification.
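The bag construction described in the abstract (claims as instances, IPC codes as labels, plus a coarser secondary_label) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `split_claims`, `secondary_label`, and `build_bag` are hypothetical, the regular expression is one plausible way to split on claim sequence numbers, and taking the first three characters of an IPC code as the secondary_label is an assumption, since the abstract does not state which IPC level is used.

```python
import re

# A claim marker is a 1-3 digit number followed by ".", "．" or "、".
# The lookarounds skip decimals such as "2.5".
_CLAIM_MARK = re.compile(r"(?<![\d.])(\d{1,3})\s*[.．、]\s*(?!\d)")


def split_claims(text: str) -> list[str]:
    """Split a claims section into one string per claim.

    A marker is accepted only if its number is the next expected
    sequence number, so references like "claim 2." are ignored.
    """
    starts = []  # (marker_start, body_start) for each accepted marker
    expected = 1
    for m in _CLAIM_MARK.finditer(text):
        if int(m.group(1)) == expected:
            starts.append((m.start(), m.end()))
            expected += 1

    claims = []
    for i, (_, body_start) in enumerate(starts):
        body_end = starts[i + 1][0] if i + 1 < len(starts) else len(text)
        claims.append(text[body_start:body_end].strip())
    return claims


def secondary_label(ipc: str, prefix_len: int = 3) -> str:
    """Coarser label derived from an IPC code.

    ASSUMPTION: the prefix length (3, i.e. section + class such as "G06")
    is illustrative only; the paper may use a different IPC level.
    """
    return ipc.strip().upper()[:prefix_len]


def build_bag(claims_text: str, ipc_codes: list[str]) -> dict:
    """One patent as a multi-instance multi-label bag."""
    return {
        "instances": split_claims(claims_text),
        "labels": sorted(set(ipc_codes)),
        "secondary_labels": sorted({secondary_label(c) for c in ipc_codes}),
    }


example = (
    "1. A widget comprising a base. "
    "2. The widget of claim 1, wherein the base is 2.5 mm thick. "
    "3. The widget of claim 2."
)
bag = build_bag(example, ["G06F", "G06N", "H04L"])
```

Checking the sequence number against the expected next value is a design choice that keeps in-text claim references and decimal numbers from being mistaken for claim boundaries.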
computer science, artificial intelligence, theory & methods
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to improve the accuracy of Chinese patent text classification by proposing a Hierarchical Multi-Instance Multi-Label (HMM) learning method. Specifically, the paper attempts to address the following key issues:

1. **Utilizing Technical Information in Patent Applications**:
   - Most current patent classification methods overlook the technical information contained in the patent description, especially the claims section. This paper treats the patent claims as independent texts to extract technical information, thereby improving classification accuracy.
2. **Considering the Structure of IPC**:
   - The International Patent Classification (IPC) has a hierarchical structure, but existing research often ignores this aspect. This paper introduces the concept of "secondary_label" and utilizes the hierarchical structure of IPC to establish deeper connections between patent texts and labels, further enhancing classification precision.
3. **Enhancing Semantic Representation Capability**:
   - During training, the use of Bidirectional Gated Recurrent Units (Bi-GRU) and self-attention mechanisms enhances the semantic representation capability of the input text, reducing information loss.

### Research Contributions

1. **Establishing the Relationship Between Claims and IPC**:
   - A relationship mining model is proposed, which constructs a fully connected layer to establish the association between claims and IPC, thereby obtaining a more comprehensive connection between instances and labels.
2. **Proposing a Hierarchical Multi-Instance Multi-Label Learning Framework**:
   - By utilizing the hierarchical structure of IPC, the framework improves the accuracy of predicting higher-level categories through the association of information at the current level.

Through the above methods, this paper not only enriches the research content of patent text classification but also establishes deeper connections between patent texts and labels, contributing to the effectiveness of patent text classification.
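As a rough illustration of the aggregation step mentioned in the abstract, the sketch below max-pools per-claim (instance) scores into patent-level (bag) scores and thresholds them. It is a sketch under stated assumptions rather than the paper's implementation: the function names `max_pool_bag` and `predict_labels` are hypothetical, the 0.5 threshold is an assumption, and the scores are made-up numbers standing in for the outputs of the model's fully connected layers.

```python
def max_pool_bag(instance_scores: list[list[float]]) -> list[float]:
    """Collapse per-instance label scores into one score per label.

    instance_scores[i][j] is the score of label j for claim i.
    A patent gets a label's highest score over all of its claims.
    """
    if not instance_scores:
        raise ValueError("a bag needs at least one instance")
    return [max(column) for column in zip(*instance_scores)]


def predict_labels(
    instance_scores: list[list[float]],
    label_names: list[str],
    threshold: float = 0.5,  # ASSUMPTION: illustrative cut-off
) -> list[str]:
    """Return every label whose pooled score reaches the threshold."""
    bag_scores = max_pool_bag(instance_scores)
    return [name for name, s in zip(label_names, bag_scores) if s >= threshold]


# Made-up scores for a patent with two claims, at two label levels.
fine_scores = [[0.1, 0.8, 0.2], [0.7, 0.3, 0.1]]
coarse_scores = [[0.9, 0.2], [0.6, 0.1]]

fine_pred = predict_labels(fine_scores, ["G06F", "G06N", "H04L"])
coarse_pred = predict_labels(coarse_scores, ["G06", "H04"])
```

Max-pooling reflects the usual multi-instance assumption that a single strongly matching claim is enough to assign a label to the whole patent; the same pooling is applied here independently at each label level.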