Semantic Similarity Matching for Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method

Liqiang Yu, Bo Liu, Qunwei Lin, Xinyu Zhao, Chang Che
2024-01-06
Abstract: In the realm of patent document analysis, assessing semantic similarity between phrases presents a significant challenge, notably amplifying the inherent complexities of Cooperative Patent Classification (CPC) research. Firstly, this study addresses these challenges, recognizing early CPC work while acknowledging past struggles with language barriers and document intricacy. Secondly, it underscores the persisting difficulties of CPC research.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses semantic similarity matching between phrases in patent documents, particularly within the Cooperative Patent Classification (CPC) system. Early research laid the foundation for CPC but also exposed limitations: language barriers, lack of precision, and the complexity of handling patent documents. Deep learning techniques have brought some progress in recent years, yet challenges remain in model scalability and data processing. To overcome these challenges and enhance the functionality of the CPC system, the paper proposes two key innovations:

1. **Ensemble Approach**: An ensemble framework combines four BERT-based models (including DeBERTaV3) and merges their outputs by weighted averaging, which improves the accuracy of semantic similarity assessment.
2. **Novel Text Preprocessing Method**: A text preprocessing method tailored to patent documents uses a distinctive input structure and scores each token, which helps capture semantic relationships in the CPC context.

Experimental results indicate that both the ensemble model and the new text processing strategy are effective on the US patent phrase matching dataset. Specifically, the ensemble model achieved a score of 0.8534 in cross-validation, showing its potential for improving the measurement of semantic similarity in patent documents.
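
The weighted averaging of model outputs described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `weighted_ensemble`, the weights, and the per-model scores are all hypothetical, and the summary does not say how the paper chooses its weights.

```python
import numpy as np


def weighted_ensemble(predictions, weights):
    """Combine per-model similarity scores by weighted averaging.

    predictions: array-like of shape (n_models, n_pairs), one row per model.
    weights: array-like of shape (n_models,), non-negative, not all zero.
    Returns an array of shape (n_pairs,) with the blended scores.
    """
    preds = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)
    if preds.ndim != 2 or w.shape != (preds.shape[0],):
        raise ValueError("expected one weight per row of predictions")
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()  # normalize so the weights sum to 1
    return w @ preds  # weighted sum over the model axis


# Hypothetical scores from four models for two phrase pairs (made-up values).
model_scores = [
    [0.50, 0.25],
    [0.75, 0.25],
    [0.50, 0.00],
    [1.00, 0.50],
]
# Hypothetical weights; in practice they might be tuned on validation folds.
model_weights = [0.4, 0.3, 0.2, 0.1]

ensemble_scores = weighted_ensemble(model_scores, model_weights)
print(ensemble_scores)
```

Normalizing the weights inside the function keeps the blended score in the same range as the individual model scores, so the weights can be supplied on any convenient scale.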