Connecting the Dots: Inferring Patent Phrase Similarity with Retrieved Phrase Graphs

Zhuoyi Peng, Yi Yang
2024-03-25
Abstract: We study the patent phrase similarity inference task, which measures the semantic similarity between two patent phrases. As patent documents employ legal and highly technical language, existing semantic textual similarity methods that use localized contextual information do not perform satisfactorily in inferring patent phrase similarity. To address this, we introduce a graph-augmented approach to amplify the global contextual information of the patent phrases. For each patent phrase, we construct a phrase graph that links to its focal patents and a list of patents that are either cited by or cite these focal patents. The augmented phrase embedding is then derived from combining its localized contextual embedding with its global embedding within the phrase graph. We further propose a self-supervised learning objective that capitalizes on the retrieved topology to refine both the contextualized embedding and the graph parameters in an end-to-end manner. Experimental results from a unique patent phrase similarity dataset demonstrate that our approach significantly enhances the representation of patent phrases, resulting in marked improvements in similarity inference in a self-supervised fashion. Substantial improvements are also observed in the supervised setting, underscoring the potential benefits of leveraging retrieved phrase graph augmentation.
Computer Science
What problem does this paper attempt to address?
This paper addresses the task of patent phrase similarity inference. Because patent documents use legal and highly technical language, existing semantic textual similarity methods that rely on local contextual information (such as Sentence-BERT or SimCSE) perform poorly at inferring patent phrase similarity. Obtaining a large number of expert annotations for supervised training is also difficult: it is costly and requires in-depth knowledge of the patent domain.

To address these problems, the authors propose a retrieval-augmented, graph-based method for learning representations of patent phrases. By adding global contextual information to each phrase, the method improves patent phrase similarity inference. Its main contributions are:

1. **Constructing Phrase Graphs**: For each patent phrase, construct a phrase graph that links the phrase to its focal patents and to the patents that cite or are cited by those focal patents. The representation of a phrase then depends on global context as well as local context.
2. **Self-Supervised Learning Objective**: Use the topology of the phrase graph to define a self-supervised learning objective that refines the contextualized text embeddings and the graph parameters jointly, in an end-to-end manner. This addresses the scarcity of labels.
3. **Experimental Verification**: Experiments on a unique patent phrase similarity dataset show that the method enhances the representation of patent phrases and yields improvements in both the self-supervised and the supervised settings.

### Formula and Symbol Explanation

- \( V \) represents the set of \( N \) patents, where the \( i \)-th patent is denoted \( v_i\in V \).
- \( U \) represents the set of \( M \) phrases, where the \( j \)-th phrase is denoted \( u_j\in U \).
- \( E_c\in\mathbb{R}^{N\times N} \) is an adjacency matrix representing the citation relationships between patents: \( E_c(i, j) = 1 \) if patent \( v_i \) cites patent \( v_j \), and 0 otherwise.
- \( E_r\in\mathbb{R}^{N\times M} \) is an adjacency matrix representing the relationships between patents and phrases: \( E_r(i, j) = 1 \) if phrase \( u_j \) appears in patent \( v_i \), and 0 otherwise.
- \( G_u=(U_u, V_u, E_r^u, E_c^u) \) represents the ego-graph (local sub-graph) of phrase \( u \).
- \( f(u)\in\mathbb{R}^d \) represents the text embedding of phrase \( u \).
- \( g(G_u)\in\mathbb{R}^d \) represents the ego-graph embedding of phrase \( u \).
- \( \phi(u)=f(u)\oplus g(G_u) \) represents the retrieval-augmented phrase embedding, where \( \oplus \) represents element-wise addition.

### Conclusion

By introducing global contextual information and a self-supervised learning objective, the proposed method improves patent phrase similarity inference. The experimental results show that it outperforms existing methods in both the self-supervised and the supervised settings, which suggests it is useful for patent analysis.
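
The notation in the Formula and Symbol Explanation section can be made concrete with a small sketch. It builds toy versions of \( E_c \) and \( E_r \), retrieves the patents in a phrase's ego-graph (focal patents plus the patents that cite or are cited by them), and forms \( \phi(u) = f(u) \oplus g(G_u) \) by element-wise addition. Everything here is illustrative: the data, the function names (`ego_graph_patents`, `mean_pool`, `augmented_embedding`, `cosine`), and the patent vectors are hypothetical, and mean pooling is only a stand-in for the paper's learned graph encoder \( g \), which this summary does not specify.

```python
import math

# Toy data (hypothetical): N = 4 patents, M = 2 phrases.
# E_c[i][j] == 1 means patent i cites patent j.
E_c = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
# E_r[i][j] == 1 means phrase j appears in patent i.
E_r = [
    [1, 0],
    [0, 0],
    [0, 1],
    [0, 0],
]
# Hypothetical d = 2 embeddings for patents and phrases.
patent_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
phrase_vecs = [[1.0, 0.0], [0.0, 1.0]]  # f(u) for each phrase


def ego_graph_patents(j, E_r, E_c):
    """Return (focal, linked) patent index sets for phrase j.

    focal  : patents that contain the phrase.
    linked : non-focal patents that cite, or are cited by, a focal patent.
    """
    n = len(E_c)
    focal = {i for i in range(n) if E_r[i][j] == 1}
    linked = set()
    for i in focal:
        for k in range(n):
            if E_c[i][k] == 1 or E_c[k][i] == 1:
                linked.add(k)
    return focal, linked - focal


def mean_pool(vectors, dim):
    """Average a list of vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[t] for v in vectors) / len(vectors) for t in range(dim)]


def augmented_embedding(j, phrase_vecs, patent_vecs, E_r, E_c):
    """phi(u) = f(u) + g(G_u), with g approximated by mean pooling (assumption)."""
    f_u = phrase_vecs[j]
    focal, linked = ego_graph_patents(j, E_r, E_c)
    members = sorted(focal | linked)
    g_u = mean_pool([patent_vecs[i] for i in members], len(f_u))
    return [a + b for a, b in zip(f_u, g_u)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


phi_0 = augmented_embedding(0, phrase_vecs, patent_vecs, E_r, E_c)
phi_1 = augmented_embedding(1, phrase_vecs, patent_vecs, E_r, E_c)
print(cosine(phrase_vecs[0], phrase_vecs[1]))  # text-only similarity
print(cosine(phi_0, phi_1))                    # graph-augmented similarity
```

In this toy setup the two phrases have orthogonal text embeddings, so their text-only cosine similarity is 0, while the graph-augmented embeddings share context from their neighboring patents and score higher. This mirrors the motivation in the summary, but it says nothing about the magnitude of the gains reported in the paper.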