Patent-publication pairs for the detection of knowledge transfer from research to industry: reducing ambiguities with word embeddings and references

Klaus Lippert, Konrad U. Förstner
2024-12-02
Abstract: The performance of medical research can be viewed and evaluated not only from the perspective of publication output, but also from the perspective of economic exploitability. Patents can represent the exploitation of research results and thus the transfer of knowledge from research to industry. In this study, we set out to identify publication-patent pairs in order to use patents as a proxy for the economic impact of research. To identify these pairs, we matched scholarly publications and patents by comparing the names of authors and inventors. To resolve the ambiguities that arise in this name-matching process, we expanded our approach with two additional filter features, one used to assess the similarity of text content, the other to identify common references in the two document types. To evaluate text similarity, we extracted and transformed technical terms from a medical ontology (MeSH) into numerical vectors using word embeddings. We then calculated the results of the two supporting features over an example five-year period. Furthermore, we developed a statistical procedure which can be used to determine valid patent classes for the domain of medicine. Our complete data processing pipeline is freely available, from the raw data of the two document types right through to the validated publication-patent pairs.
Information Retrieval
What problem does this paper attempt to address?
This paper asks how to identify links between academic publications and patents more accurately, so that the translation of research results into economic activity can be assessed. Specifically, the authors match publications with patents in order to use patents as a proxy for the economic impact of research, and they add filters that reduce the ambiguity arising during name matching.

### Main problems:

1. **Assessing the economic impact of research**: Traditional research evaluation relies mainly on publication counts and the acquisition of external funding, but these measures do not fully reflect the economic value of research results in practical applications.
2. **Reducing the ambiguity of name matching**: When publications are matched with patents by name, different authors and inventors can share the same name, which produces incorrect matches. Additional filter features are therefore needed to improve matching accuracy.

### Solutions:

To address these problems, the authors propose the following measures:

- **Text content similarity comparison**: Convert medical terms (such as terms in the MeSH vocabulary) into numerical vectors using word embeddings and compute text similarity. Specifically, use the BERT model to generate a vector representation of each document and evaluate the content similarity of two documents by cosine similarity, where \(\mathbf{A}\) and \(\mathbf{B}\) are the two document vectors:

  \[ \text{Cosine Similarity}=\frac{\mathbf{A}\cdot\mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|} \]

- **Common reference analysis**: Identify references that a patent and a publication both cite, to further confirm that the two documents address the same research topic. This helps to resolve many-to-many relationships and improves matching accuracy.
- **Patent category filtering**: Develop a statistical method that automatically selects valid patent classes (IPC), so that matched patents and publications belong to the same field. The patent classes suitable for the medical domain are determined through Q-Q plot analysis of the patent class distributions of different subsets.

### Summary:

The core objective of this paper is to assess more accurately how research results are translated into economic activity by improving the method for matching academic publications with patents. By adding text similarity, common reference analysis, and patent category filtering, the authors reduce ambiguity in the matching process and provide a reproducible data processing pipeline.
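
As a rough illustration of the text-similarity and reference-overlap filters described above, the following minimal Python sketch scores one name-matched candidate pair. It is not the authors' pipeline. The function names, the dictionary layout, the toy vectors and reference IDs, the thresholds (`min_similarity`, `min_shared`), and the rule that either signal is enough to keep a pair are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Return A.B / (|A| |B|), or 0.0 if either vector has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def shared_references(pub_refs, patent_refs):
    """Return the set of reference identifiers cited by both documents."""
    return set(pub_refs) & set(patent_refs)


def supports_pair(pub, patent, min_similarity=0.8, min_shared=1):
    """Hypothetical rule: keep a name-matched pair if either filter supports it.

    The thresholds and the OR combination are illustrative assumptions,
    not values taken from the paper.
    """
    similarity = cosine_similarity(pub["vector"], patent["vector"])
    n_shared = len(shared_references(pub["refs"], patent["refs"]))
    return similarity >= min_similarity or n_shared >= min_shared


# Toy data: the vectors and reference IDs are invented for illustration.
publication = {"vector": [0.9, 0.1, 0.4], "refs": {"ref-1", "ref-2", "ref-3"}}
patent_same_topic = {"vector": [0.8, 0.2, 0.5], "refs": {"ref-2", "ref-9"}}
patent_homonym = {"vector": [0.0, 1.0, 0.0], "refs": {"ref-7"}}

print(supports_pair(publication, patent_same_topic))
print(supports_pair(publication, patent_homonym))
```

In this toy setup the first patent points in nearly the same direction as the publication and shares one reference with it, so the pair is kept. The second patent, imagined as coming from an inventor who merely shares the author's name, has a dissimilar vector and no shared references, so the pair is discarded.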