Identifying patent classification codes associated with specific search keywords using machine learning

Wan Mohammad Faris Zaini, Daphne Teck Ching Lai, Ren Chong Lim
DOI: https://doi.org/10.1016/j.wpi.2022.102153
2022-11-04
World Patent Information
Abstract: The purpose of this research is to retrieve relevant patent documents and to identify the classification codes and search keywords that best characterize a given technological domain in the patent literature. The World Intellectual Property Organization (WIPO) has recorded a rising number of patent applications filed under the Patent Cooperation Treaty (PCT), which is becoming the norm for filing patents in multiple jurisdictions. As such, PCT documents are a valuable source of information on innovation activities with some degree of entrepreneurial intention. However, searching for relevant patent documents can be a daunting and uncertain process. We constructed a high-dimensional matrix, the code-keyword matrix, from two data types: classification codes and search keywords. We then applied two machine learning algorithms, principal component analysis (PCA) and k-means clustering, to derive insights from this high-dimensional dataset. The combined method yielded a two-dimensional PCA biplot and a clustering of an optimized PCA dataset called Eigen-PCA. With these algorithms, we were able to identify correlations between the two data types. We also clustered the classification codes into least-relevance, medium-relevance, and high-relevance groups for the domain of anti-corrosion technologies, an impactful area for steel infrastructure in maritime environments. Such patent data analytics can be adapted to other areas such as medical technologies, the green energy transition towards Net Zero, and the conservation of biological diversity.
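The pipeline described in the abstract (a code-keyword matrix reduced with PCA, then grouped with k-means) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the code labels, keywords, and counts are invented toy values, the helper names `pca_2d` and `kmeans` are hypothetical, and the sketch does not reproduce the paper's Eigen-PCA optimization, whose details the abstract does not give.

```python
import numpy as np

# Toy code-keyword matrix (invented values, for illustration only).
# Rows are classification codes, columns are search keywords, and each
# entry is an assumed count of documents matching both.
codes = ["code_A", "code_B", "code_C", "code_D", "code_E", "code_F"]
keywords = ["corrosion", "coating", "steel", "marine"]
matrix = np.array([
    [40.0, 35.0, 30.0, 20.0],
    [38.0, 30.0, 28.0, 22.0],
    [15.0, 18.0, 12.0,  9.0],
    [14.0, 16.0, 10.0,  8.0],
    [ 2.0,  1.0,  3.0,  1.0],
    [ 1.0,  2.0,  1.0,  0.0],
])

def pca_2d(data):
    """Project rows onto the first two principal components via SVD."""
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :2] * s[:2]             # coordinates of each code
    loadings = vt[:2].T                   # direction of each keyword
    explained = s ** 2 / np.sum(s ** 2)   # variance ratio per component
    return scores, loadings, explained[:2]

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with a seeded random initialization."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

scores, loadings, explained = pca_2d(matrix)
labels, centers = kmeans(scores, k=3)  # three groups, mirroring the three relevance levels

for code, (pc1, pc2), label in zip(codes, scores, labels):
    print(f"{code}: PC1={pc1:.2f}, PC2={pc2:.2f}, cluster={label}")
```

In a biplot, the rows of `scores` would be drawn as points and the rows of `loadings` as arrows, so codes and keywords that point in similar directions can be read as correlated. The cluster indices are arbitrary; mapping them to least, medium, and high relevance would need an extra step, for example ordering clusters by their mean keyword counts.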