$\texttt{PatentAgent}$: Intelligent Agent for Automated Pharmaceutical Patent Analysis

Xin Wang, Yifan Zhang, Xiaojing Zhang, Longhui Yu, Xinna Lin, Jindong Jiang, Bin Ma, Kaicheng Yu
2024-10-26
Abstract: Pharmaceutical patents play a vital role in biochemical industries, especially in drug discovery, providing researchers with unique early access to data, experimental results, and research insights. With the advancement of machine learning, patent analysis has evolved from manual labor to tasks assisted by automatic tools. However, there is still no unified agent that assists with every aspect of patent analysis, from patent reading to core chemical identification. Leveraging the capabilities of Large Language Models (LLMs) to understand requests and follow instructions, we introduce the $\textbf{first}$ intelligent agent in this domain, $\texttt{PatentAgent}$, poised to advance and potentially revolutionize the landscape of pharmaceutical research. $\texttt{PatentAgent}$ comprises three key end-to-end modules -- $\textit{PA-QA}$, $\textit{PA-Img2Mol}$, and $\textit{PA-CoreId}$ -- that respectively perform (1) patent question-answering, (2) image-to-molecular-structure conversion, and (3) core chemical structure identification, addressing the essential needs of scientists and practitioners in pharmaceutical patent analysis. Each module of $\texttt{PatentAgent}$ demonstrates significant effectiveness with the updated algorithm and the synergistic design of the $\texttt{PatentAgent}$ framework. $\textit{PA-Img2Mol}$ outperforms existing methods across the CLEF, JPO, UOB, and USPTO patent benchmarks with an accuracy gain between 2.46% and 8.37%, while $\textit{PA-CoreId}$ achieves an accuracy improvement ranging from 7.15% to 7.62% on the PatentNetML benchmark. Our code and dataset will be publicly available.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to address is how to efficiently and accurately analyze and extract key information from patents in drug development. Specifically, the paper points out several major issues in current drug patent analysis:

1. **Manual methods are time-consuming and labor-intensive**: Traditional methods such as manual review and keyword search, although considered the gold standard for patent analysis, require scientists to spend a significant amount of time and effort to extract information. They also rely on human experts to interpret complex chemical information, which is costly and inefficient.
2. **Existing computational tools lack an overall solution**: Current computational tools such as text mining and chemical structure exploration can independently accomplish certain tasks but lack a unified standard and integration. This makes coordination between multiple modules difficult, especially for researchers without a computer science background, posing obstacles to the use of these tools.
3. **Inaccurate identification of core compounds**: In drug patents, identifying the core compound structure from hundreds of chemical substances is an important task. However, the accuracy of existing tools for this task remains low, even approaching the level of random guessing.

To address these issues, the paper proposes an intelligent agent system named **PatentAgent**, which aims to achieve full-process automated analysis from patent reading to core chemical structure identification by integrating large language models (LLMs) and other advanced computational methods. PatentAgent includes three main modules:

1. **PA-QA**: A question-answering chatbot capable of accurately responding to users' natural language queries about patents.
2. **PA-Img2Mol**: A deep learning model ensemble that can convert chemical structure images into molecular expressions (SMILES).
3. **PA-CoreId**: A machine learning classifier that can identify core chemical structures from various chemical substances.

Through the collaborative work of these modules, PatentAgent can significantly improve the accuracy and efficiency of drug patent analysis, reducing the time and effort required from researchers.
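
To make the division of labor among the three modules concrete, here is a minimal Python sketch of how they might be chained. This is an illustration under stated assumptions, not the authors' implementation: the names `PatentReport` and `analyze_patent`, the callable signatures, and the idea that PA-CoreId consumes the SMILES strings produced by PA-Img2Mol are all hypothetical. The three `stub_*` functions are trivial placeholders that stand in for the real models so the example runs, and the patent text is made up.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PatentReport:
    """Hypothetical container for the outputs of the three modules."""
    answer: str        # PA-QA: answer to the user's question
    smiles: List[str]  # PA-Img2Mol: one SMILES string per structure image
    core: str          # PA-CoreId: the SMILES chosen as the core structure


def analyze_patent(
    text: str,
    images: Sequence[bytes],
    question: str,
    qa: Callable[[str, str], str],
    img2mol: Callable[[bytes], str],
    core_id: Callable[[List[str]], str],
) -> PatentReport:
    """Run the three stages in order and collect their outputs."""
    answer = qa(text, question)
    smiles = [img2mol(image) for image in images]
    core = core_id(smiles)  # assumption: core identification works on SMILES
    return PatentReport(answer=answer, smiles=smiles, core=core)


# Placeholder stand-ins so the sketch runs. They are NOT the paper's models.
def stub_qa(text: str, question: str) -> str:
    # Ignores the question and returns the first sentence of the text.
    return text.split(".")[0].strip()


def stub_img2mol(image: bytes) -> str:
    # Pretends the "image" bytes already contain a SMILES string.
    return image.decode("ascii")


def stub_core_id(smiles: List[str]) -> str:
    # Arbitrary rule (shortest string), for illustration only.
    return min(smiles, key=len)


report = analyze_patent(
    text="Compounds of formula I inhibit kinase X. Further examples follow.",
    images=[b"CCO", b"c1ccccc1O"],
    question="What do the compounds do?",
    qa=stub_qa,
    img2mol=stub_img2mol,
    core_id=stub_core_id,
)
print(report)
```

Passing the modules in as callables keeps the orchestration independent of any particular model, so each stub could be swapped for a real component without changing `analyze_patent`.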