NETWORK ANALYSIS OF THE ORGANIC CHEMISTRY IN PATENTS, LITERATURE, AND PHARMACEUTICAL INDUSTRY

Thierry Kogej, Emma Svensson, Emma Rydholm, Tomas Bastys, Christos Kannas, Mikhail Kabeshov, Samuel Genheden, Ola Engkvist
DOI: https://doi.org/10.26434/chemrxiv-2024-h4qlt
2024-10-03
Abstract: Chemical reactions can be connected in large networks such as knowledge graphs. In this way, prior work has been able to draw meaningful conclusions about the structures and properties of the included organic chemistry. However, the research has focused on public sources of organic chemistry that might lack the intricate details of the synthesis routes used in in-house drug discovery. In this work, we expand on previous analyses to also include an in-house electronic lab notebook (ELN), such that important differences between the network architectures can be investigated. Three chemical reaction knowledge graphs were constructed from US Patent and Trademark Office (USPTO), Reaxys, and an in-house ELN, respectively. The three knowledge graphs were compared. We found that the Reaxys knowledge graph is the most interconnected, whereas the USPTO and ELN knowledge graphs appear more arranged around a few central nodes. These differences might be attributed to the different origins of the data in the three sources.
Chemistry
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to compare knowledge graphs of organic chemical reactions from different data sources (patents, literature, and internal laboratory notebooks from the pharmaceutical industry) using network analysis methods. Specifically, the researchers hope to:

1. **Extend Previous Research**: Previous studies have mainly focused on publicly available organic chemical reactions, which may lack detailed information on the complex synthetic routes used in the drug discovery process within pharmaceutical companies. This study extends the analysis of these reaction networks by including data from internal electronic laboratory notebooks (ELNs).
2. **Compare Knowledge Graphs from Different Sources**: The researchers constructed three different knowledge graphs based on data from Reaxys, the United States Patent and Trademark Office (USPTO), and internal ELNs, and conducted detailed network analysis on these graphs.
3. **Explore Network Characteristics**: By analyzing the connectivity, core nodes, central nodes, path lengths, and other characteristics of these knowledge graphs, the researchers aim to reveal the differences between the data sources and their impact on synthetic prediction modeling.
4. **Propose Hypotheses**: The researchers hypothesize that these differences may be due to the distinct characteristics of the data sources and discuss the potential impact of these differences on synthetic prediction modeling.

### Main Research Content

- **Data Sources**: The researchers extracted reaction data from Reaxys, USPTO, and internal ELNs.
- **Knowledge Graph Construction**: Using the same Extract-Transform-Load (ETL) pipeline, these data were transformed into knowledge graphs.
- **Network Analysis**: Detailed network analysis was conducted on these knowledge graphs by calculating metrics such as degree distribution, average shortest path length, centrality, and clustering coefficient.
- **Results Comparison**: The differences in connectivity, core nodes, central nodes, and other aspects of the knowledge graphs from different sources were compared, and the reasons for these differences were explored.

### Research Significance

Through this study, the researchers hope to better understand how the reaction networks built from different data sources differ, thereby providing more valuable references for future synthetic prediction modeling. In particular, this study helps to reveal the differences between internal laboratory data and publicly available data, which is of great significance for drug discovery in the pharmaceutical industry.
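
To make the network metrics named under Main Research Content more concrete, here is a minimal sketch of how two of them (degree distribution and average shortest path length) could be computed on a reaction graph. It is an illustration under stated assumptions, not the authors' pipeline: the toy reactions, the bipartite molecule-reaction layout, and all function names are hypothetical, and only the Python standard library is used.

```python
from collections import Counter, deque

# Hypothetical toy data (not taken from USPTO, Reaxys, or any ELN):
# reaction id -> (reactants, products).
REACTIONS = {
    "R1": (["A", "B"], ["C"]),
    "R2": (["C", "D"], ["E"]),
    "R3": (["A", "D"], ["F"]),
    "R4": (["E", "F"], ["G"]),
}


def build_graph(reactions):
    """Build a directed graph with edges reactant -> reaction -> product.

    Assumption: molecules and reactions are both nodes (a bipartite layout).
    The source does not state which graph schema the paper uses.
    """
    successors = {}
    for rxn, (reactants, products) in reactions.items():
        successors.setdefault(rxn, set())
        for mol in reactants:
            successors.setdefault(mol, set()).add(rxn)
        for mol in products:
            successors.setdefault(mol, set())
            successors[rxn].add(mol)
    return successors


def degree_distribution(successors):
    """Count how many nodes have each total (in + out) degree."""
    degree = Counter({node: len(out) for node, out in successors.items()})
    for out in successors.values():
        for target in out:
            degree[target] += 1
    return Counter(degree.values())


def average_shortest_path_length(successors):
    """Mean directed shortest-path length over all reachable ordered pairs."""
    total, pairs = 0, 0
    for source in successors:
        # Breadth-first search gives shortest paths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in successors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    graph = build_graph(REACTIONS)
    print("degree distribution:", dict(degree_distribution(graph)))
    print("average shortest path:", average_shortest_path_length(graph))
```

In this sketch a graph organized around a few hub nodes would show a long tail in the degree distribution, while a highly interconnected graph would tend toward shorter average paths. Averaging only over reachable pairs is one possible convention for a directed graph that is not strongly connected; at the scale of real reaction databases a dedicated graph library or graph database would be the more practical choice.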