Abstract:

Background: Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents), the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This online resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.

Results: Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and merged extractions.

Conclusion: This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.
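The set-intersection step mentioned in the Results can be illustrated with a minimal Python sketch. It assumes each document's extracted structures have already been reduced to a set of string identifiers (for example InChIKeys). The function name, the document labels and the placeholder keys are hypothetical, chosen only for illustration; this is not the workflow or tooling used in the paper.

```python
from typing import Dict, Set


def compounds_in_common(extractions: Dict[str, Set[str]]) -> Set[str]:
    """Return the identifiers that occur in every extraction set."""
    sets = list(extractions.values())
    if not sets:
        return set()
    # set.intersection accepts any number of sets and keeps shared members only.
    return set.intersection(*sets)


# Placeholder identifiers standing in for InChIKeys (hypothetical, not real data).
extractions = {
    "patent": {"KEY-A", "KEY-B", "KEY-C"},
    "paper": {"KEY-B", "KEY-C", "KEY-D"},
    "abstracts": {"KEY-C", "KEY-E"},
}

# Compounds shared by all three sources.
shared_by_all = compounds_in_common(extractions)

# Compounds shared by the patent and the paper only.
patent_and_paper = extractions["patent"] & extractions["paper"]
```

Comparing on a canonical identifier rather than on raw names or SMILES strings is a design assumption here: it keeps the comparison independent of how each document happened to write the structure.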

Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes

Computational Discovery of Transition-metal Complexes: From High-throughput Screening to Machine Learning

Modern Semiempirical Electronic Structure Methods and Machine Learning Potentials for Drug Discovery: Conformers, Tautomers, and Protonation States

Applying Large Graph Neural Networks to Predict Transition Metal Complex Energies Using the tmQM_wB97MV Dataset

Toward AI/ML-assisted Discovery of Transition Metal Complexes

Graph neural networks for predicting metal–ligand coordination of transition metal complexes

DigiMOF: A Database of Metal-Organic Framework Synthesis Information Generated via Text Mining

MISATO - Machine Learning Dataset for Structure-Based Drug Discovery

DART: Unlocking Coordination Chemistry Beyond the Cambridge Structural Database

Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing

Exploiting Ligand Additivity for Transferable Machine Learning of Multireference Character Across Known Transition Metal Complex Ligands

Natural language processing in text mining for structural modeling of protein complexes

SC1MC-2022: A database of transition metal complexes for training ML models to predict one-site entropies and mutual information

Fine-tuning Large Language Models for Chemical Text Mining

Ligand additivity relationships enable efficient exploration of transition metal chemical space

BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

From Data to Chemistry: Revealing Causality and Reaction Coordinates through Interpretable Machine Learning in Supramolecular Transition Metal Catalysis

Extracting and connecting chemical structures from text sources using chemicalize.org

Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space