Abstract:

Background: Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents), the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.

Results: Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and merged extractions.

Conclusion: This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.
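
The set intersections used to detect compounds-in-common between documents can be illustrated with a minimal Python sketch. It assumes each extraction has already been exported as a list of InChIKey strings, one list per document; the function names and the identifiers below are hypothetical placeholders, not real InChIKeys and not part of chemicalize.org or PubChem.

```python
from __future__ import annotations

from typing import Iterable


def normalize_keys(keys: Iterable[str]) -> set[str]:
    """Trim whitespace, upper-case, and drop empty entries.

    Exported identifier lists often carry stray spaces or mixed case,
    which would otherwise hide genuine matches.
    """
    return {k.strip().upper() for k in keys if k and k.strip()}


def compounds_in_common(doc_a: Iterable[str], doc_b: Iterable[str]) -> list[str]:
    """Return the identifiers present in both extractions, sorted."""
    return sorted(normalize_keys(doc_a) & normalize_keys(doc_b))


def merged_extraction(*docs: Iterable[str]) -> set[str]:
    """Return the union of several extractions with duplicates removed."""
    merged: set[str] = set()
    for doc in docs:
        merged |= normalize_keys(doc)
    return merged


# Hypothetical placeholder identifiers (NOT real InChIKeys).
KEY_A = "AAAAAAAAAAAAAA-AAAAAAAAAA-N"
KEY_B = "BBBBBBBBBBBBBB-BBBBBBBBBB-N"
KEY_C = "CCCCCCCCCCCCCC-CCCCCCCCCC-N"
KEY_D = "DDDDDDDDDDDDDD-DDDDDDDDDD-N"

patent_keys = [KEY_A, KEY_B, KEY_C]                 # e.g. from a patent
paper_keys = ["  " + KEY_B.lower(), KEY_C, KEY_D]   # e.g. from a journal paper
abstract_keys = [KEY_C, KEY_A]                      # e.g. from a batch of abstracts

# Compounds shared by the patent and the paper.
shared = compounds_in_common(patent_keys, paper_keys)

# Compounds in the paper that also occur in the merged patent + abstract set.
paper_vs_merged = compounds_in_common(
    paper_keys, merged_extraction(patent_keys, abstract_keys)
)

print(shared)
print(paper_vs_merged)
```

Matching on full InChIKeys treats stereoisomers and salt forms as distinct compounds; comparing only the first 14-character block (the connectivity layer) would be a looser alternative, at the cost of merging structures that differ only in stereochemistry.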

Systematic Extraction of Analogue Series from Large Compound Collections Using a New Computational Compound–Core Relationship Method

Cheminformatic Analysis of Core-Atom Transformations in Pharmaceutically Relevant Heteroaromatics

Prompt Engineering for Transformer-based Chemical Similarity Search Identifies Structurally Distinct Functional Analogues

Similarity based functionalization for enumeration of synthetically plausible chemical libraries surrounding a target

Customizable Generation of Synthetically Accessible, Local Chemical Subspaces

Identification of bioactive compounds with popular single-atom modifications: Comprehensive analysis and implications for compound design

DeLA-Drug: A Deep Learning Algorithm for Automated Design of Druglike Analogues

Optimizing substructure search: a novel approach for efficient querying in large chemical databases

Expanding Chemical Frontiers: Approaches for Generating Diverse and Bioactive Natural Product‐Like Compounds Libraries from Extracts

Automated Chemical Reaction Extraction from Scientific Literature

Learning to Plan Chemical Syntheses

(Semi-) Automatic Review Process for Common Compound Characterization Data in Organic Synthesis

Extension of multi-site analogue series with potent compounds using a bidirectional transformer-based chemical language model

Improving the chemical profiling of complex natural extracts by joint 13C NMR and LC-HRMS2 analysis and the querying of in silico generated chemical databases

"DompeKeys": a set of novel substructure-based descriptors for efficient chemical space mapping, development and structural interpretation of machine learning models, and indexing of large databases

Structure elucidation of small organic molecules by contemporary computational chemistry methods

Finding relevant retrosynthetic disconnections for stereocontrolled reactions

Rxn-INSIGHT: fast chemical reaction analysis using bond-electron matrices

Extracting and connecting chemical structures from text sources using chemicalize.org

DrugSynthMC: an atom based generation of drug-like molecules with Monte Carlo Search

From UK-2A to florylpicoxamid: Active learning to identify a mimic of a macrocyclic natural product