Abstract:

Background: Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents), the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.

Results: Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and merged extractions.

Conclusion: This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.
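
The set-intersection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each document's extraction has already been reduced to a set of structure identifiers (for example InChIKeys); the function name `compounds_in_common`, the document labels and the `KEY-*` identifiers are hypothetical placeholders, not data or tooling from the study.

```python
from itertools import combinations


def compounds_in_common(extractions):
    """Return {(doc_a, doc_b): shared identifiers} for every pair of documents.

    `extractions` maps a document label to the collection of structure
    identifiers extracted from it (hypothetical input format).
    """
    shared = {}
    # Sort by label so the pair keys come out in a predictable order.
    for (name_a, ids_a), (name_b, ids_b) in combinations(sorted(extractions.items()), 2):
        shared[(name_a, name_b)] = set(ids_a) & set(ids_b)
    return shared


# Placeholder identifiers standing in for InChIKeys; not real compounds.
extractions = {
    "patent": {"KEY-A", "KEY-B", "KEY-C"},
    "paper": {"KEY-B", "KEY-C", "KEY-D"},
    "abstracts": {"KEY-C", "KEY-E"},
}

# Compounds shared by each pair of documents.
pairwise = compounds_in_common(extractions)

# Compounds present in every document.
in_all = set.intersection(*map(set, extractions.values()))
```

Comparing on a canonical identifier rather than on raw names or SMILES strings is the design choice that matters here, since the same structure can be written in many ways in text.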

CAZyme3D: a database of 3D structures for carbohydrate-active enzymes

The carbohydrate-active enzymes database (CAZy) in 2013

cazy_webscraper: local compilation and interrogation of comprehensive CAZyme datasets

dbCAN-seq: a database of carbohydrate-active enzyme (CAZyme) sequence and annotation

CAZac: an activity descriptor for carbohydrate-active enzymes

The carbohydrate-active enzyme database: functions and literature

The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics

Large-scale computational analyses of gut microbial CAZyme repertoires enabled by Cayman

Carbohydrate-active enzyme annotation in microbiomes using dbCAN

dbCAN2: a meta server for automated carbohydrate-active enzyme annotation

dbCAN-seq update: CAZyme gene clusters and substrates in microbiomes

dbCAN-PUL: a database of experimentally characterized CAZyme gene clusters and their substrates

RCSB protein Data Bank: exploring protein 3D similarities via comprehensive structural alignments

iCAZyGFADB: an insect CAZyme and gene function annotation database

PDBlocal: A Web-Based Tool for Local Inspection of Biological Macromolecular 3D Structures

PubChem3D: a New Resource for Scientists.

CFam: a chemical families database based on iterative selection of functional seeds and seed-directed compound clustering

CS23D: a web server for rapid protein structure generation using NMR chemical shifts and sequence data

PCAS – a Precomputed Proteome Annotation Database Resource

Extracting and connecting chemical structures from text sources using chemicalize.org

3CDB: a manually curated database of chromosome conformation capture data