Abstract: This study proposes a text similarity model to support biocuration of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to identify the specific referenced articles that support the assertions in curator-composed sentences. Moreover, CDD curators desire an alert system that scans newly published literature and proposes articles relevant to existing CDD records. Our approach to these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as from articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature and gave the pairs to curators for review. Through this exercise, we found that we could map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset reviewed by curators, we successfully provided references for 86% of the curator statements. In addition, we suggested new articles for curator review, and curators accepted 50% of them for addition to the reference list. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitutes the CDD text similarity corpus. This corpus contains 5159 sentence pairs judged for their similarity on a scale from 1 (low) to 5 (high), doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms.
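The matching step described in the abstract, pairing a curator-written statement with its top 10 candidate sentences, can be illustrated with a minimal sketch. The abstract does not specify the similarity model, so the TF-IDF cosine scoring below is a generic stand-in rather than the authors' method, and the names `tokenize`, `cosine` and `rank_candidates` as well as the example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(statement, candidates, k=10):
    """Return the k candidates most similar to the statement.

    Assumption: TF-IDF weighting with smoothed IDF computed over the
    candidate sentences only. This is an illustrative stand-in, not the
    model used in the study.
    """
    docs = [tokenize(c) for c in candidates]
    n = len(docs)
    # Document frequency: number of candidate sentences containing each term.
    df = Counter(t for d in docs for t in set(d))

    def vectorize(tokens):
        tf = Counter(tokens)
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                for t, c in tf.items()}

    query = vectorize(tokenize(statement))
    scored = [(cosine(query, vectorize(d)), c) for d, c in zip(docs, candidates)]
    # Highest similarity first; ties keep the original candidate order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the CDD corpus.
    statement = "The WD40 repeat domain mediates protein-protein interactions."
    candidates = [
        "WD40 repeat domains mediate protein-protein interactions in many complexes.",
        "Samples were centrifuged at 4 degrees for ten minutes.",
        "The kinase domain binds ATP.",
    ]
    for score, sentence in rank_candidates(statement, candidates, k=10):
        print(f"{score:.2f}  {sentence}")
```

In a curation setting, the ranked list would be handed to a curator for review rather than accepted automatically, as the evaluation described above suggests.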

PubMed Text Similarity Model and Its Application to Curation Efforts in the Conserved Domain Database.

RCSB Protein Data Bank: Enabling biomedical research and drug discovery

RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy

3CDB: a manually curated database of chromosome conformation capture data

RCSB Protein Data Bank: supporting research and education worldwide through explorations of experimentally determined and computationally predicted atomic level 3D biostructures

Facilities that make the PDB data collection more powerful

RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning