Abstract: We constructed the Cardiac Organellar Peptide Atlas Library (COPa library) as a targeted and interactive resource for the cardiovascular community. Annotated peptide spectra are hosted in a relational database, organized modularly by species (e.g. human, mouse) and organelle (e.g. mitochondria, proteasome). Within this release of the COPa library, a total of 108,268 spectra have been disseminated via two avenues. A web portal was established to navigate the library via a parallel set of identifiers, such as protein name, accession number, and gene symbol. In parallel, a web-service cyber-infrastructure was engineered to aid the annotation of mass spectra submitted over the internet. Large raw spectra files are dissected into small data packages on the local PC before submission. This workflow circumvents network bandwidth limitations and enables parallel data submission and search. A benchmark test with 897,327 MS/MS spectra showed that library searching covers 93.4% of the proteins identified via database searching and identifies an additional 23.9% of proteins at the same level of statistical confidence. In addition, a wiki-like web interface was embedded in the library web portal to facilitate the synthesis of consensus knowledge among the cardiovascular community on innovations in functional proteomics. Overall, the COPa library search supports targeted proteomic characterization, which complements database searching for exploratory surveys. The implementation of the COPa library-based proteomic knowledgebase leverages state-of-the-art technology and annotated datasets from the research community at large. Its application bridges discovery-driven and hypothesis-driven research while fostering translational medicine.
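The client-side workflow described above (dissecting a large spectra file into small packages and submitting them in parallel) could be sketched roughly as follows. This is a minimal illustration, not the COPa implementation: the function names `split_into_packages`, `submit_packages`, and `mock_search`, the package size, and the use of a thread pool are all assumptions, and `mock_search` merely stands in for whatever call the real web service exposes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def split_into_packages(spectra: Sequence[str], package_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `package_size` spectra."""
    if package_size < 1:
        raise ValueError("package_size must be at least 1")
    for start in range(0, len(spectra), package_size):
        yield list(spectra[start:start + package_size])


def submit_packages(
    packages: Sequence[List[str]],
    search_fn: Callable[[List[str]], List[str]],
    max_workers: int = 4,
) -> List[List[str]]:
    """Send each package to `search_fn` concurrently; results keep package order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search_fn, packages))


def mock_search(package: List[str]) -> List[str]:
    """Hypothetical stand-in for a remote library search on one package."""
    return [f"annotated:{spectrum}" for spectrum in package]


# Ten placeholder spectra split into packages of four (assumed size).
spectra = [f"spectrum_{i}" for i in range(10)]
packages = list(split_into_packages(spectra, 4))
results = submit_packages(packages, mock_search)
```

Because each package is small and independent, a slow or failed upload affects only that package, and several packages can be in flight at once, which is the property the abstract attributes to the workflow.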