Abstract: X-ray crystallography is the primary approach for solving the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of the multi-step experimental procedure, which comprises sequence cloning, protein material production, purification, crystallization and, ultimately, structure determination, to yield diffraction-quality crystals. Accordingly, predicting from the protein sequence the propensity of a protein to pass these experimental steps may help narrow down laborious experimental efforts and facilitate target selection. A number of sequence-based bioinformatics methods have been developed for this purpose. However, our knowledge of the important determinants of a protein sequence's propensity to produce diffraction-quality crystals remains largely incomplete. In practice, most existing methods perform worse when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date benchmark dataset and subsequently developed a new approach, termed 'PredPPCrys', based on the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance for the five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, the prediction outputs of PredPPCrys I were used as inputs to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that PredPPCrys outperforms most existing methods on both up-to-date and previous datasets.
In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys.
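
To illustrate what a "sequence-derived feature" can look like, the following minimal sketch computes amino acid composition, i.e. the fraction of each of the 20 standard residues in a sequence. This is an assumption chosen for illustration: the abstract does not list the features PredPPCrys actually uses, and the function name `amino_acid_composition` is hypothetical. A vector of this kind is the sort of input that could be passed to a first-level classifier.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order, so that feature vectors
# computed for different sequences are directly comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a length-20 list with the fraction of each standard residue.

    Illustrative only: not taken from the PredPPCrys implementation.
    Non-standard symbols (e.g. 'X') are ignored when counting.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the call.
    features = amino_acid_composition("MKTAYIAKQR")
    for aa, frac in zip(AMINO_ACIDS, features):
        if frac > 0:
            print(aa, round(frac, 2))
```

Fixing the residue order up front keeps every sequence mapped to the same 20 dimensions regardless of its length, which is what a fixed-input classifier such as an SVM requires.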

ProtParts, an automated web server for clustering and partitioning protein datasets

Fast-Part: Fast and Accurate Data Partitioning for Biological Sequence Analysis

AggreProt: a web server for predicting and engineering aggregation prone regions in proteins

SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity

Prediction of membrane protein types in a hybrid space.

Exploratory Predicting Protein Folding Model with Random Forest and Hybrid Features

Revealing data leakage in protein interaction benchmarks

Enhanced Protein Fold Prediction Method Through a Novel Feature Extraction Technique.

Privacy-Preserving Multi-Center Differential Protein Abundance Analysis with FedProt

HITS-PR-HHblits: Protein Remote Homology Detection by Combining PageRank and Hyperlink-Induced Topic Search

Towards Automatic Clustering of Protein Sequences

PredPPCrys: Accurate Prediction of Sequence Cloning, Protein Production, Purification and Crystallization Propensity from Protein Sequences Using Multi-Step Heterogeneous Feature Fusion and Selection

ProtDec-LTR2.0: an Improved Method for Protein Remote Homology Detection by Combining Pseudo Protein and Supervised Learning to Rank

dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation

PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels

ProtPlat: an efficient pre-training platform for protein classification based on FastText

ModFOLD9: a web server for independent estimates of 3D protein model quality

Exploring large protein sequence space through homology- and representation-based hierarchical clustering

PredHS: a Web Server for Predicting Protein–protein Interaction Hot Spots by Using Structural Neighborhood Properties

ProteomeExpert: a Docker image-based web server for exploring, modeling, visualizing and mining quantitative proteomic datasets

PSSP-RFE: Accurate Prediction of Protein Structural Class by Recursive Feature Extraction from PSI-BLAST Profile, Physical-Chemical Property and Functional Annotations