Prediction of Solubility of Proteins in Escherichia coli Based on Functional and Structural Features Using Machine Learning Methods

Feiming Huang,Qian Gao,XianChao Zhou,Wei Guo,KaiYan Feng,Lin Zhu,Tao Huang,Yu-Dong Cai
DOI: https://doi.org/10.1007/s10930-024-10230-z
2024-09-07
Abstract:Protein solubility is a critical parameter that determines the stability, activity, and functionality of proteins, with broad and far-reaching implications in biotechnology and biochemistry. Accurate prediction and control of protein solubility are essential for successful protein expression and purification in research and industrial settings. This study gathered information on soluble and insoluble proteins. In characterizing the proteins, they were mapped to STRING and characterized by functional and structural features. All functional/structural features were integrated to create a 5768-dimensional binary vector to encode proteins. Seven feature-ranking algorithms were employed to analyze the functional/structural features, yielding seven feature lists. These lists were subjected to the incremental feature selection, incorporating four classification algorithms, one by one to build effective classification models and identify functional/structural features with classification-related importance. Some essential functional/structural features used to differentiate between soluble and insoluble proteins were identified, including GO:0009987 (intercellular communication) and GO:0022613 (ribonucleoprotein complex biogenesis). The best classification model using support vector machine as the classification algorithm and 295 optimized functional/structural features generated the F1 score of 0.825, which can be a powerful tool to differentiate soluble proteins from insoluble proteins.
What problem does this paper attempt to address?