Sequence-Based Random Projection Ensemble Approach to Identify Hotspot Residues from Whole Protein Sequence.

Peng Chen, ShanShan Hu, Bing Wang, Jun Zhang
DOI: https://doi.org/10.1007/978-3-319-22186-1_37
2015-01-01
Abstract: Hot spot residues of proteins are key to performing specific functions in many biological processes. However, identifying hot spots by experimental methods is costly and time-consuming. Computational methods offer an alternative, identifying hot spots from sequence and structural information. However, structural information for a protein is not always available. In this paper, the problem of identifying hot spots is addressed using only statistical physicochemical properties of amino acids. First, 34 relatively independent physicochemical properties are extracted from the 544 properties in AAindex1. Because the hot spot data set is extremely imbalanced (the ratio of hot spots to non-hot spots is about 1.4%), the hot spot set and a subset of non-hot spots of roughly the same size form an initial input matrix. A random projection of this matrix yields the input to a REPTree classifier. Several random projections and different subsets of non-hot spots are combined to build an ensemble REPTree system. Experimental results on commonly used hot spot benchmark sets showed that, although our method performed worse, it complements experimental hot spot determination.
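The pipeline described in the abstract (balanced undersampling of non-hot spots, a random projection of the feature matrix, and an ensemble of base classifiers) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Gaussian random projection, majority voting, and a nearest-centroid base learner that stands in for REPTree, which is a Weka classifier. The toy data, the function names, and the parameters `n_members` and `n_components` are all hypothetical.

```python
import random


def random_projection_matrix(n_features, n_components, rng):
    """Gaussian random matrix (n_features x n_components), scaled by 1/sqrt(k)."""
    scale = 1.0 / n_components ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_features)]


def project(rows, matrix):
    """Map each feature vector into the lower-dimensional projected space."""
    k = len(matrix[0])
    return [[sum(x * matrix[i][j] for i, x in enumerate(row)) for j in range(k)]
            for row in rows]


def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_ensemble(hot, non_hot, n_members=15, n_components=5, seed=0):
    """Train one base learner per (non-hot subset, random projection) pair."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        # Balance the classes: draw as many non-hot spots as there are hot spots.
        subset = rng.sample(non_hot, min(len(hot), len(non_hot)))
        # Each member gets its own random projection.
        R = random_projection_matrix(len(hot[0]), n_components, rng)
        # Stand-in base learner (assumption): nearest centroid instead of REPTree.
        members.append((R, centroid(project(hot, R)), centroid(project(subset, R))))
    return members


def predict(members, row):
    """Majority vote over the ensemble; returns 1 for hot spot, 0 otherwise."""
    votes = 0
    for R, c_hot, c_non in members:
        z = project([row], R)[0]
        if sq_dist(z, c_hot) < sq_dist(z, c_non):
            votes += 1
    return 1 if 2 * votes > len(members) else 0


# Hypothetical toy data: 34 features per residue, about 1.4% hot spots.
data_rng = random.Random(42)
hot = [[data_rng.gauss(2.0, 0.5) for _ in range(34)] for _ in range(20)]
non_hot = [[data_rng.gauss(0.0, 0.5) for _ in range(34)] for _ in range(1400)]

members = train_ensemble(hot, non_hot)
print(predict(members, [2.0] * 34), predict(members, [0.0] * 34))
```

Giving each ensemble member a different non-hot subset lets the ensemble see more of the majority class than a single balanced training set would, while each member still trains on balanced data.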