CLASSIFICATION OF PROTEIN HOMO-OLIGOMERS USING AMINO ACID COMPOSITION DISTRIBUTION

SHI Jian-yu, PAN Quan, ZHANG Shao-wu, CHENG Yong-mei
DOI: https://doi.org/10.3321/j.issn:1000-6737.2006.01.008
2006-01-01
ACTA BIOPHYSICA SINICA
Abstract: As the gap widens between the rapidly growing number of known sequences and the slowly accumulating number of known structures, automatic classification based on primary sequences and known three-dimensional structures is becoming increasingly important. A fully automatic and reliable classification system is also needed because primary sequences carry much information useful to biologists. In general, the performance of such a system can be improved by choosing an appropriate feature extraction algorithm. A novel feature extraction method, amino acid composition distribution (AACD), has therefore been developed to classify protein homo-oligomers; it generalizes the 20 components of the conventional amino acid composition. The primary sequence is divided into several equal segments, and each element of the AACD array is calculated as the count of one of the 20 natural amino acids within a segment divided by the length of the corresponding sequence. The classification system uses support vector machines as the classifier, adopts the “One-Versus-One” strategy for multi-class categorization, and applies AACD to 4-class homo-oligomer classification from protein primary sequences. In 10-fold cross-validation (10CV) tests, the overall accuracy and accuracy index of AACD are 86.22% and 67.12%, which are 5.74 and 10.03 percentage points higher than those of amino acid composition, and 3.12 and 5.63 percentage points higher than those of the dipeptide composition (amino acid pairs) feature extraction method, respectively. Combining AACD with the length of the protein primary sequence slightly improves performance, giving an overall accuracy of 86.35% and an accuracy index of 67.23%.
Reducing the dimension of these combined feature vectors with two-dimensional principal component analysis (2DPCA) gives better results, with an overall accuracy of 87.12% and an accuracy index of 68.08%. The results demonstrate that AACD is an effective and reliable method for classifying homo-oligomers, that the length of a protein sequence carries some information about homo-oligomer structure, and that 2DPCA is an effective approach for reducing the high dimension of the feature vectors. The effectiveness of this homo-oligomer classification encourages further exploration of AACD.
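
The AACD feature described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions, since the abstract does not specify them: the function name `aacd`, the parameter `n_segments`, the integer-division rule for placing segment boundaries when the length is not divisible by the number of segments, and the reading of "length of the corresponding sequence" as the length of the whole sequence. Under that reading, a single segment reproduces the conventional 20-component amino acid composition.

```python
# The 20 natural amino acids in one-letter code (alphabetical order is an
# arbitrary choice made for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aacd(sequence, n_segments):
    """Return a flat list of 20 * n_segments AACD-style features.

    Assumption: each count is divided by the length of the whole
    sequence, so all features sum to 1 when the sequence contains only
    the 20 standard residues.
    """
    seq = sequence.upper()
    length = len(seq)
    if n_segments < 1 or length < n_segments:
        raise ValueError("need 1 <= n_segments <= len(sequence)")

    features = []
    for i in range(n_segments):
        # Assumed boundary rule: near-equal segments via integer division.
        start = i * length // n_segments
        end = (i + 1) * length // n_segments
        segment = seq[start:end]
        for aa in AMINO_ACIDS:
            # Residues outside the 20 standard letters are never counted.
            features.append(segment.count(aa) / length)
    return features


if __name__ == "__main__":
    vector = aacd("ACDACD", 2)
    print(len(vector))  # 40 features: 20 per segment
```

Calling `aacd(seq, 1)` gives the plain amino acid composition, so a larger `n_segments` only adds positional detail on top of that baseline. The resulting vectors could then be passed to any support vector machine implementation that offers a one-versus-one multi-class scheme.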