A Domain Driven Mining Algorithm on Gene Sequence Clustering

Yun Xiong, Ming Chen, Yangyong Zhu
DOI: https://doi.org/10.1007/978-0-387-79420-4_8
2009-01-01
Abstract: Recent biological experiments indicate that gene sequences judged similar by the arrangement of their nucleotides do not necessarily share functional similarity. As a result, state-of-the-art clustering algorithms that annotate genes with similar function based solely on sequence composition may fail. Gene clustering techniques that incorporate prior knowledge of the biological domain have therefore become an essential research subject in data mining, specifically in mining biological sequences. It is now commonly accepted that co-expressed genes generally belong to the same functional category. In this paper, a new similarity metric for gene sequence clustering based on features of such co-expressed genes is proposed, namely ‘Tendency Similarity on N-Same-Dimensions’. Based on this metric, a domain driven algorithm, ‘DD-Cluster’, is designed to group gene sequences into ‘Similar Tendency Clusters on N-Same-Dimensions’, i.e., co-expressed gene clusters. Compared with earlier clustering methods that consider the composition of gene sequences alone, the resulting ‘Similar Tendency Clusters on N-Same-Dimensions’ proved more reliable for assisting biologists in gene function annotation. The algorithm has been tested on real data sets and has shown high performance, and the clustering results have demonstrated its effectiveness.