Abstract: Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text par excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k = 2-9 bp) in human and mouse chromosome sequences and then check whether clustered words are enriched in the functional part of the genome. First, we found a general positive trend relating clustering level to word enrichment within exons and Transcription Factor Binding Sites (TFBSs), while a much weaker relation exists for repeats and none at all for introns. Second, we found that 38.45% of the 200 top-clustered 8-mers, but only 7.70% of the non-clustered words, are represented in known motif databases. Third, enrichment/depletion experiments show that highly clustered words are significantly enriched in exons and TFBSs, while they are depleted in introns and repetitive DNA. Considering exons and TFBSs together, 1417 (72.26%) of the top-clustered 8-mers in human and 1385 (72.97%) in mouse showed a statistically significant association with either exons or TFBSs, strongly supporting the link between word clustering and biological function. Lastly, we identified a subset of clustered, diagnostic words that are enriched in exons but depleted in introns and might therefore help to discriminate between these two gene regions. The clustering of DNA words thus appears as a novel principle to detect functionality in genome sequences.
As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and to improve gene and promoter predictions, thus contributing to the quest for function in the genome.
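
The idea of a per-word clustering coefficient can be illustrated with a minimal sketch. The abstract does not state the formula used, so the measure below is an assumption chosen for illustration: the coefficient of variation of the distances between consecutive occurrences of a word, divided by the value expected if the word were placed at random (for a geometric distribution with occurrence probability p, that value is sqrt(1 - p)). Under this assumption, values near 1 suggest random placement and values well above 1 suggest clustering. The function names `kmer_positions` and `clustering_coefficient` are hypothetical.

```python
from statistics import mean, pstdev


def kmer_positions(seq, word):
    """Return the start positions of all occurrences of word in seq (overlaps included)."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def clustering_coefficient(positions, seq_len):
    """Illustrative clustering measure (an assumption, not the paper's stated formula).

    Computes the coefficient of variation of the gaps between consecutive
    occurrences and normalizes it by sqrt(1 - p), the value expected for
    geometrically distributed gaps, where p = occurrences / seq_len.
    Returns None when there are too few occurrences to estimate a spread.
    """
    if len(positions) < 3:
        return None
    p = len(positions) / seq_len
    if p >= 1:
        return None
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    sigma = pstdev(gaps) / mean(gaps)
    return sigma / (1 - p) ** 0.5


if __name__ == "__main__":
    # Toy sequence: the word "ACG" occurs in two tight groups separated by filler.
    seq = "ACG" * 4 + "T" * 500 + "ACG" * 4
    pos = kmer_positions(seq, "ACG")
    print(pos)
    print(clustering_coefficient(pos, len(seq)))
```

On real chromosomes one would compute this value for every k-mer and rank the words by it; the toy sequence here only shows that two tight groups of occurrences yield a coefficient above 1, while evenly spaced occurrences yield 0.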

Clustering DNA Sequences by Feature Vectors

A New Distribution Vector and Its Application in Genome Clustering.

Fuzzy Kernel Clustering of RNA Secondary Structure Ensemble Using a Novel Similarity Metric.

A New Method to Cluster DNA Sequences Using Fourier Power Spectrum.

Data Clustering Algorithm for DNA Microarray Based on Graph Theory

A Novel Clustering Method Via Nucleotide-Based Fourier Power Spectrum Analysis

A Rapid Method for Characterization of Protein Relatedness Using Feature Vectors

A Novel Alignment-Free Vector Method to Cluster Protein Sequences

Vector Embeddings by Sequence Similarity and Context for Improved Compression, Similarity Search, Clustering, Organization, and Manipulation of cDNA Libraries

Super Paramagnetic Clustering of DNA Sequences.

Similarity Evaluation of DNA Sequences Based on Frequent Patterns and Entropy

A Novel Method of Characterizing Genetic Sequences: Genome Space with Biological Distance and Applications.

A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance

A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering

An advanced approach for DNA sequencing and similarities analysis on the basis of groupings of nucleotide bases

A Measure of DNA Sequence Similarity by Fourier Transform with Applications on Hierarchical Clustering

Similarity analysis of DNA sequences through local distribution of nucleotides in strategic neighborhood

A novel fast vector method for genetic sequence comparison

Clustering of DNA words and biological function: A proof of principle

Analysis of DNA sequences through local distribution of nucleotides in strategic neighborhoods

A Domain Driven Mining Algorithm on Gene Sequence Clustering