Abstract: Sequence organization is viewed from two perspectives: informational redundancy, or informational correlation (IC), and k-mer frequency statistics. Two problems are investigated. The first is how the ICs exceed the fluctuation bound, so that order emerges from fluctuation in a genome once the sequence length attains some critical value. We demonstrate that the transition from fluctuation to order takes place at a sequence length of about 200-300 thousand bases for the human and E. coli genomes. This means that life emerges from a region between the macroscopic and the microscopic. The second is the statistical law of k-mer organization in a genome under evolutionary pressure and functional selection. We deduce a sum rule Q(k,N) on the deviations of k-mer frequencies from randomness in an N-long genome sequence, and deduce the relations of Q(k,N) with k and N. We find that Q(k,N) increases with length N at a constant rate for most genome sequences, and demonstrate that ordering takes place when the functional selection of k-mers accumulates to some critical value. An important finding is that the sum rule correlates with the evolutionary complexity of the genome.
What problem does this paper attempt to address?
This paper attempts to solve two main problems:
1. **Emergence of information correlation exceeding the fluctuation limit and orderliness**:
- The paper explores how in the genome, when the sequence length reaches a certain critical value, the Information Correlation (IC) exceeds the fluctuation limit and orderliness emerges from the fluctuations. Specifically, by analyzing the genomes of humans and Escherichia coli (E. coli), the author found that this transition from fluctuation to order occurs at a sequence length of approximately 200,000 to 300,000 base pairs. This indicates that life emerges from the region between the macroscopic and the microscopic.
2. **Statistical laws of k-mer frequencies under evolutionary pressure and functional selection**:
- The paper also studies the statistical laws governing how k-mers (nucleotide sequence fragments of length k) are organized in the genome under evolutionary pressure and functional selection. The authors derive a sum rule \(Q(k, N)\) for the deviation of k-mer frequencies from a random distribution and examine how \(Q(k, N)\) depends on k and N. For most genome sequences, \(Q(k, N)\) increases at a constant rate with the sequence length N, and when the functional selection of k-mers accumulates to a certain critical value, order emerges. An important finding is that this sum rule is related to the evolutionary complexity of the genome.
### Main Conclusions
- **Transition from fluctuation to order**: For the human and E. coli genomes, when the sequence length reaches about 200,000 to 300,000 bases, the information correlation exceeds the fluctuation limit and order emerges.
- **Non-random sum rule of k-mer frequencies**: The degree \(Q(k, N)\) to which k-mer frequencies in the genome deviate from a random distribution increases with the sequence length N, and the rate of increase is constant for most genome sequences. When the accumulated functional selection of k-mers reaches a certain critical value, order emerges.
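The idea of a fluctuation limit can be illustrated with a small numerical experiment. The sketch below is not the authors' procedure. It assumes a second-order redundancy of the form \(\log_2 4 - H_M\), where \(H_M\) is the conditional entropy of a base given its predecessor, and estimates it on uniformly random sequences of increasing length. The names `second_order_redundancy` and `random_sequence` are hypothetical. A random sequence has no true correlation, so any nonzero estimate is pure finite-size fluctuation, and it shrinks as the length grows. A genuine correlation of fixed size therefore becomes distinguishable only once the sequence is long enough.

```python
import math
import random
from collections import Counter


def second_order_redundancy(seq):
    """Return log2(4) - H_M, where H_M is the conditional entropy
    of a base given the base immediately before it (an assumed definition)."""
    pairs = Counter(zip(seq, seq[1:]))   # counts of adjacent (previous, next) bases
    firsts = Counter(seq[:-1])           # counts of the conditioning (previous) base
    total = len(seq) - 1                 # number of adjacent pairs
    h_m = 0.0
    for (prev, _nxt), count in pairs.items():
        # joint frequency times log of the conditional frequency
        h_m -= (count / total) * math.log2(count / firsts[prev])
    return math.log2(4) - h_m


def random_sequence(n, rng):
    """Uniformly random sequence over A, C, G, T (the no-correlation baseline)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (1_000, 10_000, 100_000):
    print(n, second_order_redundancy(random_sequence(n, rng)))
```

For contrast, a perfectly periodic input such as `"ACGT" * 1000` has a fully determined successor for every base, so the same function returns the maximum value of 2 bits.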
### Formula Summary
- **Information entropy**:
\[
H = -\sum_i p_i \log_2 p_i
\]
- **First-order information redundancy**:
\[
D_1 = \log_2 4 - H = \log_2 4 + \sum_i p_i \log_2 p_i
\]
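- **Second-order information redundancy**:
\[
D_2 = \log_2 4 - H_M = \log_2 4 + \sum_{i,j} p_j \, p_{i|j} \log_2 p_{i|j}
\]
where \(H_M = -\sum_{i,j} p_j \, p_{i|j} \log_2 p_{i|j}\) is the Markov (conditional) entropy, \(p_j\) is the frequency of base \(j\), and \(p_{i|j}\) is the conditional probability of base \(i\) given the adjacent base \(j\).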
- **Sum rule of k-mer frequency deviation**:
\[
Q(k, N) = \frac{\sigma^2(k, N)}{\alpha_k}
\]
where \(\sigma^2(k, N)\) is the variance of k-mer frequencies, and \(\alpha_k\) is the expected value of k-mer frequencies in a random sequence.
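As a rough illustration of the sum rule, the sketch below evaluates a quantity of the form \(\sigma^2 / \alpha\) on a toy sequence. Several choices are assumptions rather than details taken from the paper: "frequency" is read as the overlapping occurrence count of each k-mer, the random expectation is taken as \((N - k + 1)/4^k\) for a uniformly random sequence, and the variance is computed over all \(4^k\) possible k-mers around that expectation. The function name `q_sum_rule` is hypothetical.

```python
from collections import Counter
from itertools import product


def q_sum_rule(seq, k):
    """Variance of k-mer counts divided by the count expected
    in a uniformly random sequence (assumed reading of Q(k, N))."""
    n_windows = len(seq) - k + 1
    # overlapping k-mer occurrence counts
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    # assumed random expectation: every one of the 4**k k-mers equally likely
    alpha = n_windows / 4 ** k
    # variance over all 4**k possible k-mers, including those never observed
    variance = sum(
        (counts["".join(kmer)] - alpha) ** 2
        for kmer in product("ACGT", repeat=k)
    ) / 4 ** k
    return variance / alpha


# A strongly non-random toy sequence: the value grows as the sequence gets longer.
for n in (100, 200, 400):
    print(4 * n, q_sum_rule("ACGT" * n, 2))
```

On this periodic toy input the value roughly doubles each time the length doubles, which is the kind of constant-rate growth with N that the summary describes. Real genomes would need handling for ambiguous bases such as `N`, which this sketch omits.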
These research results provide an important theoretical basis for understanding the information organization and evolution in the genome.