On minimizers and convolutional filters: theoretical connections and applications to genome analysis

Yun William Yu
2024-01-27
Abstract: Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. Here, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step towards building a deep learning assembler, though it is at present too slow to be practical. In total, this manuscript provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
Machine Learning, Artificial Intelligence, Genomics
### What problems does this paper attempt to solve?

This paper aims to explain why convolutional neural networks (CNNs) are effective on categorical biological sequences by connecting them to minimizers, with applications to genome analysis. Specifically:

1. **Theoretical connection**: Through a mathematical analysis of hash function properties, the paper shows that, for sequences over a categorical alphabet, convolutional filters with random Gaussian initialization followed by max-pooling are equivalent to choosing a particular minimizer ordering. A minimizer scheme hashes the k-mers within a sliding window and keeps the one with the minimum hash value, which gives a sparse representation of the sequence. A CNN extracts features with randomly initialized convolutional filters followed by max-pooling. Under the equivalent ordering, the selected k-mers are far (in Hamming distance) from the k-mers within the sequence but close to other minimizers.
2. **Density reduction in repetitive regions**: The paper shows experimentally that this property lowers the density of minimizers in highly repetitive sequences, both in simulation and on real human telomere data.
3. **Toward a deep-learning assembler**: The paper trains from scratch a CNN that embeds synthetic short reads from the SARS-CoV-2 genome into 3D Euclidean space, so that the embedding locally preserves the linear sequence distance between the read origins. This is not a complete assembler, and it is at present too slow to be practical, but it indicates how deep learning might be applied to genome assembly.
4. **Method selection in computational biology**: The paper discusses the trade-offs between traditional algorithms (such as minimizers) and deep-learning methods (such as CNNs), especially in efficiency and accuracy when processing large amounts of biological sequence data.

### Formula summary

- **Degree of a k-mer**: \[ \Delta(x)=\sum_{s \in S}\|x - s\|_1 \] where \(S\) is the set of unique k-mers and \(\Delta(x)\) is the degree of the k-mer \(x\), that is, its total Hamming distance from the other k-mers in \(S\).
- **Property of Gaussian convolution hash functions**: \[ \Pr\left(h(x)=\max_{s \in S}h(s)\right)\geq\Pr\left(h(y)=\max_{s \in S}h(s)\right) \] This inequality holds whenever \(\Delta(x)\geq\Delta(y)\), where \(h\) is the hash induced by a random Gaussian filter. In other words, k-mers that are farther in total Hamming distance from the rest of \(S\) are more likely to be selected as maximizers.
- **Conditional expectation**: \[ E[h(x)]>E[h(y)] \] This holds when \(\|x-\hat{s}\|_1<\|y - \hat{s}\|_1\), where \(\hat{s}\) is the global maximizer and both expectations are taken conditional on \(\hat{s}\) being the maximizer. In other words, k-mers closer to the current maximizer have higher expected hash values.

### Conclusion

Through these theoretical analyses and experimental results, the paper offers a partial explanation of why CNNs are effective on categorical biological sequences, and it suggests ways to combine deep learning with traditional computational biology algorithms in future work.
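
The quantities in the formula summary can be illustrated with a small sketch. This is not the paper's implementation: the function names, the toy k-mer set, the window parameters, and the trial count are all assumptions chosen for illustration, and the hash is simply the response of one random Gaussian filter to a one-hot encoded k-mer. Hamming distance is computed directly on the strings, following the description of \(\Delta(x)\) above.

```python
import random

ALPHABET = "ACGT"


def hamming(x, y):
    """Number of positions at which two equal-length k-mers differ."""
    return sum(a != b for a, b in zip(x, y))


def degree(x, kmers):
    """Delta(x): total Hamming distance from x to every k-mer in the set."""
    return sum(hamming(x, s) for s in kmers)


def gaussian_filter(k, rng):
    """One random filter: an i.i.d. N(0, 1) weight per (position, letter)."""
    return [{c: rng.gauss(0.0, 1.0) for c in ALPHABET} for _ in range(k)]


def conv_hash(kmer, filt):
    """Filter response on the one-hot encoded k-mer (a dot product)."""
    return sum(filt[i][c] for i, c in enumerate(kmer))


def window_maximizers(seq, k, w, filt):
    """Start positions kept by max-pooling over each window of w k-mers."""
    scores = [conv_hash(seq[i:i + k], filt) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(scores) - w + 1):
        picked.add(max(range(start, start + w), key=lambda i: scores[i]))
    return sorted(picked)


def selection_frequency(kmers, k, trials, seed=0):
    """Fraction of random filters under which each k-mer is the maximizer."""
    rng = random.Random(seed)
    wins = {s: 0 for s in kmers}
    for _ in range(trials):
        filt = gaussian_filter(k, rng)
        wins[max(kmers, key=lambda s: conv_hash(s, filt))] += 1
    return {s: wins[s] / trials for s in kmers}


# Toy set (an assumption): three near-identical k-mers and one distant one.
kmers = ["AAAA", "AAAC", "AACA", "TTTT"]
degrees = {s: degree(s, kmers) for s in kmers}
freqs = selection_frequency(kmers, k=4, trials=4000)
for s in kmers:
    print(s, "degree:", degrees[s], "selection frequency:", freqs[s])

# Window-based selection on a short, made-up sequence.
rng = random.Random(1)
print(window_maximizers("ACGTACGTTTGACA", k=4, w=3, filt=gaussian_filter(4, rng)))
```

In this toy set, `TTTT` has the largest degree, so by the inequality above it should be chosen as the maximizer more often than any of the three clustered k-mers. Replacing `max` with `min` gives the usual minimizer convention; for a zero-mean Gaussian filter the two are statistically interchangeable, since negating the filter swaps them.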