On minimizers and convolutional filters: theoretical connections and applications to genome analysis

Yun William Yu

2024-01-27

Abstract: Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. Here, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step towards building a deep learning assembler, though it is at present too slow to be practical. In total, this manuscript provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
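The window-minimizer step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `kmer_hash` and `minimizers`, the use of BLAKE2b as the hash, and the leftmost tie-breaking rule are all assumptions made for illustration.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Stand-in hash (assumption): any fixed ordering on k-mers would do."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int, key=kmer_hash):
    """Return the sorted (position, k-mer) pairs chosen as window minimizers.

    Each window spans w consecutive k-mers; the k-mer with the smallest
    key is kept, and ties go to the leftmost position (an assumption).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (key(kmers[i]), i))
        selected.add(best)
    return [(i, kmers[i]) for i in sorted(selected)]


if __name__ == "__main__":
    # Lexicographic ordering makes the selection easy to follow by hand.
    print(minimizers("ACGTACGT", k=3, w=2, key=lambda s: s))
```

Because adjacent windows often share the same minimum, far fewer positions are selected than there are k-mers, which is what makes the representation sparse.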

Machine Learning, Artificial Intelligence, Genomics

What problem does this paper attempt to address?

This paper aims to explain why convolutional neural networks (CNNs) are effective on categorical biological sequences by connecting them to minimizers, with applications to genome analysis. Specifically:

1. **Theoretical connection**: Through a mathematical analysis of hash function properties, the paper shows that, for sequences over a categorical alphabet, convolutional filters with random Gaussian initialization followed by max-pooling are equivalent to selecting minimizers under a specific ordering. Minimizers hash the k-mers within a sliding window and keep the one with the minimum hash value, which gives a sparse representation. CNNs extract features through randomly initialized convolutional filters and max-pooling operations.
2. **Density reduction in repetitive regions**: The paper shows experimentally that this equivalence leads to a lower density of selected k-mers in repetitive regions, both in simulation and on real human telomere data.
3. **Toward a deep learning assembler**: The paper trains a CNN from scratch to embed synthetic short reads from the SARS-CoV-2 genome into 3D Euclidean space so that the embedding locally preserves the linear sequence distance between read origins. This is not yet a complete assembler and is at present too slow to be practical, but it indicates a potential application of deep learning to genome assembly.
4. **Method selection in computational biology**: The paper discusses the trade-offs between traditional algorithms (such as minimizers) and deep learning methods (such as CNNs), especially in efficiency and accuracy when processing large amounts of biological sequence data.

### Formula summary

- **Degree of a k-mer**: \[ \Delta(x)=\sum_{s \in S}\|x - s\|_1 \] where \(\Delta(x)\) is the degree of k-mer \(x\), that is, its total Hamming distance from the other unique k-mers \(s \in S\).
- **Properties of Gaussian convolution hash functions**: \[ \Pr\left(h(x)=\max_{s \in S}h(s)\right)\geq\Pr\left(h(y)=\max_{s \in S}h(s)\right) \] This inequality holds when \(\Delta(x)\geq\Delta(y)\), meaning that k-mers with a larger total Hamming distance from the others are more likely to be selected as maximizers.
- **Conditional expectation**: \[ E[h(x)]>E[h(y)] \] when \(\|x-\hat{s}\|_1<\|y - \hat{s}\|_1\), where \(\hat{s}\) is the global maximizer. In other words, k-mers closer to the global maximizer have higher expected hash values.

### Conclusion

Through these theoretical analyses and experimental results, the paper gives a partial explanation of why CNNs are effective on categorical biological sequences and suggests ways to combine deep learning with traditional computational biology algorithms.
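The degree \(\Delta(x)\) and the Gaussian convolution hash \(h\) from the formula summary can be explored numerically. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes a DNA alphabet, treats a single filter with i.i.d. standard normal weights as \(h\), and estimates by simulation how often each k-mer is the maximizer. The function names (`degree`, `conv_hash`, `maximizer_frequencies`) are hypothetical.

```python
import random
from collections import Counter

ALPHABET = "ACGT"  # assumed categorical alphabet


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length k-mers differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def degree(x: str, kmers) -> int:
    """Delta(x): total Hamming distance from x to each unique k-mer."""
    return sum(hamming(x, s) for s in set(kmers))


def random_gaussian_filter(k: int, rng: random.Random):
    """One k x |alphabet| filter with i.i.d. standard normal weights."""
    return [{c: rng.gauss(0.0, 1.0) for c in ALPHABET} for _ in range(k)]


def conv_hash(kmer: str, filt) -> float:
    """Filter response on the one-hot encoded k-mer: one weight per position."""
    return sum(weights[c] for weights, c in zip(filt, kmer))


def maximizer_frequencies(kmers, trials: int = 5000, seed: int = 0):
    """Estimate Pr(h(x) = max over S of h(s)) for each unique k-mer x."""
    rng = random.Random(seed)
    unique = sorted(set(kmers))
    k = len(unique[0])
    wins = Counter()
    for _ in range(trials):
        filt = random_gaussian_filter(k, rng)
        # Max-pooling over the set keeps the k-mer with the largest response.
        wins[max(unique, key=lambda s: conv_hash(s, filt))] += 1
    return {s: wins[s] / trials for s in unique}


if __name__ == "__main__":
    S = ["AAA", "AAC", "TTT"]
    print({s: degree(s, S) for s in S})
    print(maximizer_frequencies(S))
```

In this toy set, `TTT` has the largest degree, and it is also selected as the maximizer most often in simulation, in line with the inequality above. `AAA` and `AAC` share two of their three filter weights, so their responses are correlated and they compete with each other.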

Convolutional Neural Networks: A Promising Deep Learning Architecture for Biological Sequence Analysis

An Exact Transformation of Convolutional Kernels Enables Accurate Identification of Sequence Motifs

Kernel-wise difference minimization for convolutional neural network compression in metaverse

HiCNN: a very deep convolutional neural network to better enhance the resolution of Hi-C data

When less is more: sketching with minimizers in genomics

AIKYATAN: mapping distal regulatory elements using convolutional learning on GPU

Understanding Convolutional Neural Networks for Text Classification

Kernel Orthogonality does not necessarily imply a Decrease in Feature Map Redundancy in CNNs: Convolutional Similarity Minimization

An exact transformation for CNN kernel enables accurate sequence motif identification and leads to a potentially full probabilistic interpretation of CNN

Detection of Potential Viral Sequence from Next Generation Sequencing Data Using Convolutional Neural Network

Coding genomes with gapped pattern graph convolutional network

Application of convolutional neural networks in medical images: a bibliometric analysis

Inherently interpretable position-aware convolutional motif kernel networks for biological sequencing data

A Universal Non-Parametric Approach For Improved Molecular Sequence Analysis

A Consolidated Approach to Convolutional Neural Networks and the Kolmogorov Complexity

A simple refined DNA minimizer operator enables 2-fold faster computation

On the rates of convergence for learning with convolutional neural networks

Optimal Inferential Control of Convolutional Neural Networks

Neural Embeddings for Knn Search in Biological Sequence

Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias