Abstract: Predicting gene function from its DNA sequence is a fundamental challenge in biology. Many deep learning models have been proposed to embed DNA sequences and predict their enzymatic function, leveraging information in public databases linking DNA sequences to an enzymatic function label. However, much of the scientific community's knowledge of biological function is not represented in these categorical labels, and is instead captured in unstructured text descriptions of mechanisms, reactions, and enzyme behavior. These descriptions are often captured alongside DNA sequences in biological databases, albeit in an unstructured manner. Deep learning models that predict enzymatic function are likely to benefit from incorporating this multi-modal data encoding scientific knowledge of biological function. There is, however, no dataset designed for machine learning algorithms to leverage this multi-modal information. Here we propose a novel dataset and benchmark suite that enables the exploration and development of large multi-modal neural network models on gene DNA sequences and natural language descriptions of gene function. We present baseline performance on benchmarks for both unsupervised and supervised tasks that demonstrate the difficulty of this modeling objective, while demonstrating the potential benefit of incorporating multi-modal data types in function prediction compared to DNA sequences alone. Our dataset is available at https://hoarfrost-lab.github.io/BioTalk/

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is **predicting the function of enzymes from gene sequences**, which is a fundamental challenge in biology. Although many deep-learning models have been proposed for embedding DNA sequences and predicting their enzyme functions, these models mainly rely on classification labels in public databases, and these labels cannot fully represent the scientific community's knowledge of biological functions. Many descriptions of mechanisms, reactions, and enzyme behaviors exist as unstructured text in biological databases, and this information is likely to help improve enzyme function prediction. To make full use of these multi-modal data (i.e., DNA sequences and natural-language descriptions), the authors propose a new dataset and benchmarking suite, aiming to promote the exploration and development of large multi-modal neural-network models on gene DNA sequences and natural-language descriptions. The dataset supports benchmarking for both unsupervised and supervised tasks, and the baseline results show the potential advantage of combining multi-modal data types for function prediction compared to using DNA sequences alone.

Specifically, the key contributions of the paper include:

1. **Novel dataset**: Provides a comprehensive dataset that pairs DNA sequences with their corresponding functional descriptions, filling a key gap in existing resources.
2. **Multi-modal application**: The dataset promotes the development of multi-modal language models that can generate detailed natural-language descriptions from DNA sequences.
3. **Unimodal and multi-modal benchmarking**: In addition to supporting multi-modal applications, the dataset provides unimodal and multi-modal model benchmarks, including encoder-only transformer models pre-trained on DNA sequences to enhance their performance on various tasks.
4. **Impact**: The dataset promotes the creation of DNA-language models, which can be used in function prediction, sequence "annotation", and natural-language design of new genes, improving the interpretability and practicality of genomic data.

Through this research, the authors hope to promote the development of multi-modal learning frameworks, which benefit the field of biology and can also give the broader machine-learning community new insights into handling heterogeneous data, thereby improving the generalization ability of models.
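The pairing of DNA sequences with free-text functional descriptions summarized above can be pictured with a small sketch. This is only an illustration under stated assumptions: the class `PairedRecord`, its field names, the helper functions, and the toy values are hypothetical and do not reflect the dataset's actual schema or file format.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class PairedRecord:
    """Hypothetical layout for one DNA/text pair (not the dataset's real schema)."""
    dna_sequence: str                      # gene DNA sequence
    description: str                       # unstructured text describing function
    function_label: Optional[str] = None   # categorical label, if one exists


def is_valid_dna(seq: str) -> bool:
    """Return True if the sequence is non-empty and uses only A, C, G, T, or N."""
    return len(seq) > 0 and set(seq.upper()) <= set("ACGTN")


def to_unimodal_example(rec: PairedRecord) -> Tuple[str, Optional[str]]:
    """DNA-only view: (sequence, label), as a sequence classifier would consume it."""
    return rec.dna_sequence.upper(), rec.function_label


def to_multimodal_example(rec: PairedRecord) -> Dict[str, str]:
    """Paired view: sequence plus its text description, for a DNA-language model."""
    return {"dna": rec.dna_sequence.upper(), "text": rec.description.strip()}


# Toy placeholder values, invented purely for illustration.
record = PairedRecord(
    dna_sequence="atggctaaa",
    description=" Placeholder free-text description of enzyme function. ",
    function_label="LABEL_A",
)

unimodal = to_unimodal_example(record)
multimodal = to_multimodal_example(record)
```

The two views mirror the unimodal versus multi-modal benchmarking contrast: the first discards the text description, while the second keeps it next to the sequence.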

A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language

A sandbox for prediction and integration of DNA, RNA, and proteins in single cells

BEND: Benchmarking DNA Language Models on biologically meaningful tasks

Benchmarking DNA Foundation Models for Genomic Sequence Classification

DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Deciphering the Language of Protein-DNA Interactions: A Deep Learning Approach Combining Contextual Embeddings and Multi-Scale Sequence Modeling

GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences

Diverse Genomic Embedding Benchmark for functional evaluation across the tree of life

ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab

Multimodal learning of noncoding variant effects using genome sequence and chromatin structure

Predicting the sequence specificities of DNA-binding proteins by DNA Fine-tuned Language Model with decaying learning rates

iDNA-ABF: Multi-Scale Deep Biological Language Learning Model for the Interpretable Prediction of DNA Methylations

COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models

Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings

Employing bimodal representations to predict DNA bendability within a self-supervised pre-trained framework

D2VCB: A Hybrid Deep Neural Network for the Prediction of in-vivo Protein-DNA Binding from Combined DNA Sequence

Multi-modal Transfer Learning between Biological Foundation Models

Time series-based hybrid ensemble learning model with multivariate multidimensional feature coding for DNA methylation prediction

The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling

Does your model understand genes? A benchmark of gene properties for biological and text models