A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language

Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun
2024-07-22
Abstract: Predicting a gene's function from its DNA sequence is a fundamental challenge in biology. Many deep learning models have been proposed to embed DNA sequences and predict their enzymatic function, leveraging information in public databases that link DNA sequences to an enzymatic function label. However, much of the scientific community's knowledge of biological function is not represented in these categorical labels, and is instead captured in unstructured text descriptions of mechanisms, reactions, and enzyme behavior. These descriptions are often stored alongside DNA sequences in biological databases, albeit in an unstructured manner. Deep learning models that predict enzymatic function are likely to benefit from incorporating this multi-modal data encoding scientific knowledge of biological function. There is, however, no dataset designed for machine learning algorithms to leverage this multi-modal information. Here we propose a novel dataset and benchmark suite that enables the exploration and development of large multi-modal neural network models on gene DNA sequences and natural language descriptions of gene function. We present baseline performance on benchmarks for both unsupervised and supervised tasks that demonstrate the difficulty of this modeling objective, while demonstrating the potential benefit of incorporating multi-modal data types in function prediction compared to DNA sequences alone. Our dataset is at: https://hoarfrost-lab.github.io/BioTalk/.
Genomics, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is **predicting the function of enzymes from gene sequences**, a fundamental challenge in biology. Many deep-learning models have been proposed for embedding DNA sequences and predicting their enzymatic function, but these models rely mainly on categorical labels in public databases, and those labels cannot fully represent the scientific community's knowledge of biological function. Many descriptions of mechanisms, reactions, and enzyme behavior exist as unstructured text in biological databases, and the authors argue that this information is likely to benefit enzyme function prediction. To make use of this multi-modal data (i.e., DNA sequences and natural-language descriptions), the authors propose a new dataset and benchmark suite intended to support the exploration and development of large multi-modal neural-network models on gene DNA sequences and natural-language descriptions. The dataset supports benchmarking for both unsupervised and supervised tasks, and the baseline results show the potential benefit of combining multi-modal data types for function prediction compared to using DNA sequences alone. Specifically, the key contributions of the paper include:

1. **Novel dataset**: A comprehensive dataset that pairs DNA sequences with their corresponding functional descriptions, filling a key gap in existing resources.
2. **Multi-modal application**: The dataset supports the development of multi-modal language models that can generate detailed natural-language descriptions from DNA sequences.
3. **Unimodal and multi-modal benchmarking**: In addition to supporting multi-modal applications, the dataset provides benchmarks for both unimodal and multi-modal models, including encoder-only transformer models pre-trained on DNA sequences to enhance their performance on various tasks.
4. **Impact**: The dataset supports the creation of DNA-language models, which can be applied to function prediction, sequence "annotation", and natural-language design of new genes, improving the interpretability and practicality of genomic data.

Through this research, the authors hope to advance multi-modal learning frameworks that benefit not only biology but also the broader machine-learning community, by offering new insights into handling heterogeneous data and thereby improving the generalization ability of models.
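To make the idea of pairing DNA sequences with functional descriptions more concrete, the following is a minimal sketch in Python. It is not the authors' data format or model: the `PairedRecord` class, its field names, the toy record, and the vocabulary are all hypothetical, and simple k-mer frequencies and bag-of-words counts stand in for the learned DNA and text embeddings a real multi-modal model would use. The sketch only illustrates how a DNA-only feature vector differs from one that also carries information from the text description.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PairedRecord:
    """One hypothetical dataset entry; field names are illustrative only."""
    dna_sequence: str  # gene sequence over the alphabet A, C, G, T
    description: str   # free-text description of the gene's function
    ec_number: str     # categorical enzyme function label


def kmer_features(sequence: str, k: int = 2) -> list[float]:
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def text_features(description: str, vocabulary: list[str]) -> list[float]:
    """Return a bag-of-words count vector over a fixed vocabulary."""
    tokens = Counter(description.lower().split())
    return [float(tokens[word]) for word in vocabulary]


def fused_features(record: PairedRecord, vocabulary: list[str], k: int = 2) -> list[float]:
    """Concatenate DNA and text features (a simple late-fusion stand-in)."""
    return kmer_features(record.dna_sequence, k) + text_features(record.description, vocabulary)


# Toy record: the sequence, description, and label are invented for illustration.
record = PairedRecord(
    dna_sequence="ATGGCGTTAGC",
    description="catalyzes the hydrolysis of ester bonds",
    ec_number="3.1.1.1",
)
vocabulary = ["hydrolysis", "ester", "kinase"]  # hypothetical vocabulary

dna_only = kmer_features(record.dna_sequence)   # unimodal input
fused = fused_features(record, vocabulary)      # multi-modal input
print(len(dna_only), len(fused))
```

A classifier trained on `dna_only` would see only sequence composition, while one trained on `fused` would also see terms from the description. The benchmarks in the paper compare these two settings with far richer models than this sketch.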