Integration of protein and coding sequences enables mutual augmentation of the language model

Heng-Rui Zhao, Meng-Ting Cheng, Jinhua Zhu, Hao Wang, Xiang-Rui Yang, Bo Wang, Yuan-Xin Sun, Ming-Hao Fang, Enhong Chen, Houqiang Li, Shu-Jing Han, Yuxing Chen, Cong-Zhao Zhou
DOI: https://doi.org/10.1101/2024.10.24.620004
2024-10-29
Abstract: Recent language models have significantly accelerated our understanding of massive biological data, using protein or DNA/RNA sequences as a single-language modality. Here we present a dual-language foundation model that integrates both protein and coding sequences (CDS) for pre-training. Compared to the benchmark models, it improves performance by up to ~20% on both protein- and mRNA-related discriminative tasks, and gains the capacity to generate coding sequences de novo with ~50% higher protein yield. Moreover, the model can transfer knowledge from the pre-training data to the upstream 5' untranslated regions. These findings indicate intrinsic correlations between a protein and its CDS, as well as between the coding region and the sequences beyond it. This work provides a new paradigm that leverages a multiple-language foundation model to interpret the hidden context of distinct corpora or biological languages, and it could be further applied to mine as-yet-unknown biological information and correlations beyond the Central Dogma.
Bioinformatics