RNACG: A Universal RNA Sequence Conditional Generation model based on Flow-Matching

Letian Gao, Zhi John Lu
2024-07-29
Abstract: RNA plays a crucial role in diverse life processes. In contrast to the rapid advancement of protein design methods, the work related to RNA is more demanding. Most current RNA design approaches concentrate on specified target attributes and rely on extensive experimental searches. However, these methods remain costly and inefficient due to practical limitations. In this paper, we characterize all sequence design issues as conditional generation tasks and offer parameterized representations for multiple problems. For these problems, we have developed a universal RNA sequence generation model based on flow matching, namely RNACG. RNACG can accommodate various conditional inputs and is portable, enabling users to customize the encoding network for conditional inputs as per their requirements and integrate it into the generation network. We evaluated RNACG in RNA 3D structure inverse folding, 2D structure inverse folding, family-specific sequence generation, and 5'UTR translation efficiency prediction. RNACG attains superior or competitive performance on these tasks compared with other methods. RNACG exhibits extensive applicability in sequence generation and property prediction tasks, providing a novel approach to RNA sequence design and potential methods for simulation experiments with large-scale RNA sequence data.
Biomolecules, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address several challenging issues in RNA sequence design. Compared to the rapid development of protein design methods, RNA design work is more complex and costly, as most current methods focus on specific target properties and rely on extensive experimental searches. These methods still face inefficiencies and high costs in practical applications. To this end, the paper proposes a general RNA sequence generation model based on Flow Matching, named RNACG (RNA Sequence Conditional Generation). RNACG can handle various conditional inputs and is highly flexible, allowing users to customize the conditional encoding network according to their needs and integrate it into the generation network.

Specifically, RNACG aims to solve the following types of problems:

1. **RNA 3D Structure Inverse Folding**: Generating RNA sequences that can fold into specific 3D structures.
2. **RNA 2D Structure Inverse Folding**: Generating RNA sequences that can fold into specific 2D structures.
3. **Family-Specific Sequence Generation**: Generating sequences that belong to specific RNA families.
4. **5'UTR Translation Efficiency Prediction**: Predicting the translation efficiency of the 5'UTR region.

Through these tasks, RNACG demonstrates its broad applicability and superior performance in sequence generation and property prediction, providing new methods for RNA sequence design and potential means for simulating large-scale RNA sequence data experiments.
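
For readers unfamiliar with flow matching, the sketch below illustrates the general idea on a one-hot encoded RNA sequence. It is not the RNACG implementation: the straight-line path between noise and data, the Gaussian noise prior, and every function name here are assumptions chosen for illustration. In a conditional model, a trained network would take the current state, the time, and an embedding of the condition (for example, a target structure) and predict the velocity; here the known target velocity stands in for that network.

```python
import random

ALPHABET = "ACGU"  # assumed nucleotide ordering, for illustration only


def one_hot(seq):
    """Encode an RNA string as a list of 4-dimensional one-hot rows."""
    return [[1.0 if base == a else 0.0 for a in ALPHABET] for base in seq]


def decode(x):
    """Map each row back to the nucleotide with the largest value."""
    return "".join(ALPHABET[max(range(len(ALPHABET)), key=row.__getitem__)]
                   for row in x)


def sample_noise(length, rng):
    """Draw the starting point x0 from a standard Gaussian (assumed prior)."""
    return [[rng.gauss(0.0, 1.0) for _ in ALPHABET] for _ in range(length)]


def interpolate(x0, x1, t):
    """Point on the straight-line path x_t = (1 - t) * x0 + t * x1."""
    return [[(1.0 - t) * a + t * b for a, b in zip(r0, r1)]
            for r0, r1 in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network on this path: x1 - x0."""
    return [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]


def euler_generate(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = [row[:] for row in x0]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [[xi + dt * vi for xi, vi in zip(rx, rv)] for rx, rv in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "GGCAUCC"  # toy sequence, not taken from the paper
    x1 = one_hot(seq)
    x0 = sample_noise(len(seq), rng)
    u = target_velocity(x0, x1)
    # Stand-in for a trained, condition-aware network: return the true velocity.
    generated = euler_generate(x0, lambda x, t: u)
    print(decode(generated))
```

During training, a network would be fit to `target_velocity` at points produced by `interpolate` for random `t`; at generation time, `euler_generate` would call that network instead of the stand-in.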