Cheng Tan, Zhangyang Gao, Hanqun Cao, Xingran Chen, Ge Wang, Lirong Wu, Jun Xia, Jiangbin Zheng, Stan Z. Li
Abstract: The secondary structure of ribonucleic acid (RNA) is more stable and accessible in the cell than its tertiary structure, making it essential for functional prediction. Although deep learning has shown promising results in this field, current methods suffer from poor generalization and high complexity. In this work, we reformulate the RNA secondary structure prediction as a K-Rook problem, thereby simplifying the prediction process into probabilistic matching within a finite solution space. Building on this innovative perspective, we introduce RFold, a simple yet effective method that learns to predict the most matching K-Rook solution from the given sequence. RFold employs a bi-dimensional optimization strategy that decomposes the probabilistic matching problem into row-wise and column-wise components to reduce the matching complexity, simplifying the solving process while guaranteeing the validity of the output. Extensive experiments demonstrate that RFold achieves competitive performance and about eight times faster inference efficiency than the state-of-the-art approaches. The code and Colab demo are available at http://github.com/A4Bio/RFold.
What problem does this paper attempt to address?
This paper targets two major challenges in RNA secondary structure prediction: **poor generalization** and **high computational complexity**. Although deep-learning methods have shown promise in this field, existing methods still have the following problems:
1. **Insufficient generalization ability**: Existing methods perform poorly on new data sets, especially on long sequences or on data from other RNA families.
2. **High computational complexity**: Existing methods usually require complex optimization processes, resulting in low inference efficiency.
To overcome these challenges, the authors reformulate RNA secondary structure prediction as a **K-Rook problem**. This reformulation simplifies the prediction process and leads to **RFold**, a simple and effective model that predicts RNA secondary structure efficiently.
### Specific problem description
1. **Importance of RNA secondary structure**:
- The secondary structure of RNA is more stable and more accessible than its tertiary structure, and it is crucial for function prediction.
- Experimental methods such as X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy can determine the secondary structure of RNA, but they have low throughput and are costly.
2. **Limitations of computational methods**:
- Computational methods are favored for their efficiency. The mainstream ones fall into two categories: comparative sequence analysis and single-sequence folding algorithms.
- Comparative sequence analysis relies on conservation among homologous sequences, but its development is hindered by the limited number of known RNA families.
- Single-sequence folding algorithms usually use dynamic programming (DP) to minimize energy. This requires base pairs to form a nested structure and therefore ignores biologically important non-nested structures such as pseudoknots.
3. **Introduction of deep-learning methods**:
- To overcome the limitations of energy-based methods, researchers have introduced deep-learning techniques.
- However, existing deep-learning methods usually require complex constrained optimization, which not only increases computational complexity but may also lead to sub-optimal or invalid solutions.
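The nested-structure requirement of DP-based folding mentioned above can be made concrete with a small check. The following is a minimal sketch, not code from the paper: the function name `has_pseudoknot` and the representation of a structure as a list of 0-based index pairs are assumptions made for illustration. Two base pairs (i, j) and (k, l) cross, and so form a pseudoknot, when i < k < j < l.

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l.

    `pairs` is a list of (i, j) index pairs; this representation is an
    assumption for illustration, not the paper's data format.
    """
    # Normalize each pair so the smaller index comes first, then sort by it.
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(ordered):
        # Later pairs start at or after i, so only one crossing order is possible.
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return True
    return False


# A fully nested stem: each pair sits inside the previous one.
nested = [(0, 9), (1, 8), (2, 7)]
# Two pairs that interleave, which a nested-only DP cannot represent.
crossing = [(0, 5), (3, 8)]

print(has_pseudoknot(nested), has_pseudoknot(crossing))
```

A structure for which this check returns True lies outside the search space of a DP algorithm that assumes nesting, which is the limitation the summary describes.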
### Solutions
1. **Introduction of the K-Rook problem**:
- The authors reformulate RNA secondary structure prediction as a K-Rook problem: placing K non-attacking rooks on an L×L chessboard, where L is the sequence length, so that they form a symmetric pattern.
- This reformulation turns prediction into a probabilistic matching problem over a finite solution space, thereby reducing computational complexity.
2. **Proposal of the RFold model**:
- RFold adopts a bi-dimensional optimization strategy that decomposes the probabilistic matching problem into row-wise and column-wise components, further simplifying the solving process.
- In this way, RFold achieves efficient prediction while guaranteeing the validity of the output.
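To illustrate the two ideas above, here is a minimal sketch of how a row-wise and column-wise decomposition could produce a valid board. It is an illustration under stated assumptions, not RFold's actual implementation: the function names `row_col_argmax` and `is_valid_k_rook`, the `threshold` parameter, the exclusion of the diagonal, and the toy score matrix are all hypothetical. A cell is kept only when it is the best choice in both its row and its column, so no two selected cells can share a row or a column.

```python
def row_col_argmax(scores, threshold=0.0):
    """Keep cell (i, j) only if it is the argmax of row i AND of column j.

    `scores` is assumed to be a symmetric L x L list of lists. The threshold
    and the diagonal exclusion are assumptions made for this sketch.
    """
    n = len(scores)
    # Best column for each row, and best row for each column.
    row_best = [max(range(n), key=lambda j: scores[i][j]) for i in range(n)]
    col_best = [max(range(n), key=lambda i: scores[i][j]) for j in range(n)]
    board = [[0] * n for _ in range(n)]
    for i in range(n):
        j = row_best[i]
        # A base cannot pair with itself, and both directions must agree.
        if i != j and col_best[j] == i and scores[i][j] > threshold:
            board[i][j] = 1
    return board


def is_valid_k_rook(board):
    """Check the constraints: symmetric, at most one rook per row and column."""
    n = len(board)
    symmetric = all(board[i][j] == board[j][i] for i in range(n) for j in range(n))
    rows_ok = all(sum(row) <= 1 for row in board)
    cols_ok = all(sum(board[i][j] for i in range(n)) <= 1 for j in range(n))
    return symmetric and rows_ok and cols_ok


# Toy symmetric score matrix for a length-4 sequence (made-up numbers).
scores = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.8, 0.3],
    [0.2, 0.8, 0.0, 0.1],
    [0.9, 0.3, 0.1, 0.0],
]
board = row_col_argmax(scores)
print(board, is_valid_k_rook(board))
```

Because each row has exactly one argmax, the output satisfies the non-attacking constraint by construction rather than through a separate constrained optimization step, which is the kind of simplification the summary attributes to the decomposition.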
### Experimental results
1. **Standard RNA secondary structure prediction**:
- On the RNAStralign test set, RFold performs well on all metrics, especially precision, which is about 8% higher than that of the state-of-the-art methods.
2. **Generalization ability assessment**:
- Experiments on the ArchiveII data set show that RFold generalizes well: its F1 score reaches 0.921, significantly better than other methods.
3. **Large-scale benchmark assessment**:
- On the bpRNA data set, RFold improves the F1 score by 4.0% compared to the previous state-of-the-art method SPOT-RNA.
4. **Long-range interaction prediction**:
- In long-range base-pairing prediction, RFold performs well and is significantly better than UFold.
5. **Cross-family assessment**:
- On the bpRNA-new data set, RFold reaches an F1 score of 0.651, second only to the thermodynamics-based method Contrafold.
6. **Pseudoknot prediction**:
- In the prediction of RNA structures containing pseudoknots, RFold also performs well, showing its advantage in dealing with complex structures.
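The precision and F1 figures quoted above are scores over predicted base pairs. As a rough sketch of how such scores are commonly computed, the following compares a predicted set of pairs with a reference set by exact match. The function name `pair_metrics` and the toy pairs are hypothetical, and the benchmarks' exact evaluation protocol is not described in this summary, so this may differ from what the authors used.

```python
def pair_metrics(predicted, reference):
    """Exact-match precision, recall, and F1 over base pairs.

    Pairs are treated as unordered, so (i, j) and (j, i) count as the same
    pair. This exact-match rule is an assumption for illustration.
    """
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    tp = len(pred & ref)  # pairs present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy example with made-up pairs: 2 of 3 predictions are correct,
# and 2 of 4 reference pairs are recovered.
predicted = [(0, 9), (1, 8), (2, 6)]
reference = [(0, 9), (1, 8), (2, 7), (3, 6)]
precision, recall, f1 = pair_metrics(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here precision is 2/3, recall is 1/2, and F1 is their harmonic mean, 4/7.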
In summary, by reformulating RNA secondary structure prediction as a K-Rook problem and proposing the RFold model, this paper addresses the poor generalization and high computational complexity of existing methods.