Diverse Database and Machine Learning Model to Narrow the Generalization Gap in RNA Structure Prediction
Albéric A. de Lajarte, Yves J. Martin des Taillades, Colin Kalicki, Federico Fuchs Wightman, Justin Aruda, Dragui Salazar, Matthew F. Allan, Casper L’Esperance-Kerckhoff, Alex Kashi, Fabrice Jossinet, Silvi Rouskin
DOI: https://doi.org/10.1101/2024.01.24.577093
2024-04-03
Abstract: Understanding the macromolecular structures of proteins and nucleic acids is critical for discerning their functions and biological roles. Advanced techniques (crystallography, NMR, and cryo-EM) have facilitated the determination of over 180,000 protein structures, all cataloged in the Protein Data Bank (PDB). This comprehensive repository has been pivotal in developing deep learning algorithms for predicting protein structures directly from sequences. In contrast, RNA structure prediction has lagged and suffers from a scarcity of structural data. Here, we present the secondary structure models of 1098 pri-miRNAs and 1456 human mRNA regions determined through chemical probing. We develop a novel deep learning architecture, inspired by the Evoformer model of AlphaFold and traditional architectures for secondary structure prediction. This new model, eFold, was trained on our newly generated database and over 300,000 secondary structures from multiple sources. We benchmark eFold on two new test sets of long and diverse RNA structures and show that our dataset and new architecture contribute to improved prediction performance compared to similar state-of-the-art methods. Altogether, our results reveal that merely expanding the database size is insufficient for generalization across families, whereas incorporating a greater diversity and complexity of RNA structures enhances model performance.
Biochemistry