Rfam 15: RNA families database in 2025

Nancy Ontiveros-Palacios, Emma Cooke, Eric P. Nawrocki, Sandra Triebel, Manja Marz, Elena Rivas, Sam Griffiths-Jones, Anton I Petrov, Alex Bateman, Blake Sweeney
DOI: https://doi.org/10.1101/2024.09.23.614430
2024-09-24
Abstract: The Rfam database, a widely-used repository of non-coding RNA (ncRNA) families, has undergone significant updates in release 15.0. This paper introduces major improvements, including the expansion of Rfamseq to 26,106 genomes, a 76% increase, incorporating the latest UniProt reference proteomes and additional viral genomes. Sixty-five RNA families were enhanced using experimentally determined 3D structures, improving the accuracy of consensus secondary structures and annotations. R-scape covariation analysis was used to refine structural predictions in 26 families. Gene Ontology and Sequence Ontology annotations were comprehensively updated, increasing GO term coverage to 75% of families. The release adds 14 new Hepatitis C Virus RNA families and completes microRNA family synchronisation with miRBase, resulting in 1,603 microRNA families. New data types, including FULL alignments, have been implemented. Integration with APICURON for improved curator attribution and multiple website enhancements further improve user experience. These updates significantly expand Rfam's coverage and improve annotation quality, reinforcing its critical role in RNA research, genome annotation, and the development of machine learning models. Rfam is freely available at https://rfam.org.
Genomics
What problem does this paper attempt to address?
This paper addresses several key problems in the annotation and classification of non-coding RNA (ncRNA) families in the Rfam database:

1. **Expanding the Rfamseq database**:
   - Rfamseq is the core sequence library of the Rfam database. This update expands Rfamseq to 26,106 genomes, a 76% increase, and includes the latest UniProt reference proteomes and additional viral genomes.
   - The updated Rfamseq improves Rfam's coverage of species and viruses, which makes it more useful for genome annotation.
2. **Improving the secondary structure and annotation of RNA families**:
   - Experimentally determined 3D structures were used to improve the consensus secondary structures of 65 RNA families, increasing the accuracy of the structures and annotations.
   - R-scape covariation analysis was used to refine the structure predictions of 26 families, further improving annotation quality.
3. **Updating Gene Ontology (GO) and Sequence Ontology (SO) annotations**:
   - GO and SO annotations were comprehensively updated so that each family has at least one up-to-date SO term, and GO term coverage rose to 75% of families.
   - These updates let Rfam provide more accurate functional information, which helps when training machine-learning models and other bioinformatics tools.
4. **Synchronizing microRNA families with miRBase**:
   - The synchronization of Rfam microRNA families with miRBase is now complete, bringing the total to 1,603 microRNA families.
   - The microRNA families in Rfam are built from the sequences in miRBase, improving the consistency and accuracy of annotation.
5. **Adding new Hepatitis C Virus (HCV) RNA families**:
   - The release adds 14 new HCV RNA families, covering non-coding and coding regions of the viral genome and enriching the resources available for viral RNA research.

Through these improvements, Rfam 15.0 significantly expands its coverage and improves annotation quality, further consolidating its important position in RNA research, genome annotation, and machine-learning model development.
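
To give a feel for what covariation analysis of an alignment looks for, here is a minimal Python sketch. It is an illustration only and not R-scape's method: R-scape uses its own covariation statistics together with a significance test against null alignments, whereas this sketch computes plain mutual information between two alignment columns. The function names and the toy alignment are hypothetical and do not come from Rfam.

```python
import math
from collections import Counter


def column(alignment, index):
    """Return the characters found at one column of a list of aligned sequences."""
    return [seq[index] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j.

    Rows with a gap ('-') in either column are skipped. This is a simplified
    stand-in for a covariation score, not the statistic R-scape reports.
    """
    pairs = [
        (a, b)
        for a, b in zip(column(alignment, i), column(alignment, j))
        if a != "-" and b != "-"
    ]
    n = len(pairs)
    if n == 0:
        return 0.0

    pair_counts = Counter(pairs)
    left_counts = Counter(a for a, _ in pairs)
    right_counts = Counter(b for _, b in pairs)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = left_counts[a] / n
        p_b = right_counts[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment (not Rfam data): columns 0 and 4 change together
# as Watson-Crick pairs, while column 2 is invariant.
toy_alignment = [
    "GCAAC",
    "CCAAG",
    "ACAAU",
    "UCAAA",
]

print(mutual_information(toy_alignment, 0, 4))  # covarying pair of columns
print(mutual_information(toy_alignment, 0, 2))  # invariant column, no covariation
```

In this toy case the compensating changes in columns 0 and 4 give the highest possible score for four equally frequent pairs, while the invariant column carries no covariation signal at all. A conserved base pair would therefore score zero too, which is one reason covariation evidence supports a proposed structure but its absence does not rule one out.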