Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families

Ioanna Kalvari, Joanna Argasinska, Natalia Quinones-Olvera, Eric P. Nawrocki, Elena Rivas, Sean R. Eddy, Alex Bateman, Robert D. Finn, Anton I. Petrov
DOI: https://doi.org/10.1093/nar/gkx1038
IF: 14.9
2017-11-03
Nucleic Acids Research
Abstract: The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We describe new web interface features including faceted text search and R-scape secondary structure visualizations. We discuss a new literature curation workflow and a pipeline for building families based on RNAcentral. There are 236 new families in release 13.0, bringing the total number of families to 2687. The Rfam website is http://rfam.org.
biochemistry & molecular biology
What problem does this paper attempt to address?
This paper aims to improve the scalability and accuracy of the Rfam database by shifting to a genome-centric resource for annotating non-coding RNA (ncRNA) families. Specifically, Rfam 13.0 introduces a genome-centric method that annotates a set of reference genomes with RNA families, and it improves the content and functionality of the database.

### Main Problems and Solutions

1. **Reduce Data Redundancy**:
   - The sequence databases previously used by Rfam (such as ENA's standard and whole-genome shotgun sequence sets) contained a large amount of redundant data, which made searching them and building new families impractical.
   - The new version uses a non-redundant set of reference genomes, reducing redundancy and improving the accuracy and efficiency of annotation.
2. **Improve Annotation Quality**:
   - A new literature curation workflow speeds up the extraction of RNA sequences from the literature and helps ensure the quality of these sequences.
   - The R-scape tool is used to evaluate and improve RNA secondary structure models.
3. **Expand the Number of RNA Families**:
   - 236 new RNA families were added, bringing the total number of families to 2,687.
   - These new families cover multiple types of RNA, including small RNAs (sRNAs), thermoregulators, and riboswitches.
4. **Enhance User Experience**:
   - The website's search function was improved with faceted text search, which lets users browse and compare RNA families across species more conveniently.
   - More detailed sequence summary pages were provided, including download links and links to other related resources.

### Formulas and Technical Details

- **R-scape Analysis**: R-scape is used to evaluate the statistical significance of the covariation that supports an RNA secondary structure.
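For example, in the analysis of the SAM riboswitch, R-scape increased the number of statistically significant base pairs from 19 to 27, indicating that the seed alignment may need to be updated.

Schematically, the evidence for a conserved structure can be written as a log-likelihood ratio summed over the \(N\) candidate base pairs:

\[ \text{score}=\sum_{i = 1}^{N}\log\left(\frac{P(\text{covariation}_i \mid H_1)}{P(\text{covariation}_i \mid H_0)}\right) \]

Here, \(P(\text{covariation}_i \mid H_1)\) and \(P(\text{covariation}_i \mid H_0)\) represent the probabilities of the covariation observed at pair \(i\) under the assumptions of having and not having a conserved structure, respectively. This expression is an illustration rather than R-scape's exact statistic: R-scape scores each pair of alignment columns with a covariation measure (by default based on a G-test) and assigns E-values by comparison with null alignments that preserve the phylogeny but lack structural covariation.

### Conclusion

By shifting to a genome-centric approach, Rfam 13.0 reduced data redundancy and improved the accuracy and efficiency of annotation. In addition, the new literature curation workflow, the RNAcentral-based family-building pipeline, and R-scape made the construction and validation of RNA families more efficient and reliable, providing stronger support for non-coding RNA research.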
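
As a rough illustration of the covariation analysis described under Formulas and Technical Details, the sketch below computes an uncorrected G-test statistic for two alignment columns. It is not R-scape's implementation: the function name `g_test_covariation` and the toy columns are assumptions made for this example, and the sketch omits the background correction and the E-value calibration against null alignments that a real analysis would need.

```python
import math
from collections import Counter


def g_test_covariation(col_i, col_j):
    """Uncorrected G statistic for two alignment columns (hypothetical helper).

    col_i and col_j are equal-length strings holding one residue per sequence.
    Rows where either column is not A, C, G, or U (e.g. gaps) are skipped.
    """
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a in "ACGU" and b in "ACGU"]
    n = len(pairs)
    if n == 0:
        return 0.0

    pair_counts = Counter(pairs)             # observed joint counts
    left = Counter(a for a, _ in pairs)      # marginal counts, column i
    right = Counter(b for _, b in pairs)     # marginal counts, column j

    g = 0.0
    for (a, b), observed in pair_counts.items():
        # Count expected if the two columns varied independently.
        expected = left[a] * right[b] / n
        g += observed * math.log(observed / expected)
    return 2.0 * g


if __name__ == "__main__":
    # Compensatory changes (G-C, C-G, A-U, U-A) give a high score.
    print(g_test_covariation("GGCCAAUU", "CCGGUUAA"))
    # A perfectly conserved G-C pair shows no covariation at all.
    print(g_test_covariation("GGGG", "CCCC"))
```

The second call shows why covariation differs from conservation: an invariant base pair scores zero, because only compensatory substitutions carry statistical evidence that two positions are paired.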