G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data

Farica Zhuang, Danielle Gutman, Nathaniel Islas, Bryan B Guzman, Alli Jimenez, San Jewell, Nicholas J Hand, Katherine L Nathanson, Daniel Dominguez, Yoseph Barash
DOI: https://doi.org/10.1101/2024.10.01.616124
2024-10-03
Abstract: RNA G-quadruplexes (rG4s) are key regulatory elements in gene expression, yet the effects of genetic variants on rG4 formation remain underexplored. Here, we introduce G4mer, an RNA language model that predicts rG4 formation and evaluates the effects of genetic variants across the transcriptome. G4mer significantly improves accuracy over existing methods, highlighting sequence length and flanking motifs as important rG4 features. Applying G4mer to 5' untranslated region (UTR) variations, we identify variants in breast cancer-associated genes that alter rG4 formation and validate their impact on structure and gene expression. These results demonstrate the potential of integrating computational models with experimental approaches to study rG4 function, especially in diseases where non-coding variants are often overlooked. To support broader applications, G4mer is available as both a web tool and a downloadable model.
Biology
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper primarily addresses the following issues:

1. **Predicting the Formation of RNA G-Quadruplexes (rG4s)**:
   - Proposes G4mer, a Transformer-based RNA language model, to predict rG4 formation across the entire transcriptome and to assess the impact of genetic variants on rG4 formation.
   - G4mer significantly outperforms existing methods across various datasets and experimental techniques.

2. **Identifying rG4 Subtypes**:
   - Beyond predicting rG4 formation, G4mer distinguishes rG4 subtypes (such as long loops and bulges), and it outperforms existing models on this task as well.

3. **Analyzing the Impact of Genetic Variants on rG4 Structure**:
   - Uses G4mer to analyze single nucleotide variants (SNVs) across the entire transcriptome, particularly rG4-disruptive variants in the 5' and 3' untranslated regions (UTRs), and assesses their functional significance.
   - Finds that rG4-disruptive variants show significant signals of negative selection, indicating that these structures have important biological functions.

4. **Exploring the Association Between rG4s and Disease**:
   - Applying G4mer to variants in breast cancer-related genes, the paper finds that some rG4-disruptive variants are associated with breast cancer and experimentally validates their impact on protein expression.
   - Notably, significant rG4-altering variants are found in the EPN3 and MSH6 genes, which may affect gene expression and protein function, thereby influencing cancer susceptibility.

In summary, the paper aims to develop an efficient and accurate tool for predicting rG4 structures and to use this tool to uncover the potential role of rG4 structures in disease.
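To make the notion of an rG4-disruptive variant concrete, here is a minimal Python sketch. It is not G4mer and does not use its model or API. It relies on the conventional sequence heuristic for a canonical G-quadruplex (four runs of at least three guanines separated by loops of 1 to 7 nucleotides), which is the kind of pattern-matching baseline that learned models aim to improve on. The function names, the example sequence, and the 1-to-7 loop bound are illustrative assumptions, and the heuristic by design misses non-canonical subtypes such as long loops and bulges.

```python
import re

# Conventional canonical G4 motif (an assumption for illustration, not G4mer):
# four G-runs of length >= 3, separated by loops of 1-7 nucleotides.
CANONICAL_RG4 = re.compile(
    r"G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}"
)


def has_canonical_rg4(seq: str) -> bool:
    """Return True if the sequence contains a canonical rG4 motif.

    DNA input is accepted: T is converted to U before matching.
    """
    rna = seq.upper().replace("T", "U")
    return CANONICAL_RG4.search(rna) is not None


def apply_snv(seq: str, pos: int, alt: str) -> str:
    """Return a copy of seq with the base at 0-based position pos set to alt."""
    if not 0 <= pos < len(seq):
        raise IndexError(f"position {pos} is outside the sequence")
    return seq[:pos] + alt + seq[pos + 1:]


def snv_disrupts_motif(seq: str, pos: int, alt: str) -> bool:
    """True if the reference has a canonical motif and the variant does not."""
    return has_canonical_rg4(seq) and not has_canonical_rg4(apply_snv(seq, pos, alt))


if __name__ == "__main__":
    # Hypothetical 5' UTR fragment with four GGG runs (not taken from the paper).
    ref = "AUGGGAGGGCUGGGUAGGGAC"
    print(has_canonical_rg4(ref))            # motif present in the reference
    print(snv_disrupts_motif(ref, 7, "A"))   # G>A inside the second G-run
    print(snv_disrupts_motif(ref, 5, "C"))   # A>C inside the first loop
```

In this toy example, a substitution inside a G-run removes the motif, while a substitution in a loop leaves it intact. A regex like this only gives a yes/no answer on canonical motifs; it cannot score partial effects or use flanking sequence context, which the abstract highlights as an important rG4 feature.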