A semisupervised model to predict regulatory effects of genetic variants at single nucleotide resolution using massively parallel reporter assays

Zikun Yang,Chen Wang,Stephanie Erjavec,Lynn Petukhova,Angela Christiano,Iuliana Ionita-Laza
DOI: https://doi.org/10.1093/bioinformatics/btab040
IF: 5.8
2021-01-30
Bioinformatics
Abstract:Abstract Motivation Predicting regulatory effects of genetic variants is a challenging but important problem in functional genomics. Given the relatively low sensitivity of functional assays, and the pervasiveness of class imbalance in functional genomic data, popular statistical prediction models can sharply underestimate the probability of a regulatory effect. We describe here the presence-only model (PO-EN), a type of semisupervised model, to predict regulatory effects of genetic variants at sequence-level resolution in a context of interest by integrating a large number of epigenetic features and massively parallel reporter assays (MPRAs). Results Using experimental data from a variety of MPRAs we show that the presence-only model produces better calibrated predicted probabilities and has increased accuracy relative to state-of-the-art prediction models. Furthermore, we show that the predictions based on pretrained PO-EN models are useful for prioritizing functional variants among candidate eQTLs and significant SNPs at GWAS loci. In particular, for the costimulatory locus, associated with multiple autoimmune diseases, we show evidence of a regulatory variant residing in an enhancer 24.4 kb downstream of CTLA4, with evidence from capture Hi-C of interaction with CTLA4. Furthermore, the risk allele of the regulatory variant is on the same risk increasing haplotype as a functional coding variant in exon 1 of CTLA4, suggesting that the regulatory variant acts jointly with the coding variant leading to increased risk to disease. Availability and implementation The presence-only model is implemented in the R package ‘PO.EN’, freely available on CRAN. A vignette describing a detailed demonstration of using the proposed PO-EN model can be found on github at https://github.com/Iuliana-Ionita-Laza/PO.EN/ Supplementary information Supplementary data are available at Bioinformatics online.
biochemical research methods,biotechnology & applied microbiology,mathematical & computational biology
What problem does this paper attempt to address?