Expanded methylome and quantitative trait loci detection by long-read profiling of personal DNA

Cristian Groza, Bing Ge, Warren Cheung, Tomi Pastinen, Guillaume Bourque
DOI: https://doi.org/10.1101/2024.03.17.585420
2024-03-19
Abstract: Structural variants (SVs) are omnipresent in human DNA, yet their genotype and methylation status are rarely characterized due to previous limitations in genome assembly and detection of modified nucleotides. Because of this, the extent to which these regions act as quantitative trait loci is also largely unknown. Here, we generated a pangenome graph summarizing the SVs in 782 de novo assembled genomes obtained from the Genomic Answers for Kids rare disease cohort, which captures 14.6 million CpGs in DNA segments that are absent from the CHM13v2 assembly (SV-CpGs), expanding their number by 43.6%. Next, using 435 methylomes from the same samples, we genotyped a total of 7.99 million SV-CpGs, of which 5.18 million (64.8%) were found to be methylated (SV-5mCpGs) in at least one sample. To understand the provenance and impact of these novel SV-CpGs, we examined their sequence context and found that non-repeat sequences were the leading contributor of SV-CpGs (3.3 × 10^6), followed by centromeric satellites (1.58 × 10^6), simple repeats (1.19 × 10^6), Alus (0.67 × 10^6), satellites (0.39 × 10^6), L1s (0.27 × 10^6), and SVAs (0.19 × 10^6). Meanwhile, the methylation rate of SV-CpGs was highest in repeat sequences. Moreover, in contrast to Alus and L1s, centromeric satellites, simple repeats and SVA sequences were overrepresented in SV-5mCpGs compared to reference CpGs. Similarly, we established that non-reference CpGs were more than twice as likely (37% vs. 15%) to be variable, showing intermediate methylation levels in the population. To explore whether SVs detected in this pangenome are potentially causal for functional variation in the population, we measured methylation quantitative trait loci (SV-mQTLs) using CHM13v2 as a backbone. This revealed 230,464 methylation bins within 100 kbp of a common SV (>5% MAF) showing significant association (at 5% FDR) with methylation variation.
Finally, we assessed how many of these SV-mQTLs were the leading QTL variant compared to SNVs and identified 65,659 methylation bins (28.5%) where the leading variant was an SV. In conclusion, our results demonstrate that graph genome references providing full SV structures, in combination with the associated methylation variation, reveal tens of thousands of QTLs that are more accurately mapped in personal genomes, underscoring the importance of assembly-based analyses of human traits.
Bioinformatics