Complete Reference Genome and Pangenome Expand Biologically Relevant Information for Genome-Wide DNA Methylation Analysis Using Short-Read Sequencing and Array Data

Zheng Dong, Joanne Whitehead, Maggie Fu, Julia L. MacIsaac, David H. Rehkopf, Luis Rosero-Bixby, Michael S. Kobor, Keegan Korthauer
DOI: https://doi.org/10.1101/2024.10.07.617116
2024-10-11
Abstract: Background: The new complete telomere-to-telomere human genome assembly, T2T-CHM13, and the first draft of the human pangenome reference provide unique opportunities to update the reference genome for epigenetics investigations and clinical research. However, it is largely unclear how these reference genome updates may impact DNA methylation (DNAm) analysis. Results: Compared to the previous GRCh38 assembly, we found an average increase of 7.4% (range 5.4%-9.9% across samples and sequencing methods) in the number of CpGs genome-wide using T2T-CHM13 with data from four commonly used short-read sequencing DNAm profiling methods. The increase in the number of CpGs facilitated discovery of 88 new differentially methylated CpGs within cancer driver genes in an epigenome-wide association study (EWAS) of colon cancer. Further, by aligning probe sequences from the commonly used and recently released Illumina DNAm arrays to T2T-CHM13 and GRCh38, we showed the enhanced utility of T2T-CHM13 for evaluation of potential probe cross-reactivity (i.e., where probes match multiple regions) and mismatch (i.e., where probes do not perfectly match the target region), resulting in the identification of new and more reproducible sets of unambiguous probes (i.e., probes uniquely mapping to the target region) (HM450K, n = 430,719; EPIC, n = 777,491; EPICv2, n = 859,216). In EWASs of 24 cancer types, an average of 945 additional differentially methylated CpG sites were identified in the new unambiguous probe set rather than in the GRCh38-based unambiguous probe set, with enrichments in cancer driver genes and cancer signaling pathways. Moreover, the pangenome called 4.5% more CpGs on average in short-read sequencing data than T2T-CHM13 and identified cross-population and population-specific unambiguous probes in DNAm arrays, owing to its improved representation of genetic diversity. These additional CpGs overlapped with the promoters and gene bodies of various biologically and medically relevant genes, and pangenome-based unambiguous probes can potentially facilitate the discovery of DNAm alterations in more than 200 cancer driver genes in each cancer type. Conclusions: Use of T2T-CHM13 and pangenome references can benefit epigenome-wide association studies by including CpGs previously unobserved in short-read sequencing data and by improving the identification of unambiguous probes for DNAm arrays, thus expanding biologically relevant information. This study highlights the practical applications of T2T-CHM13 and the pangenome for genome biology and provides a basis for expansion of epigenetics investigations.
Genomics
What problem does this paper attempt to address?
The main question this paper addresses is how the new complete telomere-to-telomere human genome assembly (T2T-CHM13) and the human pangenome reference affect DNA methylation (DNAm) analysis. Specifically, the authors show that these updated reference genomes can:

1. **Increase the number of detectable CpG sites**: With short-read sequencing data, T2T-CHM13 yields on average 7.4% more CpGs genome-wide than GRCh38, and the pangenome calls a further 4.5% on average beyond T2T-CHM13, expanding the biologically relevant information available.
2. **Improve the accuracy of DNA methylation array probes**: Aligning the probe sequences of the Illumina DNAm arrays to T2T-CHM13 and GRCh38 identifies new, more reproducible sets of unambiguously mapped probes. Excluding cross-reactive and mismatched probes improves the reliability and reproducibility of EWAS results.
3. **Discover new differentially methylated CpG sites**: In an epigenome-wide association study (EWAS) of colon cancer, the additional CpGs called with T2T-CHM13 include 88 new differentially methylated CpGs within cancer driver genes, offering new insight into the mechanisms of cancer development.
4. **Expand biomedically relevant gene information**: The additional CpG sites identified with T2T-CHM13 and the pangenome reference lie in the promoters and gene bodies of genes related to cancer signaling pathways and metabolic reprogramming, which helps clarify the pathogenesis of cancer.

### Main conclusions

Using T2T-CHM13 and the pangenome reference benefits epigenome-wide association studies by:

- Increasing the number of detectable CpG sites in short-read sequencing data.
- Improving the identification of uniquely mapping Illumina DNAm array probes and reducing technical errors.
- Discovering more differentially methylated CpG sites in cancer driver genes.
- Expanding biomedically relevant gene information and providing a basis for future clinical diagnosis, prognosis, and treatment.

These improvements make fuller use of existing data and broaden the scope of future epigenetic research.
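The three probe categories defined in the abstract (cross-reactive, mismatch, unambiguous) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `classify_probes`, the hit tuple layout `(probe_id, chrom, pos, n_mismatches)`, and the toy coordinates are all assumptions made for illustration, and a real analysis would take its hits from an aligner run against the chosen reference.

```python
from collections import defaultdict


def classify_probes(hits, targets):
    """Label each probe as 'cross-reactive', 'mismatch', or 'unambiguous'.

    hits    : iterable of (probe_id, chrom, pos, n_mismatches) alignment hits
              (hypothetical layout, assumed for this sketch)
    targets : dict mapping probe_id -> (chrom, pos) of the intended target
    """
    by_probe = defaultdict(list)
    for probe_id, chrom, pos, n_mismatches in hits:
        by_probe[probe_id].append((chrom, pos, n_mismatches))

    labels = {}
    for probe_id, target in targets.items():
        probe_hits = by_probe.get(probe_id, [])
        # Any hit away from the intended target means the probe matches multiple regions.
        off_target = [h for h in probe_hits if (h[0], h[1]) != target]
        # A perfect hit is an on-target alignment with zero mismatches.
        perfect_on_target = any(
            (c, p) == target and m == 0 for c, p, m in probe_hits
        )
        if off_target:
            labels[probe_id] = "cross-reactive"
        elif not perfect_on_target:
            labels[probe_id] = "mismatch"
        else:
            labels[probe_id] = "unambiguous"
    return labels


# Toy input: made-up probe IDs and coordinates, not real array annotations.
hits = [
    ("cg1", "chr1", 100, 0),  # single perfect hit on target
    ("cg2", "chr2", 200, 0),  # perfect hit on target ...
    ("cg2", "chr5", 900, 0),  # ... plus a second hit elsewhere
    ("cg3", "chr3", 300, 2),  # on target, but with two mismatches
]
targets = {
    "cg1": ("chr1", 100),
    "cg2": ("chr2", 200),
    "cg3": ("chr3", 300),
    "cg4": ("chr4", 400),  # no alignment reported at all
}
labels = classify_probes(hits, targets)
unambiguous = sorted(p for p, label in labels.items() if label == "unambiguous")
```

Running the same classification once per reference (for example GRCh38 and T2T-CHM13) and comparing the resulting `unambiguous` sets is one simple way to see how the choice of reference changes which probes are retained.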