Complex genetic variation in nearly complete human genomes
Glennis A. Logsdon, Peter Ebert, Peter A. Audano, Mark Loftus, David Porubsky, Jana Ebler, Feyza Yilmaz, Pille Hallast, Timofey Prodanov, DongAhn Yoo, Carolyn A. Paisie, William T. Harvey, Xuefang Zhao, Gianni V. Martino, Mir Henglin, Katherine M. Munson, Keon Rabbani, Chen-Shan Chin, Bida Gu, Hufsah Ashraf, Olanrewaju Austine-Orimoloye, Parithi Balachandran, Marc Jan Bonder, Haoyu Cheng, Zechen Chong, Jonathan Crabtree, Mark Gerstein, Lisbeth A Guethlein, Patrick Hasenfeld, Glenn Hickey, Kendra Hoekzema, Sarah E Hunt, Matthew Jensen, Yunzhe Jiang, Sergey Koren, Youngjun Kwon, Chong Li, Heng Li, Jiaqi Li, Paul J Norman, Keisuke K. Oshima, Benedict Paten, Adam M. Phillippy, Nicholas R Pollock, Tobias Rausch, Mikko Rautiainen, Stephan Scholz, Yuwei Song, Arda Soylev, Arvis Sulovari, Likhitha Surapaneni, Vasiliki Tsapalou, Weichen Zhou, Ying Zhou, Qihui Zhu, Michael C. Zody, Ryan E. Mills, Scott E. Devine, Xinghua Shi, Mike E Talkowski, Mark J.P. Chaisson, Alexander T Dilthey, Miriam K. Konkel, Jan O. Korbel, Charles Lee, Christine R. Beck, Evan E. Eichler, Tobias Marschall
DOI: https://doi.org/10.1101/2024.09.24.614721
2024-09-25
Abstract: Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here, we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (130 Mbp median continuity), closing 92% of all previous assembly gaps and reaching telomere-to-telomere (T2T) status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8, and AMY1/AMY2, and fully resolve 1,852 complex structural variants (SVs). In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite high-order repeat (HOR) array length and characterize the pattern of mobile element insertions into α-satellite HOR arrays. While most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference to a median quality value (QV) of 45. Using this approach, 26,115 SVs per sample are detected, substantially increasing the number of SVs now amenable to downstream disease association studies.
Genomics