Large-scale discovery of novel neurodevelopmental disorder-related genes through a unified analysis of single-nucleotide and copy number variants

Kohei Hamanaka, Noriko Miyake, Takeshi Mizuguchi, Satoko Miyatake, Yuri Uchiyama, Naomi Tsuchida, Futoshi Sekiguchi, Satomi Mitsuhashi, Yoshinori Tsurusaki, Mitsuko Nakashima, Hirotomo Saitsu, Kohei Yamada, Masamune Sakamoto, Hiromi Fukuda, Sachiko Ohori, Ken Saida, Toshiyuki Itai, Yoshiteru Azuma, Eriko Koshimizu, Atsushi Fujita, Biray Erturk, Yoko Hiraki, Gaik-Siew Ch'ng, Mitsuhiro Kato, Nobuhiko Okamoto, Atsushi Takata, Naomichi Matsumoto
DOI: https://doi.org/10.1186/s13073-022-01042-w
2022-04-26
Abstract:
Background: Previous large-scale studies of de novo variants identified a number of genes associated with neurodevelopmental disorders (NDDs); however, it was also predicted that many NDD-associated genes await discovery. Such genes can be discovered by integrating copy number variants (CNVs), which have not been fully considered in previous studies, and by increasing the sample size.
Methods: We first constructed a model estimating the rates of de novo CNVs per gene from several factors such as gene length and number of exons. Second, we compiled a comprehensive list of de novo single-nucleotide variants (SNVs) in 41,165 individuals and de novo CNVs in 3,675 individuals with NDDs by aggregating our own and publicly available datasets, including denovo-db and the Deciphering Developmental Disorders study data. Third, by summing the de novo CNV rates that we estimated and the previously established SNV rates, we assessed gene-based enrichment of de novo deleterious SNVs and CNVs in the 41,165 cases. Significantly enriched genes were further prioritized according to their similarity to known NDD genes using a deep learning model that considers functional characteristics (e.g., gene ontology and expression patterns).
Results: We identified a total of 380 genes achieving statistical significance (5% false discovery rate), including 31 genes affected by de novo CNVs. Of the 380 genes, 52 have not previously been reported as NDD genes, and the de novo CNV data contributed to the significance of three genes (GLTSCR1, MARK2, and UBR3). Among the 52 genes, we reasonably excluded 18 genes [a number almost identical to the theoretically expected number of false positives (i.e., 380 × 0.05 = 19)] given their constraints against deleterious variants and extracted 34 "plausible" candidate genes. Their validity as NDD genes was consistently supported by their similarity in function and gene expression patterns to known NDD genes. Quantifying the overall similarity using deep learning, we identified 11 high-confidence (>90% true-positive probability) candidate genes: HDAC2, SUPT16H, HECTD4, CHD5, XPO1, GSK3B, NLGN2, ADGRB1, CTR9, BRD3, and MARK2.
Conclusions: We identified dozens of new candidates for NDD genes. Both the methods and the resources developed here will contribute to the further identification of novel NDD-associated genes.
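The gene-based enrichment step and the 5% false discovery rate threshold described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the statistical test, so the code below assumes a one-sided Poisson test against an expected count of 2 × cases × (SNV rate + CNV rate), followed by Benjamini-Hochberg adjustment; the function names, the factor of 2, and all rates in the usage note are hypothetical and are not taken from the paper.

```python
import math


def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k    # P(X = k) from P(X = k - 1)
        cdf += term
    # Note: 1 - cdf loses precision for extremely small p-values.
    return max(0.0, 1.0 - cdf)


def enrichment_p(observed, mu_snv, mu_cnv, n_cases):
    """One-sided p-value for an excess of de novo variants in one gene.

    Assumption: per-gene SNV and CNV rates (per chromosome, per generation)
    are simply summed, and the expected count is 2 * n_cases * total rate.
    """
    expected = 2 * n_cases * (mu_snv + mu_cnv)
    return poisson_upper_tail(observed, expected)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def expected_false_discoveries(n_significant, fdr):
    """Expected number of false positives among significant genes."""
    return n_significant * fdr
```

With made-up inputs, a call such as `enrichment_p(observed=8, mu_snv=1e-5, mu_cnv=2e-6, n_cases=41165)` tests eight observed variants against an expected count of about one, and `expected_false_discoveries(380, 0.05)` reproduces the abstract's arithmetic of roughly 19 expected false positives among 380 significant genes.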