EVALUATION OF THE ASSOCIATION BETWEEN THE PATHOGENESIS OF TYPE 2 DIABETES AND GENOME-WIDE COPY NUMBER VARIATIONS USING THE LASSO METHOD

Wei Zhang, Yuan Ji, Caihong Huang, Jun Ying, Zhengqiang Ye, Guoyou Qin, Naiqing Zhao
DOI: https://doi.org/10.19193/0393-6384_2018_4_176
2018-01-01
Acta medica mediterranea
Abstract: Objective: Type 2 diabetes (T2D) is a complex disease caused by a combination of genetic and environmental factors. Although many loci, including genes and single nucleotide polymorphisms (SNPs), have been identified as risk variants for T2D, they explain only approximately 10% of its heritability. In the current study, we propose a data processing and analysis procedure to evaluate more accurately the association between the pathogenesis of T2D and copy number variations (CNVs).
Methods: The data came from the Wellcome Trust Case Control Consortium (WTCCC) genome-wide CNV database. Individual CNVs were identified with the SW-ARRAY and CBS algorithms and genotyped with a global threshold method. After CNV calling, CNVs that overlapped across samples were split into smaller but more accurate CNV segments (CNVSegs). LASSO-based logistic regression with 10-fold cross-validation was then performed 100 times to examine the association of CNVSegs with T2D. The area under the curve (AUC) of every model was summarized as a preliminary check of the models' classification ability.
Results: After quality control, 1,813 T2D cases and 2,777 controls were included in the study. A total of 65,163 CNVs were identified: 25,512 in the T2D group and 39,651 in the healthy control group. Pre-processing the raw CNV data yielded 22,279 CNVSegs. Fitting 1,000 logistic regression models with the LASSO method identified 26 CNVSegs as T2D-associated according to pre-defined criteria (frequency > 85% and length ≥ 50 bp). Twenty-seven protein-coding genes overlapped these CNVSegs, of which 11 have been reported as relevant to T2D, obesity or metabolic syndrome in the published literature. The average AUC across all models was 0.611, with a maximum of 0.683.
Conclusions: Our study used LASSO logistic regression models to explore T2D-associated CNVSegs across the whole genome, with the aim of a more complete understanding of the genetic mechanisms of T2D. Further studies are needed to verify the influence of these susceptibility loci on the pathogenesis or progression of T2D in different populations.
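The abstract does not spell out how overlapping CNVs were split or how the selection frequency was computed, so the following is only a minimal sketch under stated assumptions. It assumes that CNVSegs are formed by cutting overlapping CNV intervals on one chromosome at every observed breakpoint, and that "frequency" means the fraction of fitted models in which a segment received a non-zero LASSO coefficient. The function names `split_into_segments` and `stable_segments` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def split_into_segments(cnvs):
    """Split overlapping CNV intervals into non-overlapping segments.

    `cnvs` is a list of half-open (start, end) intervals on one chromosome.
    Assumption: segments are bounded by every observed CNV breakpoint.
    """
    breakpoints = sorted({p for start, end in cnvs for p in (start, end)})
    segments = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        # Keep a candidate segment only if at least one CNV covers it fully.
        if any(start <= left and right <= end for start, end in cnvs):
            segments.append((left, right))
    return segments


def stable_segments(selected_per_model, min_frequency=0.85, min_length=50):
    """Filter segments by the criteria quoted in the abstract.

    `selected_per_model` holds, for each fitted model, the segments that
    model selected. A segment is kept when it was selected in more than
    `min_frequency` of the models and is at least `min_length` bp long.
    """
    n_models = len(selected_per_model)
    counts = Counter(
        seg for selected in selected_per_model for seg in set(selected)
    )
    return sorted(
        seg
        for seg, n in counts.items()
        if n / n_models > min_frequency and seg[1] - seg[0] >= min_length
    )
```

In a full analysis, the per-model selections would come from the repeated cross-validated LASSO fits; here they are simply passed in as lists, so the sketch stays independent of any particular modelling library.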