Exploration and verification a 13-gene diagnostic framework for ulcerative colitis across multiple platforms via machine learning algorithms

Jing Wang, Lin Li, Pingbo Chen, Chiyi He, Xiaoping Niu
DOI: https://doi.org/10.1038/s41598-024-65481-8
2024-07-01
Abstract: Ulcerative colitis (UC) is a chronic inflammatory bowel disease with intricate pathogenesis and varied presentation. Accurate diagnostic tools are imperative to detect and manage UC. This study sought to construct a robust diagnostic model using gene expression profiles and to identify key genes that differentiate UC patients from healthy controls. Gene expression profiles from eight cohorts, encompassing a total of 335 UC patients and 129 healthy controls, were analyzed. A total of 7530 gene sets were computed using the GSEA method. Subsequent batch correction, PCA plots, and intersection analysis identified crucial pathways and genes. Machine learning, incorporating 101 algorithm combinations, was employed to develop diagnostic models. The models were verified in four external cohorts, broadening the sample base. Immune cell infiltration was evaluated through single-sample GSEA. All statistical analyses were conducted using R (version 4.2.2), with significance set at a P value below 0.05. Of the 7530 gene sets, 19 intersecting pathways were consistently upregulated across all cohorts; they pertained to cell adhesion, development, metabolism, immune response, and protein regulation, and corresponded to 83 unique genes. Among the machine learning models, LASSO regression outperformed the others with an average AUC of 0.942. Its performance was confirmed in the four external cohorts, with AUC values ranging from 0.694 to 0.873 and statistically significant Kappa statistics supporting its predictive accuracy. The LASSO logistic regression model highlighted 13 genes, with LCN2, ASS1, and IRAK3 emerging as pivotal. Notably, LCN2 expression was significantly higher in active UC patients than in both non-active patients and healthy controls (P < 0.05).
Investigations into the correlation between these genes and immune cell infiltration in UC highlighted activated dendritic cells, with statistically significant positive correlations noted for LCN2 and IRAK3 across multiple datasets. Through comprehensive gene expression analysis and machine learning, a potent LASSO-based diagnostic model for UC was developed. Genes such as LCN2, ASS1, and IRAK3 hold potential as both diagnostic markers and therapeutic targets, offering a promising direction for future UC research and clinical application.
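The abstract reports model performance as AUC and Cohen's Kappa. As a minimal sketch of what these two statistics measure, the following pure-Python example computes both on a small, entirely hypothetical set of labels and predicted scores. The function names, the toy data, and the 0.5 decision threshold are assumptions made for illustration; none of them come from the study, which performed its analyses in R.

```python
from collections import Counter


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (Mann-Whitney formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def cohen_kappa(labels, predictions):
    """Cohen's kappa: agreement between labels and predictions beyond chance."""
    total = len(labels)
    if total == 0:
        raise ValueError("need at least one sample")
    # Observed agreement: fraction of samples where prediction equals label.
    observed = sum(1 for y, p in zip(labels, predictions) if y == p) / total
    # Expected agreement if labels and predictions were independent.
    label_counts = Counter(labels)
    pred_counts = Counter(predictions)
    expected = sum(label_counts[c] * pred_counts[c] for c in label_counts) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data (1 = UC, 0 = healthy control); NOT from the study.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
# Assumed decision threshold of 0.5 to turn scores into class calls.
predictions = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(labels, scores)
kappa = cohen_kappa(labels, predictions)
print(f"AUC = {auc:.3f}, kappa = {kappa:.3f}")  # AUC = 0.875, kappa = 0.500
```

AUC depends only on how the scores rank patients against controls, whereas Kappa depends on the class calls made at a chosen threshold, which is one reason the two are often reported together.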