Abstract 13172: Using Machine Learning to Identify Transcriptomic Biomarkers That Differentiate Early vs Late Stage of Atherosclerosis

Colin Price, James Hixson, Do-Kyun Kim, Richard Vander Heide, Yue Wang, Jennifer E Van Eyk, David M Herrington, Andrew Warren, Chunhong Mao
Biocomplexity Institute and Initiative, Charlottesville, VA; UTHealth Sch of Public Health, Houston, TX; UTHealth, Sch of Public Health, Houston, TX; LSU HEALTH SCIENCE CENTER, New Orleans, LA; CEDAR SINAI MEDICAL CENTER, Los Angeles, CA; WAKE FOREST UNIV SCHOOL MEDICI, Winston Salem, NC; Univ of Virginia, Biocomplexity Institute and Initiative, Charlottesville, VA
DOI: https://doi.org/10.1161/circ.144.suppl_1.13172
IF: 37.8
2021-11-10
Circulation
Abstract: Circulation, Volume 144, Issue Suppl_1, Page A13172-A13172, November 16, 2021.
Introduction: We used a machine learning approach to explore the transcriptomic signaling component of atherosclerosis. This approach can be viewed as complementary to the classical differential-expression-based RNA-Seq approach while defining some of its limitations and providing insight into the cellular basis of atherosclerosis.
Methods: Abdominal aorta specimens (n=242) from 128 coroner's autopsy cases were graded by pathologists and classified as normal (nl), fatty streak (fs), fibrous plaque (fp), or complex fibrous plaque (fc). The pathology samples were sorted into two groups for comparison: normal/early (nl/fs) vs late stage (fp/fc) of atherosclerosis. The RNA-Seq data were analyzed using an ensemble of machine learning methods to identify a set of genes that differentiate the normal/early and late pathology stages. Three feature selection algorithms (recursive feature elimination, random forest optimization, and regularized linear regression) were employed to assign a total ranking to the importance of genes for pathology classification. Five different classifiers (Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine, and Decision Tree) were trained, and an XGBoost model was used for ensemble learning. We used the resulting performance characterization relative to the clinically established ground truth to validate the characteristic genes and abundances selected by this model.
Results: The ensemble machine learning approach identified a set of gene features that best explain late versus normal/early stages of atherosclerosis in a 5-fold cross-validation experimental design. Boosting shallow learners with XGBoost gave stable performance across sample splits and more consistently high F1 scores. We found that the performance of the resulting model was highly correlated with the degree of disease severity, indicating a relationship between concerted transcript abundance and the presentation of disease phenotype.
Conclusions: Our machine learning approach identifies ensemble biomarkers that differentiate early vs late stage atherosclerosis. Our results also indicate a potential transcriptomic basis for the severity of disease phenotype as embodied by histopathology grading not included in the training data.
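The abstract does not say how the three feature selection outputs were merged into a total ranking, nor how F1 was computed per fold. As an illustration only, the sketch below assumes a simple mean-rank aggregation across methods and the standard binary F1 definition. The gene names, scores, and the function names `aggregate_ranks` and `f1_score` are hypothetical and are not taken from the study.

```python
from statistics import mean


def aggregate_ranks(score_tables):
    """Combine per-method importance scores into one total gene ranking.

    score_tables: list of dicts mapping gene -> importance score
    (higher means more important), one dict per feature selection method.
    Returns genes ordered by mean rank across methods, best first.
    Mean-rank aggregation is an assumption, not the authors' stated rule.
    """
    genes = sorted(score_tables[0])
    per_method_ranks = []
    for scores in score_tables:
        # Rank 1 goes to the highest-scoring gene within this method.
        ordered = sorted(genes, key=lambda g: -scores[g])
        per_method_ranks.append({g: r for r, g in enumerate(ordered, start=1)})
    mean_rank = {g: mean(r[g] for r in per_method_ranks) for g in genes}
    # Ties in mean rank are broken alphabetically for determinism.
    return sorted(genes, key=lambda g: (mean_rank[g], g))


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical importance scores from three selectors (toy values).
    rfe = {"GENE_A": 0.9, "GENE_B": 0.5, "GENE_C": 0.1}
    forest = {"GENE_A": 0.6, "GENE_B": 0.7, "GENE_C": 0.2}
    regularized = {"GENE_A": 0.8, "GENE_B": 0.3, "GENE_C": 0.4}
    print(aggregate_ranks([rfe, forest, regularized]))

    # Toy labels: 1 = late stage (fp/fc), 0 = normal/early (nl/fs).
    print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

In a 5-fold design, `f1_score` would be evaluated once per held-out fold, and the spread of the five values gives a rough sense of the stability across sample splits that the Results describe.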
cardiac & cardiovascular systems,peripheral vascular disease