Machine Learning and Metabolomics: Diagnosis of Malignant Breast Cancer

Andrew Winnicki, Yujia Qin, Mayumi Jijiwa, Masaki Nasu, Yuanyuan Fu, Youping Deng
DOI: https://doi.org/10.1096/fasebj.2021.35.s1.05115
2021-01-01
Abstract: Breast cancer is one of the leading causes of death among women in the US, and early diagnosis of the disease contributes significantly to patient survival rates. Metabolomics data, which can be obtained in a non-invasive manner through blood plasma sampling, have been increasingly combined with machine learning methods in biology, opening up a novel dimension of quantitative analysis with the potential to improve our understanding of highly complex diseases such as cancer. In this study, machine learning methods were used to analyze several hundred metabolomic signatures of breast cancer patients in order to discriminate between malignant and benign tumors. Plasma samples were taken from 100 female breast cancer patients (50 with malignant and 50 with benign tumors), and metabolite molecules such as lipids and bile acids were extracted using ultra-performance liquid chromatography coupled with tandem mass spectrometry. The data were then log-transformed, and metabolites with a significant log-fold change, as assessed by a Wilcoxon test, were selected to reduce dimensionality. These data were then analyzed using an ensemble of eight different models: decision tree, gradient boosting, random forest, elastic net, linear discriminant analysis, support vector machine, nearest shrunken centroids, and a neural network. Model performance was evaluated using the receiver operating characteristic (ROC) curve and the area under the curve (AUC). Variations on these models, including principal component transformations, were also included in the analysis. The most successful model was built using recursive feature selection and a logistic regression model with L1 regularization; it achieved an accuracy score of 95.0% with a biomarker panel of 42 metabolites assembled from the original 358. With the Wilcoxon-selected feature set of 62 metabolites, gradient boosting and random forest algorithms also reached accuracy scores of up to 90%, with an AUC of 0.98.
Future refinements on these models could eventually lead to the use of the biomarkers in a clinical setting, helping to improve survival rates through early detection of malignant breast cancer tumors.
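The Wilcoxon-based feature selection and the AUC metric mentioned in the abstract are closely related: the rank-sum statistic of a single variable, divided by the number of malignant/benign pairs, equals that variable's AUC. The following is a minimal sketch of that filtering step, not the authors' code. It assumes a two-sample Wilcoxon rank-sum test with a normal approximation and no tie correction, a significance threshold of 0.05, and synthetic data; the function names (`rank_sum_auc`, `rank_sum_p_value`, `select_features`, `simulate`) are hypothetical.

```python
import math
import random


def rank_sum_auc(pos, neg):
    """AUC of one variable: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_sum_p_value(pos, neg):
    """Two-sided rank-sum p-value (normal approximation, no tie correction)."""
    n1, n2 = len(pos), len(neg)
    u = rank_sum_auc(pos, neg) * n1 * n2          # Mann-Whitney U statistic
    mean = n1 * n2 / 2.0                          # E[U] under the null
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def select_features(malignant, benign, alpha=0.05):
    """Indices of metabolites whose levels differ between the two groups.

    The log transform mirrors the pipeline in the abstract; it does not
    change the ranks, so the test result is the same with or without it.
    """
    selected = []
    for j in range(len(malignant[0])):
        pos = [math.log2(row[j]) for row in malignant]
        neg = [math.log2(row[j]) for row in benign]
        if rank_sum_p_value(pos, neg) < alpha:
            selected.append(j)
    return selected


def simulate(seed=0, n_per_group=50, n_features=10):
    """Synthetic intensities: features 0 and 1 are elevated in 'malignant'."""
    rng = random.Random(seed)
    benign = [[rng.lognormvariate(0.0, 0.5) for _ in range(n_features)]
              for _ in range(n_per_group)]
    malignant = [[rng.lognormvariate(2.0 if j < 2 else 0.0, 0.5)
                  for j in range(n_features)]
                 for _ in range(n_per_group)]
    return malignant, benign


malignant, benign = simulate()
selected = select_features(malignant, benign)
print("selected metabolite indices:", selected)
```

In a full pipeline, the selected columns would then be passed to the classifiers listed in the abstract; with several hundred metabolites, a multiple-testing correction on the p-values would likely be worth considering as well.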