STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers

Ryan J. Urbanowicz, Harsh Bandhey, Brendan T. Keenan, Greg Maislin, Sy Hwang, Danielle L. Mowery, Shannon M. Lynch, Diego R. Mazzotti, Fang Han, Qing Yun Li, Thomas Penzel, Sergio Tufik, Lia Bittencourt, Thorarinn Gislason, Philip de Chazal, Bhajan Singh, Nigel McArdle, Ning-Hung Chen, Allan Pack, Richard J. Schwab, Peter A. Cistulli, Ulysses J. Magalang
DOI: https://doi.org/10.48550/arXiv.2312.05461
2023-12-09
Abstract: While machine learning (ML) includes a valuable array of tools for analyzing biomedical data, significant time and expertise is required to assemble effective, rigorous, and unbiased pipelines. Automated ML (AutoML) tools seek to facilitate ML application by automating a subset of analysis pipeline elements. In this study we develop and validate a Simple, Transparent, End-to-end Automated Machine Learning Pipeline (STREAMLINE) and apply it to investigate the added utility of photography-based phenotypes for predicting obstructive sleep apnea (OSA); a common and underdiagnosed condition associated with a variety of health, economic, and safety consequences. STREAMLINE is designed to tackle biomedical binary classification tasks while adhering to best practices and accommodating complexity, scalability, reproducibility, customization, and model interpretation. Benchmarking analyses validated the efficacy of STREAMLINE across data simulations with increasingly complex patterns of association. Then we applied STREAMLINE to evaluate the utility of demographics (DEM), self-reported comorbidities (DX), symptoms (SYM), and photography-based craniofacial (CF) and intraoral (IO) anatomy measures in predicting any OSA or moderate/severe OSA using 3,111 participants from Sleep Apnea Global Interdisciplinary Consortium (SAGIC). OSA analyses identified a significant increase in ROC-AUC when adding CF to DEM+DX+SYM to predict moderate/severe OSA. A consistent but non-significant increase in PRC-AUC was observed with the addition of each subsequent feature set to predict any OSA, with CF and IO yielding minimal improvements. Application of STREAMLINE to OSA data suggests that CF features provide additional value in predicting moderate/severe OSA, but neither CF nor IO features meaningfully improved the prediction of any OSA beyond established demographics, comorbidity and symptom characteristics.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses three problems:

1. **Challenges in applying automated machine learning (AutoML) in biomedicine**
   - The paper develops and validates STREAMLINE, an automated machine learning pipeline intended to simplify and optimize machine learning analysis of biomedical data. Building effective, rigorous, reproducible, and unbiased machine learning pipelines traditionally requires a great deal of time and expertise; STREAMLINE aims to reduce these requirements by automating part of the analysis process.
2. **The effectiveness of photography-based phenotypes for predicting obstructive sleep apnea (OSA)**
   - OSA is a common but underdiagnosed condition associated with a variety of health, economic, and safety consequences. The paper uses data from international sleep centers to evaluate whether photography-based craniofacial (CF) and intraoral (IO) anatomy features improve OSA prediction. Specifically, the study examines the effect of adding these features to demographics (DEM), self-reported comorbidities (DX), and symptoms (SYM) when predicting "moderate-to-severe OSA" (AHI ≥ 15 events/hour) and "any OSA" (AHI ≥ 5 events/hour).
3. **Validating the effectiveness and performance of STREAMLINE**
   - The researchers benchmarked STREAMLINE on multiple real and simulated datasets: the hepatocellular carcinoma (HCC) survival dataset, a simulated genomic dataset, the multiplexer dataset, and the XOR dataset. These tests verified STREAMLINE's ability to handle complex associations and heterogeneous data.
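The two binary outcomes described in item 2 can be sketched as a simple thresholding of the apnea-hypopnea index (AHI). This is a minimal illustration, assuming the thresholds quoted above (5 and 15 events/hour); the function name `osa_outcomes` and the dictionary keys are hypothetical and do not come from the paper or from STREAMLINE.

```python
# Hypothetical helper for illustration only; names are not from the paper.
ANY_OSA_THRESHOLD = 5.0          # AHI >= 5 events/hour  -> "any OSA"
MODERATE_SEVERE_THRESHOLD = 15.0  # AHI >= 15 events/hour -> "moderate-to-severe OSA"


def osa_outcomes(ahi: float) -> dict:
    """Map an AHI value to the two binary classification targets."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    return {
        "any_osa": int(ahi >= ANY_OSA_THRESHOLD),
        "moderate_severe_osa": int(ahi >= MODERATE_SEVERE_THRESHOLD),
    }


if __name__ == "__main__":
    for ahi in (2.0, 5.0, 14.9, 15.0, 40.0):
        print(ahi, osa_outcomes(ahi))
```

Because each outcome is a separate binary target, a participant with an AHI between 5 and 15 is a positive case for "any OSA" but a negative case for "moderate-to-severe OSA".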
### Main conclusions

- **Benchmarking results**: STREAMLINE performs well on datasets with complex association patterns, such as the simulated genomic dataset and the multiplexer dataset, which supports its effectiveness and reliability.
- **OSA prediction results**: Adding photography-based craniofacial (CF) features to DEM+DX+SYM significantly increases ROC-AUC for predicting "moderate-to-severe OSA". For predicting "any OSA", adding CF and IO features yields only minimal, non-significant improvement.

### Formula presentation

The paper's results rest on statistical significance tests and model evaluation metrics such as ROC-AUC and PRC-AUC. The two metrics are defined as follows:

- **ROC-AUC (area under the receiver operating characteristic curve)**:

  \[ \text{ROC-AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d\,\text{FPR} \]

  where TPR is the true positive rate and FPR is the false positive rate.

- **PRC-AUC (area under the precision-recall curve)**:

  \[ \text{PRC-AUC} = \int_{0}^{1} \text{Precision}(\text{Recall}) \, d\,\text{Recall} \]

  where Precision is the fraction of predicted positives that are true positives, and Recall is the fraction of actual positives that are correctly identified (the same quantity as TPR).

Together, the benchmarking and OSA analyses demonstrate the potential of STREAMLINE for biomedical data analysis and support its effectiveness in practical applications.
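With a finite set of predictions, the two integrals above are approximated by sums over score thresholds. The sketch below is one common way to do this in plain Python; it is not taken from the paper or from STREAMLINE's code. It assumes binary labels (0/1) with at least one positive and one negative case. ROC-AUC uses the trapezoidal rule, and the PRC-AUC stand-in is the step-wise "average precision" approximation, which can differ slightly from a trapezoidal PRC-AUC. The function names `roc_auc` and `average_precision` are illustrative.

```python
def _sweep(y_true, scores):
    """Yield cumulative (tp, fp) after each distinct score threshold, high to low."""
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Handle tied scores together so they form a single curve point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        yield tp, fp


def roc_auc(y_true, scores):
    """Trapezoidal approximation of the integral of TPR over FPR."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    area, prev_tpr, prev_fpr = 0.0, 0.0, 0.0
    for tp, fp in _sweep(y_true, scores):
        tpr, fpr = tp / pos, fp / neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


def average_precision(y_true, scores):
    """Step-wise approximation of the integral of Precision over Recall."""
    pos = sum(y_true)
    total, prev_recall = 0.0, 0.0
    for tp, fp in _sweep(y_true, scores):
        recall = tp / pos
        precision = tp / (tp + fp)
        total += (recall - prev_recall) * precision
        prev_recall = recall
    return total


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    preds = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, preds))            # 0.75
    print(average_precision(labels, preds))  # about 0.8333
```

ROC-AUC is insensitive to class balance, whereas precision depends on the proportion of positive cases, which is one reason to report both metrics for outcomes with different prevalence such as "any OSA" and "moderate-to-severe OSA".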