Deep generative AI models analyzing circulating orphan non-coding RNAs enable accurate detection of early-stage non-small cell lung cancer
Mehran Karimzadeh, Amir Momen-Roknabadi, Taylor B. Cavazos, Yuqi Fang, Nae-Chyun Chen, Michael Multhaup, Jennifer Yen, Jeremy Ku, Jieyang Wang, Xuan Zhao, Philip Murzynowski, Kathleen Wang, Rose Hanna, Alice Huang, Diana Corti, Dang Nguyen, Ti Lam, Seda Kilinc, Patrick Arensdorf, Kimberly H. Chau, Anna Hartwig, Lisa Fish, Helen Li, Babak Behsaz, Olivier Elemento, James Zou, Fereydoun Hormozdiari, Babak Alipanahi, Hani Goodarzi
DOI: https://doi.org/10.1101/2024.04.09.24304531
2024-04-12
Abstract: Liquid biopsies have the potential to revolutionize cancer care through non-invasive early detection of tumors, when the disease can be more effectively managed and cured. Developing a robust liquid biopsy test requires collecting high-dimensional data from a large number of blood samples across heterogeneous groups of patients. We propose that the generative capability of variational auto-encoders enables learning a robust and generalizable signature of blood-based biomarkers that captures true biological signals while removing spurious confounders (e.g., library size, zero-inflation, and batch effects). In this study, we analyzed orphan non-coding RNAs (oncRNAs) from serum samples of 1,050 individuals diagnosed with non-small cell lung cancer (NSCLC) at various stages, as well as sex-, age-, and BMI-matched controls, to evaluate the potential use of deep generative models. We demonstrated that our multi-task generative AI model, Orion, surpassed commonly used methods in both overall performance and generalizability to held-out datasets. Orion achieved an overall sensitivity of 92% (95% CI: 85%–97%) at 90% specificity for cancer detection across all stages, outperforming the sensitivity of other methods, such as a support vector machine (SVM) classifier, ElasticNet, or XGBoost, on held-out validation datasets by more than ∼30%.
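The headline metric above, sensitivity at a fixed 90% specificity, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `sensitivity_at_specificity`, the thresholding rule, and the toy scores are assumptions made purely for illustration. The idea is to pick the score cutoff from the control samples so that at least 90% of controls fall at or below it, then report the fraction of cancer samples scoring above that cutoff.

```python
import math


def sensitivity_at_specificity(scores, labels, target_specificity=0.90):
    """Sensitivity at the lowest cutoff that reaches the target specificity.

    scores: model outputs, higher meaning "more likely cancer" (assumed).
    labels: 1 for cancer, 0 for control.
    Returns (sensitivity, achieved_specificity, threshold).
    """
    controls = sorted(s for s, y in zip(scores, labels) if y == 0)
    cases = [s for s, y in zip(scores, labels) if y == 1]
    if not controls or not cases:
        raise ValueError("need at least one control and one case")

    # Smallest control score such that at least target_specificity of the
    # controls are at or below it (they would be called negative).
    k = max(math.ceil(target_specificity * len(controls)) - 1, 0)
    threshold = controls[k]

    specificity = sum(s <= threshold for s in controls) / len(controls)
    sensitivity = sum(s > threshold for s in cases) / len(cases)
    return sensitivity, specificity, threshold


# Toy data (hypothetical, not from the study): 10 controls and 4 cases.
toy_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
              0.85, 0.95, 0.5, 0.99]
toy_labels = [0] * 10 + [1] * 4
sens, spec, cutoff = sensitivity_at_specificity(toy_scores, toy_labels)
```

In practice the cutoff would typically be fixed on training data and then applied unchanged to held-out samples, so that the reported specificity is not tuned on the validation set.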
Oncology