Improved Personalized Survival Prediction of Patients With Diffuse Large B-cell Lymphoma Using Gene Expression Profiling

Adrián Mosquera Orgueira, José Ángel Díaz Arias, Miguel Cid López, Andres Peleteiro Raindo, Beatriz Antelo Rodriguez, Carlos Aliste Santos, Natalia Alonso Vence, Angeles Bendaña Lopez, Aitor Abuin Blanco, Laura Bao Perez, Marta Sonia Gonzalez Perez, Manuel Mateo Perez Encinas, Maximo Francisco Fraga Rodriguez, Jose Luis Bello Lopez
DOI: https://doi.org/10.21203/rs.3.rs-40793/v1
2020-07-10
Abstract:
Background Between 30% and 40% of patients with diffuse large B-cell lymphoma (DLBCL) have an adverse clinical evolution. The increased understanding of DLBCL biology has shed light on the clinical evolution of this disease, leading to the discovery of prognostic factors based on gene expression data, genomic rearrangements and mutational subgroups. Nevertheless, additional efforts are needed to enable survival predictions at the patient level. This study investigated new machine learning models of survival based on transcriptomic and clinical data.
Methods Gene expression profiling (GEP) data from 2 publicly available retrospective cohorts were analyzed. Cox regression and unsupervised clustering were performed on the larger cohort to identify probes associated with overall survival. Random forests were built to model survival using combinations of GEP data, cell-of-origin (COO) classification and clinical information. Cross-validation was used to compare model results in the training set, and Harrell's concordance index (c-index) was used to assess each model's predictive performance. Results were validated in an independent test set.
Results The training and test sets included 233 and 64 patients, respectively. We first derived and validated a 4-gene expression clustering that was independently associated with lower survival in 20% of patients. These genes were TNFRSF9, BIRC3, BCL2L1 and G3BP2. We then applied machine learning models to predict survival. A set of 102 genes was highly predictive of disease outcome, outperforming the available clinical information and COO classification. The best final model integrated clinical information, COO classification, the 4-gene-based clustering and expression data for 50 genes (training set c-index, 0.8404; test set c-index, 0.7942).
Conclusion This study indicates that modelling DLBCL survival with transcriptomic-based machine learning algorithms can largely outperform other important prognostic variables such as disease stage and COO.
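For readers unfamiliar with the c-index reported above, the following is a minimal sketch of how Harrell's concordance index can be computed for right-censored survival data. It is not the authors' code: the function name `harrell_c_index` and the toy follow-up times, event flags and risk scores are hypothetical, and pairs with tied follow-up times are simply skipped, which is a simplification.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: observed follow-up time per patient
    events: 1 if death was observed, 0 if the patient was censored
    risk_scores: model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter follow-up ended in an
        # observed event. Tied times are skipped (simplifying assumption).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier death had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data, not taken from the study.
toy_times = [5, 10, 12, 20, 25]
toy_events = [1, 0, 1, 1, 0]
toy_risk = [0.9, 0.4, 0.2, 0.3, 0.1]
toy_c_index = harrell_c_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk, which gives a scale for reading the training and test set c-indices quoted in the Results.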