Abstract 6209: From data disparity to data harmony: A comprehensive pan-cancer omics data collection

Lea Meunier, Guillaume Appe, Abdelkader Behdenna, Valentin Bernu, Helia Brull Corretger, Prashant Dhillon, Eleonore Fox, Julien Haziza, Charles Lescure, Camille Marijon, Clemence Petit, Solene Weill, Akpeli Nordor
DOI: https://doi.org/10.1158/1538-7445.am2024-6209
IF: 11.2
2024-03-24
Cancer Research
Abstract: In cancer research, the exponential growth of omics datasets offers a significant opportunity for scientific advancement. However, challenges such as the lack of uniform standards in both clinical and omic data hinder the effective use of these datasets, impeding our understanding of cancer biology and the development of innovative therapeutic approaches. To address these challenges, we have created a novel collection of pan-cancer omics datasets with extensive clinical data harmonization and consistent omic data normalization. Here, we focused on patient-derived gene expression microarray datasets from the Gene Expression Omnibus database. To navigate the diverse clinical descriptions inherent in these datasets, we used our proprietary ontology, machine learning models, and domain expert quality control processes to homogenize the clinical data elements. Datasets were selected based on sample composition, molecular data compatibility, and clinical data availability, and then passed through a uniform preprocessing and normalization pipeline to maximize data quality. Finally, gene names were aligned on a single annotation reference, and potential batch effects were adjusted before the expression data were merged. We obtained a total of 32,825 transcriptomic sample profiles from 470 datasets, covering 13,435 genes and 45 clinical data elements, across 30 cancer types. Healthy tissue was favored over adjacent tissue to minimize the risk of introducing biases related to cancer patient background genomic profiles into downstream analyses. We compared our collection with The Cancer Genome Atlas (TCGA), the most commonly used RNA-seq transcriptomic dataset in cancer research. Our collection covers 30 of the 33 TCGA cancer types, with on average 4.2 times more samples per cancer type ([0.3; 45.5], median 3.4).
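The final harmonization steps described above (restricting to shared genes, adjusting for batch, merging) can be illustrated with a minimal Python sketch. The abstract does not name the batch adjustment method, so per-dataset mean centering is used here purely as a simple stand-in; the data layout, function names, and toy values are all hypothetical.

```python
from statistics import mean


def common_genes(datasets):
    """Return the sorted list of genes present in every dataset."""
    gene_sets = [set(d["expr"]) for d in datasets]
    return sorted(set.intersection(*gene_sets))


def center_per_gene(expr, genes):
    """Subtract each gene's within-dataset mean.

    Assumption: a crude stand-in for batch adjustment, not the
    method used by the authors (which the abstract does not specify).
    """
    return {g: [v - mean(expr[g]) for v in expr[g]] for g in genes}


def merge_datasets(datasets):
    """Merge datasets on their shared genes after per-dataset centering."""
    genes = common_genes(datasets)
    merged = {g: [] for g in genes}
    samples = []
    for d in datasets:
        centered = center_per_gene(d["expr"], genes)
        for g in genes:
            merged[g].extend(centered[g])
        samples.extend(d["samples"])
    return samples, merged


# Toy data: two hypothetical datasets with partially overlapping genes.
ds_a = {
    "samples": ["a1", "a2"],
    "expr": {"TP53": [5.0, 7.0], "EGFR": [2.0, 4.0], "MYC": [1.0, 3.0]},
}
ds_b = {
    "samples": ["b1", "b2"],
    "expr": {"TP53": [10.0, 12.0], "EGFR": [8.0, 6.0]},
}

samples, merged = merge_datasets([ds_a, ds_b])
```

Mean centering removes dataset-level offsets but would also erase real biological differences when datasets differ in composition (for example, tumor-only versus healthy-only cohorts), which is one reason dedicated batch correction methods are normally preferred in practice.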
Despite the two data collections being based on distinct technologies, we observed a Pearson correlation of 0.69 over the 11,753 genes in common, and a 100% overlap of the differentially expressed genes between genders. This consistency highlights cross-technology reliability and complementarity. We have built, and continue to enrich, a comprehensive dataset collection enabling the secondary analysis of high-quality omic data. This initial work, focused on microarray datasets, allows us to streamline the design, exploration, and validation of various omics data-driven studies in cancer research. Our ongoing efforts involve not only the continued integration of microarray datasets but also the integration of pan-cancer RNA-seq and single-cell data. This initiative is set to expand further, encompassing a broader range of omics datasets in the future.
Citation Format: Lea Meunier, Guillaume Appe, Abdelkader Behdenna, Valentin Bernu, Helia Brull Corretger, Prashant Dhillon, Eleonore Fox, Julien Haziza, Charles Lescure, Camille Marijon, Clemence Petit, Solene Weill, Akpeli Nordor. From data disparity to data harmony: A comprehensive pan-cancer omics data collection [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 6209.
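The cross-technology comparison reported in the abstract (a Pearson correlation computed over the genes common to both collections) can be sketched as follows. The abstract does not state which per-gene quantity was correlated, so this sketch assumes each collection is summarized as one value per gene; the function names and toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cross_platform_correlation(profile_a, profile_b):
    """Correlate two per-gene profiles over the genes they share."""
    shared = sorted(set(profile_a) & set(profile_b))
    r = pearson(
        [profile_a[g] for g in shared],
        [profile_b[g] for g in shared],
    )
    return shared, r


# Toy per-gene summaries (assumed: one value per gene per platform).
microarray = {"TP53": 1.0, "EGFR": 2.0, "MYC": 3.0, "KRAS": 4.0}
rnaseq = {"TP53": 2.0, "EGFR": 4.0, "MYC": 6.0, "BRCA1": 9.0}

shared, r = cross_platform_correlation(microarray, rnaseq)
```

Genes present in only one collection (here `KRAS` and `BRCA1`) are dropped before the correlation is computed, mirroring the restriction to genes in common described in the abstract.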
oncology