Harmonizing and integrating the NCI Genomic Data Commons through accessible, interactive, and cloud-enabled workflows

Ling-Hong Hung, Bryce Fukuda, Robert Schmitz, Varik Hoang, Wes Lloyd, Ka Yee Yeung
DOI: https://doi.org/10.1101/2022.08.11.503660
2024-02-23
Abstract: Cancer data is widely available in repositories such as the National Cancer Institute (NCI) Genomic Data Commons (GDC). These datasets could serve as controls or comparisons in compendium analyses with user data, avoiding the expense and time of generating additional datasets. However, for these comparisons to be useful, users must be able to process their new data in the same manner, which can be non-trivial. Although the executables themselves are usually available in repositories, the GDC pipelines that describe the entire analysis workflow are currently published as text-based standard operating procedures (SOPs). It is difficult to document a computational workflow at the level of detail and accuracy required to reproduce the results. Discrepancies between versions and omitted details accumulate as the documentation inevitably lags behind code revisions. We address this problem by converting the SOPs into a downloadable and executable format. Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. These can be applied to reproducibly process user data and to harmonize datasets across repositories. Using our publicly available graphical workflows, we harmonize raw RNA-Seq datasets from the GDC and the Genotype-Tissue Expression (GTEx) project that were originally processed using different methodologies, illustrating the importance of uniform processing of control and treatment data for accurate inference of differentially expressed genes. By disseminating the analytical methodology in a reproducible and easily executed form, we greatly increase the utility of the GDC by enabling researchers to uniformly process custom data and datasets across multiple repositories to enhance data interpretation.
Our approach of making the analytical process as readily available as the data, together with our open-source executable workflows, can be applied to other data repositories to increase their impact on scientific research.
Bioinformatics
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Data Integration and Standardization**: The cancer genomic datasets in the National Cancer Institute (NCI) Genomic Data Commons (GDC) are widely available, but processing methods differ between datasets. To obtain meaningful comparisons in compendium analyses, users must process their new data in the same way as the repository data. The existing standard operating procedures (SOPs) are published as text, which makes it difficult to reproduce the entire computational process accurately and in full detail.
2. **Reproducibility and Interactivity of Workflows**: The authors convert the GDC DNA sequencing (DNA-Seq) and mRNA sequencing (mRNA-Seq) SOPs into downloadable, executable, containerized workflows with graphical interfaces. Researchers can then uniformly process data from different repositories, improving the consistency and accuracy of data interpretation.
3. **Importance of Data Harmonization**: The authors uniformly reprocess RNA-Seq data from the GDC and the Genotype-Tissue Expression (GTEx) project, showing that identical processing of control and treatment data is crucial for accurately identifying differentially expressed genes.
4. **Dynamic Solutions**: The paper proposes a dynamic solution that allows researchers to reprocess raw data as methods, versions, and supporting data are updated, keeping analyses consistent and reliable. This approach improves the utilization of data resources and strengthens integration across multiple data sources.
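
The requirement that control and treatment data be processed in the same manner can be pictured as a simple provenance check that runs before any differential expression comparison. The sketch below is only an illustration of that idea, not the authors' workflow: the `Provenance` record, its fields, and the placeholder values such as `"aligner-A"` are all hypothetical and do not describe the actual GDC or GTEx pipelines.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical record of how a dataset was processed (illustrative only)."""
    source: str      # repository label, e.g. "GDC" or "GTEx"
    aligner: str     # placeholder name for the alignment step
    reference: str   # placeholder name for the reference genome build
    annotation: str  # placeholder name for the gene annotation release


# Processing fields that must agree before two datasets are compared.
PROCESSING_FIELDS = ("aligner", "reference", "annotation")


def processing_mismatches(a: Provenance, b: Provenance) -> list[str]:
    """Return the processing fields whose values differ between a and b."""
    return [f for f in PROCESSING_FIELDS if getattr(a, f) != getattr(b, f)]


def require_uniform_processing(a: Provenance, b: Provenance) -> None:
    """Raise ValueError if the two datasets were not processed identically."""
    mismatches = processing_mismatches(a, b)
    if mismatches:
        raise ValueError(
            f"{a.source} and {b.source} differ in: {', '.join(mismatches)}; "
            "reprocess the raw data with one workflow before comparing."
        )


# Placeholder values: tumor data, normal data as originally published,
# and the same normal data after reprocessing with the tumor workflow.
tumor = Provenance("GDC", "aligner-A", "build-1", "annotation-X")
normal_original = Provenance("GTEx", "aligner-B", "build-1", "annotation-Y")
normal_reprocessed = Provenance("GTEx", "aligner-A", "build-1", "annotation-X")
```

Calling `require_uniform_processing(tumor, normal_original)` raises an error that names the differing steps, while the reprocessed dataset passes the check. An executable, containerized workflow makes this kind of agreement the default rather than something to verify by hand against a text SOP.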