Abstract: Cancer data is widely available in repositories such as the National Cancer Institute (NCI) Genomic Data Commons (GDC). These datasets could serve as controls or comparisons in compendium analyses with user data, avoiding the expense and time of generating additional datasets. However, for these comparisons to be useful, users must be able to process their new data in the same manner, which can be non-trivial. Although the executables themselves are usually available in repositories, the GDC pipelines that describe the entire analysis workflow are currently published as text-based standard operating procedures (SOPs). It is difficult to document a computational workflow to the level of detail and accuracy required to reproduce the results. Version discrepancies and omitted details accumulate as the documentation inevitably lags behind code revisions. We address this problem by converting the SOPs into a downloadable and executable format. Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. These can be applied to reproducibly process user data and to harmonize datasets across repositories. To illustrate the importance of uniform processing of control and treatment data for accurate inference of differentially expressed genes, we use our publicly available graphical workflows to harmonize raw RNA-Seq datasets from the GDC and the Genotype-Tissue Expression (GTEx) project that were originally processed using different methodologies. By disseminating the analytical methodology in a reproducible and easily executed form, we greatly increase the utility of the GDC by enabling researchers to uniformly process custom data and datasets across multiple repositories to enhance data interpretation.
Our approach of making the analytical process as readily available as the data, together with our open-source executable workflows, can be applied to other data repositories to increase their impact on scientific research.
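As a toy illustration of why uniform processing of control and treatment data matters, the following minimal sketch compares log2 fold changes when both groups go through the same pipeline versus different ones. Everything in it is a labeled assumption: the gene names, expression values, and per-gene pipeline biases are hypothetical, and the multiplicative bias is only a stand-in for real differences in alignment, annotation, or quantification. It is not the GDC or GTEx methodology.

```python
import math

# Hypothetical "true" expression (arbitrary units) shared by tumor and normal
# samples: by construction, no gene is differentially expressed.
true_expression = {"GENE_A": 100.0, "GENE_B": 50.0, "GENE_C": 200.0}

# Hypothetical per-gene multiplicative biases for two processing pipelines,
# standing in for differences in aligner, annotation, or quantification.
pipeline_1_bias = {"GENE_A": 1.0, "GENE_B": 1.0, "GENE_C": 1.0}
pipeline_2_bias = {"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": 0.5}


def process(expression, bias):
    """Apply a pipeline's per-gene bias to the underlying expression."""
    return {gene: value * bias[gene] for gene, value in expression.items()}


def log2_fold_change(treatment, control):
    """Return per-gene log2(treatment / control)."""
    return {gene: math.log2(treatment[gene] / control[gene]) for gene in treatment}


tumor = process(true_expression, pipeline_1_bias)

# Control data processed with a different pipeline than the tumor data.
normal_mismatched = process(true_expression, pipeline_2_bias)
# Control data reprocessed with the same pipeline as the tumor data.
normal_uniform = process(true_expression, pipeline_1_bias)

mismatched_lfc = log2_fold_change(tumor, normal_mismatched)
uniform_lfc = log2_fold_change(tumor, normal_uniform)

for gene in true_expression:
    print(gene, "mismatched:", mismatched_lfc[gene], "uniform:", uniform_lfc[gene])
```

In this sketch the mismatched comparison reports non-zero fold changes for genes whose underlying expression is identical, while the uniformly processed comparison reports none. Real pipeline effects are more complex than a fixed per-gene factor, so this only conveys the direction of the problem.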
