Before and after: comparison of legacy and harmonized TCGA Genomic Data Commons’ data

Galen F Gao, Joel S Parker, Sheila M Reynolds, Tiago C Silva, Liang-Bo Wang, Wanding Zhou, Rehan Akbani, Matthew Bailey, Saianand Balu, Benjamin P Berman, Denise Brooks, Hu Chen, Andrew D Cherniack, John A Demchok, Li Ding, Ina Felau, Sharon Gaheen, Daniela S Gerhard, David I Heiman, Kyle M Hernandez, Katherine A Hoadley, Reyka Jayasinghe, Anab Kemal, Theo A Knijnenburg, Peter W Laird, Michael KA Mensah, Andrew J Mungall, A Gordon Robertson, Hui Shen, Roy Tarnuzzer, Zhining Wang, Matthew Wyczalkowski, Liming Yang, Jean C Zenklusen, Zhenyu Zhang, Han Liang, Michael S Noble
2019-07-24
Abstract: We present a systematic analysis of the effects of synchronizing a large-scale, deeply characterized, multi-omic dataset to the current human reference genome, using updated software, pipelines, and annotations. For each of 5 molecular data platforms in The Cancer Genome Atlas (TCGA), namely mRNA and miRNA expression, single nucleotide variants, DNA methylation, and copy number alterations, comprehensive sample-, gene-, and probe-level studies were performed to quantify the degree of similarity between the ‘legacy’ GRCh37 (hg19) TCGA data and its GRCh38 (hg38) version as ‘harmonized’ by the Genomic Data Commons. We offer gene lists to elucidate differences that remained after controlling for confounders, and strategies to mitigate their impact on biological interpretation. Our results demonstrate that the hg19 and hg38 TCGA datasets are very highly concordant, promote informed use of …