A Community Effort to Identify and Correct Mislabeled Samples in Proteogenomic Studies
Seungyeul Yoo, Zhiao Shi, Bo Wen, SoonJye Kho, Renke Pan, Hanying Feng, Hong Chen, Anders Carlsson, Patrik Eden, Weiping Ma, Michael Raymer, Ezekiel J. Maier, Zivana Tezak, Elaine Johanson, Denise Hinton, Henry Rodriguez, Jun Zhu, Emily Boja, Pei Wang, Bing Zhang
DOI: https://doi.org/10.1016/j.patter.2021.100245
2021-01-01
Patterns
Abstract: Sample mislabeling or misannotation is a long-standing problem in scientific research and is particularly prevalent in large-scale multi-omic studies because of the complexity of multi-omic workflows. There is an urgent need for quality controls that automatically screen for and correct sample mislabels or misannotations in such studies. Here, we describe the crowdsourced precisionFDA NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge, which provides a framework for systematic benchmarking and evaluation of mislabel identification and correction methods for integrative proteogenomic studies. The challenge received a large number of submissions from domestic and international data scientists, and performance varied widely across the submitted methods. Post-challenge collaboration between the top-performing teams and the challenge organizers produced an open-source software tool, COSMO, which has demonstrated high accuracy and robustness in identifying and correcting mislabeling in both simulated and real multi-omic datasets.