A robust benchmark for detecting low-frequency variants in the HG002 Genome In A Bottle NIST reference material.

Camille Daniels, Adetola Abdulkadir, Megan H. Cleveland, Jennifer H. McDaniel, David Jaspez, Luis Alberto Rubio-Rodriguez, Adrian Munoz-Barrera, Jose Miguel Lorenzo-Salazar, Carlos Flores, Byunggil Yoo, Sayed Mohammad Ebrahim Sahraeian, Yina Wang, Massimiliano Rossi, Arun Visvanath, Lisa Murray, Wei-Ting Chen, Severine Catreux, James Han, Rami Mehio, Gavin Parnaby, Andrew Carroll, Pi-Chuan Chang, Kishwar Shafin, Daniel E. Cook, Alexey Kolesnikov, Lucas Brambrink, Mohammed Faizal Eeman Mootor, Yash Patel, Takafumi N. Yamaguchi, Paul C. Boutros, Karolina Sienkiewicz, Jonathan Foox, Christopher E. Mason, Bryan Lajoie, Carlos A. Ruiz-Perez, Semyon Kruglyak, Justin M. Zook, Nathan D. Olson
DOI: https://doi.org/10.1101/2024.12.02.625685
2024-12-05
Abstract: Somatic mosaicism is an important cause of disease, but mosaic and somatic variants are often challenging to detect because they exist in only a fraction of cells. To address the need for benchmarking subclonal variants in normal cell populations, we developed a benchmark containing mosaic variants in the Genome in a Bottle Consortium (GIAB) HG002 reference material DNA from a large batch of a normal lymphoblastoid cell line. First, we used a somatic variant caller with high coverage (300x) Illumina whole genome sequencing data from the Ashkenazi Jewish trio to detect variants in HG002 not detected in at least 5% of cells from the combined parental data. These candidate mosaic variants were subsequently evaluated using >100x BGI, Element, and PacBio HiFi data. High confidence candidate SNVs with variant allele fractions above 5% were included in the HG002 draft mosaic variant benchmark, with 13/85 occurring in medically relevant gene regions. We also delineated a 2.45 Gbp subset of the previously defined germline autosomal benchmark regions for HG002 in which no additional mosaic variants >2% exist, enabling robust assessment of false positives. The variant allele fraction of some mosaic variants differs between batches of cells, so using data from the homogeneous batch of reference material DNA is critical for benchmarking these variants. External validation of this mosaic benchmark showed it can be used to reliably identify both false negatives and false positives for a variety of technologies and detection algorithms, demonstrating its utility for optimization and validation. By adding our characterization of mosaic variants in this widely used cell line, we support extensive benchmarking efforts using it in simulation, spike-in, and mixture studies.
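As a rough illustration of the allele-fraction filtering described in the abstract, the sketch below computes a variant allele fraction (VAF) from read counts and keeps candidates whose VAF in HG002 exceeds 5% while the combined parental data show little support. This is not the authors' pipeline: the `Candidate` record, the function names, and the parental cutoff are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    # Hypothetical record layout; not taken from the paper.
    chrom: str
    pos: int
    alt_reads: int           # reads supporting the alternate allele in HG002
    total_reads: int         # total reads covering the site in HG002
    parent_alt_reads: int    # alternate-allele reads in the combined parental data
    parent_total_reads: int  # total reads in the combined parental data


def vaf(alt_reads: int, total_reads: int) -> float:
    """Variant allele fraction: alternate-supporting reads over total reads."""
    return alt_reads / total_reads if total_reads > 0 else 0.0


def select_mosaic_candidates(
    candidates: List[Candidate],
    min_vaf: float = 0.05,         # abstract: benchmark SNVs have VAF above 5%
    max_parent_vaf: float = 0.05,  # assumed cutoff for parental support
) -> List[Candidate]:
    """Keep sites with VAF above min_vaf in HG002 and below max_parent_vaf in the parents."""
    kept = []
    for c in candidates:
        child_vaf = vaf(c.alt_reads, c.total_reads)
        parent_vaf = vaf(c.parent_alt_reads, c.parent_total_reads)
        if child_vaf > min_vaf and parent_vaf < max_parent_vaf:
            kept.append(c)
    return kept


# Illustrative counts only; these are not data from the study.
examples = [
    Candidate("chr1", 100, 30, 300, 0, 600),   # VAF 0.10, no parental support
    Candidate("chr1", 200, 6, 300, 0, 600),    # VAF 0.02, below the 5% threshold
    Candidate("chr2", 300, 45, 300, 60, 600),  # VAF 0.15, but present in the parents
]
selected = select_mosaic_candidates(examples)
```

With these made-up counts, only the first site passes both filters. A real workflow would also account for sequencing error, coverage differences between technologies, and orthogonal confirmation, as the abstract describes.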
Bioinformatics