Low LRs obtained from DNA mixtures: On calibration and discrimination performance of probabilistic genotyping software

Moya McCarthy-Allen, Oyvind Bleka, Rolf Ypma, Peter Gill, Corina C.G. Benschop
DOI: https://doi.org/10.1101/2024.06.06.597689
2024-06-06
Abstract: The validity of a probabilistic genotyping (PG) system is typically demonstrated by following international guidelines for the developmental and internal validation of PG software. These guidelines mainly focus on discriminatory power. Very few studies have reported metrics that depend on the calibration of likelihood ratio (LR) systems. In this study, discriminatory power as well as various calibration metrics, such as Empirical Cross-Entropy (ECE) plots, pool adjacent violators (PAV) plots, log-likelihood-ratio cost (Cllr and Cllrcal), fiducial calibration discrepancy plots, and Turing expectation, were examined using the publicly available PROVEDIt dataset. The aim was to gain deeper insight into the performance of a variety of PG software in the lower LR ranges (~LR 1-10,000), with a focus on DNAStatistX and EuroForMix, which use maximum likelihood estimation (MLE). The findings may prompt end users to reconsider current LR thresholds for reporting. In previous studies, overstated low LRs were observed for these PG software. However, applying (arbitrarily) high LR thresholds for reporting wastes relevant evidential value. This study demonstrates, based on calibration performance, that previously reported LR thresholds can be lowered or even discarded. Considering LRs >1, there was no evidence of miscalibration above LR ~1,000 when using Fst 0.01. Below this LR value, miscalibration was observed. Calibration performance generally improved with the use of Fst 0.03, but the extent of the improvement depended on the dataset: results ranged from miscalibration up to LR ~100 to no evidence of miscalibration, similar to PG software that uses different methods to model peak height (HMC and STRmix). This study demonstrates that practitioners using MLE-based models should be careful when low LR ranges are reported, though applying arbitrarily high LR thresholds is discouraged. This study also highlights various calibration metrics that are useful in understanding the performance of a PG system.
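As an illustration of one of the metrics named in the abstract, the log-likelihood-ratio cost (Cllr) can be sketched in a few lines of Python. This is a minimal sketch using the commonly cited definition of Cllr, not the authors' code; the function name `cllr` and the argument names `lrs_h1` (LRs from comparisons where the person of interest is a true contributor) and `lrs_h2` (LRs from comparisons with non-contributors) are assumptions introduced here.

```python
import math


def cllr(lrs_h1, lrs_h2):
    """Log-likelihood-ratio cost for two sets of LRs (hypothetical helper).

    lrs_h1: LRs where the proposition of contribution is true.
    lrs_h2: LRs where the proposition of contribution is false.
    All LRs must be strictly positive.
    """
    if not lrs_h1 or not lrs_h2:
        raise ValueError("both LR sets must be non-empty")
    if any(lr <= 0 for lr in list(lrs_h1) + list(lrs_h2)):
        raise ValueError("LRs must be strictly positive")

    # Penalty for true-contributor LRs: small LRs are costly.
    penalty_h1 = sum(math.log2(1 + 1 / lr) for lr in lrs_h1) / len(lrs_h1)
    # Penalty for non-contributor LRs: large LRs are costly.
    penalty_h2 = sum(math.log2(1 + lr) for lr in lrs_h2) / len(lrs_h2)
    return 0.5 * (penalty_h1 + penalty_h2)


# An uninformative system that always reports LR = 1 has Cllr = 1.
print(cllr([1.0], [1.0]))  # 1.0
```

Values well below 1 indicate LRs that are both discriminating and well calibrated, while values above 1 indicate that the reported LRs are, on average, more misleading than always reporting LR = 1.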
Genetics