SLEDGe: Inference of ancient whole genome duplications using machine learning

Brittany L. Sutherland, George P. Tiley, Zheng Li, Michael TW McKibben, Michael S. Barker
DOI: https://doi.org/10.1101/2024.01.17.574559
2024-01-18
Abstract: Ancient whole-genome duplications (WGDs), previous genome duplication events that have since been eroded via diploidization, are increasingly identified throughout eukaryotes. One constraint on large-scale studies of ancient eukaryotic WGD is that relatively large, high-quality datasets are often needed to definitively establish ancient WGD events; the lower-input alternative, interpretation of genome-wide synonymous substitution rates (Ks plots), is prone to bias and inconsistency. We address the shortcomings of the current Ks plot method by building a Ks plot simulator. This data-agnostic approach simulates common distributions found in Ks plots in the presence or absence of ancient WGD signatures. In conjunction with a machine-learning classifier, this approach can quickly assess the likelihood that transcriptomic and genomic data bear WGD signatures. On independently generated synthetic data and real plant transcriptomic data, SLEDGe correctly identifies ancient WGD in 93-100% of samples. This approach can serve as a quick classification step in large-scale genomic analyses, identifying putative ancient polyploids for further study.
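The simulate-then-classify idea in the abstract can be illustrated with a small toy sketch. This is not SLEDGe's simulator or classifier. Everything in it is an assumption made for illustration: the function names, the exponential background for ongoing small-scale duplications, the Gaussian peak standing in for a WGD signature, and the nearest-centroid rule used in place of the paper's machine-learning classifier.

```python
import random

MAX_KS = 2.0   # assumed upper bound of the Ks axis (illustrative)
N_BINS = 20    # assumed number of histogram bins (illustrative)


def simulate_ks(n_pairs, wgd, rng, peak_mean=0.8, peak_sd=0.1, wgd_fraction=0.4):
    """Draw toy Ks values for paralog pairs.

    Background duplicates follow an exponential decay; if `wgd` is True,
    a fraction of pairs is drawn from a Gaussian peak instead.
    All parameter values are hypothetical.
    """
    values = []
    while len(values) < n_pairs:
        if wgd and rng.random() < wgd_fraction:
            ks = rng.gauss(peak_mean, peak_sd)
        else:
            ks = rng.expovariate(2.0)  # rate 2.0, i.e. mean Ks of 0.5
        if 0.0 < ks < MAX_KS:          # keep only values inside the plot range
            values.append(ks)
    return values


def histogram(values):
    """Bin Ks values into a normalized histogram (the 'Ks plot' as a vector)."""
    counts = [0] * N_BINS
    for v in values:
        counts[min(int(v / MAX_KS * N_BINS), N_BINS - 1)] += 1
    return [c / len(values) for c in counts]


def centroid(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(rng, n_per_class=50, n_pairs=1000):
    """Fit one centroid per class from simulated histograms."""
    return {
        label: centroid(
            [histogram(simulate_ks(n_pairs, label, rng)) for _ in range(n_per_class)]
        )
        for label in (False, True)
    }


def predict(centroids, hist):
    """Return the label (True = WGD) of the nearest class centroid."""
    return min(centroids, key=lambda label: sq_dist(centroids[label], hist))


rng = random.Random(0)
centroids = train(rng)

# Evaluate on freshly simulated plots, alternating no-WGD and WGD labels.
test_labels = [i % 2 == 1 for i in range(40)]
preds = [predict(centroids, histogram(simulate_ks(1000, lab, rng))) for lab in test_labels]
accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
print(f"toy accuracy on simulated plots: {accuracy:.2f}")
```

Because the classifier here is trained and evaluated on the same toy generator, its accuracy says nothing about real data; the 93-100% figure in the abstract refers to the authors' own method on independently generated synthetic data and real plant transcriptomes.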
Bioinformatics