Genotype likelihoods incorporated in non-linear dimensionality reduction techniques infer fine-scale population genetic structure

F. Gözde Çilingir, Kerem Uzel, Christine Grossen
DOI: https://doi.org/10.1101/2024.04.01.587545
2024-04-01
Abstract: Understanding population structure is essential for conservation genetics, as it provides insights into population connectivity and supports the development of targeted strategies to preserve genetic diversity and adaptability. While Principal Component Analysis (PCA) is a common linear dimensionality reduction method in genomics, the utility of non-linear techniques like t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) for revealing population genetic structures has been largely investigated in humans and model organisms but less so in wild animals. Our study bridges this gap by applying UMAP and t-SNE, alongside PCA, to medium and low-coverage whole-genome sequencing data from the scimitar oryx, once extinct in the wild, and the Galápagos giant tortoises, facing various threats. By estimating genotype likelihoods from coverages as low as 0.5x, we demonstrate that UMAP and t-SNE outperform PCA in identifying genetic structure at reduced genomic coverage levels. This finding underscores the potential of these methods in conservation genomics, particularly when combined with cost-effective, low-coverage sequencing. We also provide detailed guidance on hyperparameter tuning and implementation, facilitating the broader application of these techniques in wildlife genetics research to enhance biodiversity conservation efforts.
Bioinformatics
What problem does this paper attempt to address?
The paper asks how nonlinear dimensionality reduction techniques (such as t-SNE and UMAP) can be used to infer fine-scale population genetic structure from low-coverage whole-genome sequencing data in conservation genetics. Specifically, the researchers evaluate whether these nonlinear methods identify genetic structure better than a traditional linear method (PCA), especially when sequencing coverage is low.

### Background and Problem

1. **Importance of Conservation Genetics**:
   - Understanding population structure is crucial for conservation genetics because it reveals connectivity between populations and supports targeted strategies to protect genetic diversity and adaptive capacity.
2. **Limitations of Traditional Methods**:
   - PCA is a commonly used linear dimensionality reduction method, but it has limitations with complex genomic data: it can overlook nonlinear relationships and variation that lies outside the leading components.
3. **Application of Nonlinear Methods**:
   - Nonlinear dimensionality reduction techniques (such as t-SNE and UMAP) have been investigated mainly for revealing population genetic structure in humans and model organisms; their application in wildlife is less common.
4. **Advantages of Low-Coverage Sequencing**:
   - Low-coverage whole-genome sequencing is cost-effective and in many cases as economical as reduced representation sequencing, which makes it suitable for large-scale population genomic screening.

### Research Objectives

- **Evaluate the Performance of Nonlinear Methods**: Assess how well t-SNE and UMAP identify genetic structure, in comparison to PCA, when applied to medium and low-coverage whole-genome sequencing data.
- **Optimize Parameter Settings**: Provide detailed hyperparameter tuning and implementation guidelines to facilitate the broader application of these techniques in wildlife genetic studies.
- **Application Cases**: Demonstrate the methods in practical conservation genetics using the scimitar-horned oryx and the Galápagos giant tortoises.

### Main Contributions

- **Technical Advantages**: The study shows that t-SNE and UMAP outperform PCA in identifying genetic structure at reduced coverage, using genotype likelihoods estimated from coverages as low as 0.5x.
- **Practical Guidelines**: Detailed hyperparameter tuning and implementation guidelines are provided to support wider use of these techniques in conservation genetics research.
- **Conservation Significance**: By helping to identify genetically isolated or distinct populations, these methods are expected to support the conservation of endangered species and inform more effective conservation strategies.
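
To illustrate why the study works with genotype likelihoods rather than hard genotype calls at low coverage, here is a minimal sketch. It assumes a simplified binomial model with a fixed per-read error rate; this is an illustrative assumption, not the pipeline used in the paper. The function names `genotype_likelihoods` and `expected_dosage` and the default `error_rate=0.01` are hypothetical.

```python
def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Normalized likelihoods for a diploid genotype carrying 0, 1, or 2
    copies of the alternate allele, given read counts at one site.

    Assumption: each read is an independent draw, and a read reports the
    wrong allele with probability `error_rate` (a simplified model).
    """
    likelihoods = []
    for alt_copies in (0, 1, 2):
        alt_fraction = alt_copies / 2
        # Probability that a single read shows the alternate allele.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        # The binomial coefficient is identical for all three genotypes,
        # so it cancels out in the normalization below.
        likelihoods.append(p_alt ** alt_reads * (1 - p_alt) ** ref_reads)
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


def expected_dosage(likelihoods):
    """Expected number of alternate alleles, treating the normalized
    likelihoods as weights (equivalent to assuming a flat genotype prior)."""
    return sum(copies * weight for copies, weight in enumerate(likelihoods))


if __name__ == "__main__":
    # (ref_reads, alt_reads): no reads, one alt read, ten alt reads.
    for ref_reads, alt_reads in [(0, 0), (0, 1), (0, 10)]:
        weights = genotype_likelihoods(ref_reads, alt_reads)
        print(ref_reads, alt_reads,
              [round(w, 4) for w in weights],
              round(expected_dosage(weights), 4))
```

Under this model, a site covered by a single alternate read supports the homozygous-alternate genotype only about twice as strongly as the heterozygous one, while ten alternate reads place more than 99% of the weight on homozygous-alternate. Carrying that uncertainty forward, instead of forcing a call, is the general motivation for likelihood-based analyses of low-coverage data.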