Polaris: Polarization of ancestral and derived polymorphic alleles for inferences of extended haplotype homozygosity in human populations.

Alessandro Lisi, Michael C. Campbell
DOI: https://doi.org/10.1101/2024.12.06.627098
2024-12-08
Abstract:

Summary: Statistical methods that measure the extent of haplotype homozygosity on chromosomes have been highly informative for identifying episodes of recent selection. For example, the integrated haplotype score (iHS) and the extended haplotype homozygosity (EHH) statistics detect long-range haplotype structure around derived and ancestral alleles indicative of classic and soft selective sweeps, respectively. However, to our knowledge, there are currently no publicly available methods that classify ancestral and derived alleles in genomic datasets for the purpose of quantifying the extent of haplotype homozygosity. Here, we introduce the Polaris package, which polarizes chromosomal variants into ancestral and derived alleles and creates corresponding genetic maps for analysis by selscan and HaploSweep, two versatile haplotype-based programs that perform scans for selection. With the input files generated by Polaris, selscan and/or HaploSweep can assign the appropriate sign (positive or negative) to outlier iHS statistics, enabling users to distinguish between selection on derived and ancestral alleles. In addition, Polaris can convert the numerical output of these analyses into graphical representations of selective sweeps, increasing the functionality of our software.

Results: To demonstrate the utility of our approach, we applied the Polaris package to Chromosome 2 in the European Finnish population from the 1000 Genomes Project. More specifically, we examined the regulatory region in intron 13 of MCM6 associated with lactase persistence (i.e., the ability to digest the lactose sugar present in fresh milk), a region of intense interest to human evolutionary geneticists. Our analyses showed that the derived T-13910 allele (a known enhancer for lactase expression) sits on an extended haplotype background in the Finnish population, consistent with a classic selective sweep model, as determined by iHS and EHH statistics calculated by selscan and HaploSweep. Importantly, we were able to immediately identify this target allele under selection based on the information generated by our software. We also explored outlier statistics across Chromosome 2 in two distinct datasets: i) one containing polarized alleles generated with Polaris and ii) the other containing unpolarized alleles in the original phased VCF file. Here, we found a significant excess of outlier statistics (P < 0.0001) in the unpolarized dataset, raising the possibility that a subset of these "hits" of selection on Chromosome 2 may be false positives. Overall, Polaris is a versatile package that enables users to efficiently explore, interpret, and report signals of recent selection in genomic datasets.

Availability and implementation: The Polaris package is free and open source on GitHub (https://github.com/alisi1989/Polaris) and on Dropbox (https://www.dropbox.com/scl/fo/mlxizft5267vem9u62qkn/AAnM0qX923zPzQBlPX8iteM?rlkey=uezrp4t2waffpj0nmo1evr320&e=1&st=jaodccws&dl=0).

Contact: alisi@usc.edu; mc44680@usc.edu

Keywords: ancestral alleles, derived alleles, natural selection, integrated haplotype score (iHS), extended haplotype homozygosity (EHH)
Bioinformatics
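
Illustrative sketch (not taken from the Polaris source code): the abstract describes polarizing variants into ancestral and derived alleles so that downstream iHS values carry a meaningful sign. A minimal way to picture that step for a single biallelic site is shown below. The function name `polarize_site`, the 0 = ancestral / 1 = derived coding, the case-insensitive comparison, and the choice to drop sites whose ancestral state matches neither allele are all assumptions made for illustration; Polaris itself may handle these cases differently.

```python
from typing import List, Optional, Tuple


def polarize_site(
    ref: str,
    alt: str,
    ancestral: str,
    haplotypes: List[int],
) -> Optional[Tuple[str, str, List[int]]]:
    """Recode one biallelic site so that 0 = ancestral and 1 = derived.

    `haplotypes` holds phased alleles coded as in a VCF (0 = REF, 1 = ALT).
    Returns (ancestral_allele, derived_allele, recoded_haplotypes), or None
    when the ancestral state matches neither allele and the site cannot be
    polarized (hypothetical policy: such sites are skipped).
    """
    anc = ancestral.upper()  # assumption: compare alleles case-insensitively
    if anc == ref.upper():
        # REF is already the ancestral allele: keep the coding unchanged.
        return ref, alt, list(haplotypes)
    if anc == alt.upper():
        # ALT is ancestral: swap the allele labels and flip every 0/1 call.
        return alt, ref, [1 - h for h in haplotypes]
    return None


# Toy example: REF=A, ALT=G, but G is the ancestral state, so calls are flipped.
print(polarize_site("A", "G", "G", [0, 0, 1, 0]))
```

With a consistent 0 = ancestral / 1 = derived coding at every site, the sign of a haplotype-based statistic can be read as pointing toward the ancestral or the derived allele, which is the distinction the abstract says is lost in an unpolarized VCF.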