assessPool: a flexible pipeline for population genomic analyses of pooled sequencing data

Evan B Freel, Emily E Conklin, Derek W Kraft, Jonathan L Whitney, Ingrid SS Knapp, Zac H Forsman, Robert J Toonen
DOI: https://doi.org/10.1101/2024.10.09.617480
2024-10-10
Abstract: Despite the dramatic decrease in high-throughput sequencing costs over time, sequencing the ideal number of individuals for population genetic inference remains prohibitively expensive. When research questions require only population-level resolution, pooling individual samples before sequencing (pool-seq) can substantially reduce costs while still providing allele frequencies of single nucleotide polymorphisms (SNPs). However, analyzing pooled data is more difficult and less standardized than individual-based analysis. Although several programs have been developed to handle pool-seq data, most require extensive formatting or programming skills to operate. Here we introduce assessPool, an open-source R and Bash pipeline for pool-seq analyses with a focus on population structure. AssessPool accepts a Variant Call Format (VCF) file and a FASTA-formatted reference, providing a straightforward transition from commonly used pipelines such as Stacks or dDocent. AssessPool handles varying numbers of pools and uses PoPoolation2 to generate locus-by-locus pairwise FST values and associated Fisher's exact test values as measures of population structure. Starting with a VCF file containing all identified SNPs, assessPool facilitates several key functionalities for population genetic analyses: i) filtering SNPs based on adjustable criteria with parameter suggestions for pool-seq data, ii) organizing data structures for analysis based on input pools, iii) creating customizable run scripts for FST calculations using PoPoolation2 and/or the {poolfstat} R package, for all pairwise comparisons, iv) calculating locus-specific FST values using PoPoolation2 and/or {poolfstat}, v) importing FST output into a format compatible with R, vi) producing population genomic summary statistics, and vii) generating interactive plots to visualize and explore data.
A pooled dataset generated from wild populations is used here to showcase the features of the assessPool pipeline for population genomic analyses.
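
The core idea of steps i) to iv), filtering SNPs and then computing a locus-by-locus FST for every pair of pools, can be illustrated with a minimal Python sketch. This is not assessPool's code, and it does not reproduce the estimators in PoPoolation2 or {poolfstat}, which account for pool size and sequencing depth. It assumes a simple Nei-style estimator, FST = (H_T - H_S) / H_T, and a single depth filter. The pool names, loci, read counts, and the `min_depth` threshold are all hypothetical values chosen for illustration.

```python
from itertools import combinations

# Hypothetical read counts per pool and SNP: (reference reads, alternate reads).
# All names and numbers here are invented for illustration only.
counts = {
    "pool_A": {"locus1": (30, 10), "locus2": (5, 3)},
    "pool_B": {"locus1": (10, 30), "locus2": (4, 4)},
    "pool_C": {"locus1": (20, 20), "locus2": (25, 25)},
}


def alt_freq(ref_reads, alt_reads):
    """Alternate-allele frequency estimated from read counts in one pool."""
    return alt_reads / (ref_reads + alt_reads)


def het(p):
    """Expected heterozygosity for a biallelic SNP with allele frequency p."""
    return 2 * p * (1 - p)


def simple_fst(p1, p2):
    """Nei-style FST = (H_T - H_S) / H_T for two pools.

    Assumption: no correction for pool size or read depth, unlike the
    estimators used by dedicated pool-seq tools.
    """
    h_t = het((p1 + p2) / 2)          # heterozygosity of the combined pools
    if h_t == 0:                      # monomorphic in both pools: undefined
        return None
    h_s = (het(p1) + het(p2)) / 2     # mean within-pool heterozygosity
    return (h_t - h_s) / h_t


def pairwise_fst(counts, min_depth=20):
    """Locus-by-locus FST for every pair of pools, after a depth filter."""
    results = {}
    for a, b in combinations(sorted(counts), 2):
        shared_loci = counts[a].keys() & counts[b].keys()
        for locus in sorted(shared_loci):
            ref_a, alt_a = counts[a][locus]
            ref_b, alt_b = counts[b][locus]
            # Skip loci where either pool falls below the depth threshold.
            if ref_a + alt_a < min_depth or ref_b + alt_b < min_depth:
                continue
            fst = simple_fst(alt_freq(ref_a, alt_a), alt_freq(ref_b, alt_b))
            if fst is not None:
                results[(a, b, locus)] = fst
    return results


if __name__ == "__main__":
    for (a, b, locus), fst in pairwise_fst(counts).items():
        print(f"{a} vs {b} at {locus}: FST = {fst:.4f}")
```

With three pools the loop visits three pairwise comparisons. In this toy data `locus2` is dropped from every comparison because pools A and B fall below the depth threshold, which mirrors how a coverage filter reduces the set of SNPs available for each pair.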
Bioinformatics
What problem does this paper attempt to address?