Sample Size Impact (SaSii): an R script for estimating optimal sample sizes in population genetics and population genomics studies

Matheus Scaketti, Patricia Sanae Sujii, Alessandro Alves-Pereira, Kaiser Dias Schwarcz, Ana Flávia Francisconi, Matheus Sartori Moro, Kauanne Karolline Moreno Martins, Thiago Araujo de Jesus, Guilherme Brenner Ferreira de Souza, Maria Imaculada Zucchi
DOI: https://doi.org/10.1101/2024.08.30.610119
2024-09-01
Abstract: Obtaining large sample sizes for genetic studies can be challenging, time-consuming, and expensive. However, small sample sizes may generate biased or imprecise results. Many studies have suggested the minimum sample size necessary to obtain robust and reliable results, but it is not possible to define one ideal minimum sample size that fits all studies. Here, we present SaSii (Sample Size Impact), an R script that helps researchers define a minimum sample size, and we report minimum sample size patterns for some taxonomic groups. The patterns were obtained by analyzing previously published datasets with SaSii and can be used as a starting point for the sampling design of population genetics and population genomics studies. Our results show that it is possible to estimate an adequate sample size that accurately represents the real population without requiring the researcher to write program code, extract and sequence samples, or use population genetics programs, which makes this information easier to obtain. We also confirmed that sample sizes of five to twenty-five individuals for SNP markers and fifteen to thirty individuals for SSR markers can be used for most plant species, giving better direction for new studies.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the problem of choosing an adequate sample size for population genetics and population genomics studies, given that no single minimum sample size fits every study. The authors developed an R script, SaSii (Sample Size Impact), that helps researchers determine a minimum sample size, and they used it on previously published datasets to derive minimum sample size patterns for certain taxonomic groups. These patterns can serve as a starting point for sampling design. The results indicate that, for most plant species, 5 to 25 individuals are sufficient for SNP data and 15 to 30 individuals for SSR data. The paper also evaluates SaSii on different types of molecular markers (SSR and SNP) and finds that SNP data generally require fewer individuals than SSR data. These findings help researchers plan sampling before committing to sample collection and sequencing.
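The general idea of judging a sample size against an existing dataset can be illustrated with a small resampling sketch. This is not SaSii's code or its documented procedure: it is written in Python rather than R, and the function names, the choice of expected heterozygosity as the statistic, the simulated genotypes, and the 5% tolerance are all assumptions made for illustration. The sketch repeatedly draws subsamples of a given size, compares the statistic from each subsample with the value from the full dataset, and reports the smallest size whose average relative error falls within the tolerance.

```python
import random
from statistics import mean


def expected_heterozygosity(genotypes):
    """Mean over loci of 2*p*(1-p) for biallelic loci.

    genotypes[i][j] is the alternate-allele count (0, 1 or 2) of
    individual i at locus j. No small-sample correction is applied.
    """
    n_ind = len(genotypes)
    n_loci = len(genotypes[0])
    total = 0.0
    for j in range(n_loci):
        p = sum(ind[j] for ind in genotypes) / (2 * n_ind)
        total += 2 * p * (1 - p)
    return total / n_loci


def subsample_error(genotypes, sizes, replicates=100, seed=0):
    """Mean relative error of the statistic for each subsample size."""
    rng = random.Random(seed)
    full = expected_heterozygosity(genotypes)  # assumed non-zero
    errors = {}
    for n in sizes:
        errs = []
        for _ in range(replicates):
            sample = rng.sample(genotypes, n)  # draw without replacement
            estimate = expected_heterozygosity(sample)
            errs.append(abs(estimate - full) / full)
        errors[n] = mean(errs)
    return errors


def smallest_adequate_size(errors, tolerance=0.05):
    """Smallest size whose mean relative error is within the tolerance."""
    for n in sorted(errors):
        if errors[n] <= tolerance:
            return n
    return None


def simulate_population(n_ind=200, n_loci=50, seed=1):
    """Hypothetical stand-in for a published dataset (random mating)."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_ind)
    ]


if __name__ == "__main__":
    population = simulate_population()
    errors = subsample_error(population, sizes=[5, 10, 15, 20, 25, 30])
    for n, err in errors.items():
        print(f"n={n:2d}  mean relative error={err:.3f}")
    print("smallest adequate size:", smallest_adequate_size(errors))
```

With a real dataset, the simulated genotypes would be replaced by the published genotype matrix, and the statistic and tolerance would be chosen to match the question the study is meant to answer.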