Abstract: Defining the population structure of a pathogen is a key part of epidemiology, as genomically related isolates are likely to share key clinical features such as antimicrobial resistance profiles and invasiveness. Several methods are currently used to cluster closely related genomes, potentially leading to inconsistency between studies. Here, we use a global dataset of 26,306 Streptococcus pneumoniae genomes to compare four clustering methods: gene-by-gene seven-locus multilocus sequence typing (MLST), core genome MLST (cgMLST)-based hierarchical clustering (HierCC) assignments, Life Identification Number (LIN) barcoding, and k-mer-based PopPUNK clustering (known as GPSCs in this species). We compare the clustering results with phylogenetic and pan-genome analyses to assess their relationship with genome diversity and evolution, as we would expect a good clustering method to form a single monophyletic cluster that has high within-cluster similarity of genomic content. We show that, based on these metrics, the four methods generally reflect the population structure accurately and are broadly consistent with each other. We then examined the discrepancies between the clusterings. The greatest concordance was seen between LIN barcoding and HierCC (Adjusted Mutual Information (AMI) score = 0.950), as expected given that both methods use cgMLST, although they differ in how they define an individual cluster and in their core genome schema. However, the existence of differences between the two methods shows that the selection of a core genome schema can introduce inconsistencies between studies. GPSC and HierCC assignments were also highly concordant (AMI = 0.946), showing that k-mer-based methods, which use the whole genome and do not require the careful selection of a core genome schema, are just as effective at representing the population structure. Additionally, where there were differences in clustering between these methods, these could be explained by differences in the accessory genome that were not identified in cgMLST. We conclude that for S. pneumoniae, standardised and stable nomenclature is important as the number of available genomes expands. Furthermore, the research community should transition away from seven-locus MLST, and cgMLST, GPSC, and LIN assignments should be used more widely. However, to allow for easy comparison between studies and to keep previous literature relevant, the reporting of multiple clustering names should be standardised within research.
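The concordance figures quoted in the abstract are Adjusted Mutual Information (AMI) scores between pairs of cluster assignments. As an illustration only, the sketch below computes AMI in pure Python using the commonly used definition with arithmetic-mean normalisation of the two entropies. It is not taken from the paper: the function name, the zero-denominator convention, and the toy labels are assumptions, and the authors may have used a different implementation or normalisation.

```python
from collections import Counter
from math import exp, lgamma, log


def _entropy(counts, n):
    """Shannon entropy (natural log) of a partition given its cluster sizes."""
    return -sum((c / n) * log(c / n) for c in counts)


def _log_fact(k):
    """log(k!) via the log-gamma function, to avoid huge factorials."""
    return lgamma(k + 1)


def adjusted_mutual_information(labels_a, labels_b):
    """AMI between two clusterings of the same isolates.

    AMI = (MI - E[MI]) / (mean(H(A), H(B)) - E[MI]), where E[MI] is the
    expected mutual information under random permutation of the labels.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same isolates")
    n = len(labels_a)
    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Observed mutual information from the contingency table.
    mi = sum(
        (nij / n) * log(n * nij / (sizes_a[i] * sizes_b[j]))
        for (i, j), nij in joint.items()
    )

    # Expected mutual information under the hypergeometric null model.
    emi = 0.0
    for ai in sizes_a.values():
        for bj in sizes_b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_p = (
                    _log_fact(ai) + _log_fact(bj)
                    + _log_fact(n - ai) + _log_fact(n - bj)
                    - _log_fact(n) - _log_fact(nij)
                    - _log_fact(ai - nij) - _log_fact(bj - nij)
                    - _log_fact(n - ai - bj + nij)
                )
                emi += (nij / n) * log(n * nij / (ai * bj)) * exp(log_p)

    mean_entropy = (_entropy(sizes_a.values(), n) + _entropy(sizes_b.values(), n)) / 2
    denominator = mean_entropy - emi
    if abs(denominator) < 1e-12:
        # Assumed convention: degenerate partitions count as a perfect match.
        return 1.0
    return (mi - emi) / denominator


# Hypothetical toy assignments for six isolates (not data from the study).
gpsc = ["GPSC1", "GPSC1", "GPSC2", "GPSC2", "GPSC3", "GPSC3"]
hiercc = ["HC_a", "HC_a", "HC_b", "HC_b", "HC_b", "HC_c"]
print(adjusted_mutual_information(gpsc, hiercc))
```

The triple loop scales with the product of the numbers of clusters and their sizes, so for a dataset of tens of thousands of genomes an optimised library routine would be the practical choice; the sketch is meant only to make the quantity concrete.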

What problem does this paper attempt to address?

The paper aims to address the following issues:
1. **Comparison of different genotyping methods**: The study compares four clustering methods used to define the global population structure of *Streptococcus pneumoniae*, including the traditional seven-locus multilocus sequence typing (MLST) based on gene-by-gene analysis, core genome multilocus sequence typing (cgMLST) and its hierarchical clustering (HierCC), Life Identification Number (LIN) barcoding, and the k-mer-based method (PopPUNK/GPSC).
2. **Evaluation of consistency among different methods**: By comparing these clustering results with phylogenetic trees and pangenome analysis, the study evaluates their performance in reflecting genomic diversity and evolutionary relationships, and explores the differences between the methods.
3. **Standardized naming system**: The paper suggests that for *Streptococcus pneumoniae*, the traditional seven-locus MLST method should be gradually phased out in favor of methods like cgMLST, GPSC, and LIN barcoding. It emphasizes the need to standardize the reporting of multiple clustering names in research to maintain consistency.
4. **Addressing the limitations of existing methods**: The paper points out some limitations of the traditional MLST method, such as the inability to assign sequence types (STs) due to gene deletions or interruptions, over-clustering caused by high recombination rates, and the lack of resolution for closely related isolates. By comparing new methods, the study hopes to find more accurate and consistent clustering solutions.
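The comparison against phylogenetic trees rests on the expectation that a good cluster is monophyletic, that is, its members are exactly the tips below a single internal node. A minimal sketch of that check follows, assuming a rooted tree written as nested Python tuples with tip names as strings. The representation, function names, and toy tree are hypothetical and not from the paper, and the result depends on how the tree is rooted.

```python
def clade_leaf_sets(tree):
    """Return the set of tip names under every node of a nested-tuple tree."""
    clades = []

    def walk(node):
        if isinstance(node, tuple):
            # Internal node: its tips are the union of its children's tips.
            leaves = frozenset().union(*(walk(child) for child in node))
        else:
            # Tip: a single isolate name.
            leaves = frozenset([node])
        clades.append(leaves)
        return leaves

    walk(tree)
    return clades


def is_monophyletic(tree, cluster):
    """True if the cluster's members are exactly the tips of one clade."""
    return frozenset(cluster) in set(clade_leaf_sets(tree))


# Hypothetical five-isolate rooted tree: (((a, b), c), (d, e)).
toy_tree = ((("a", "b"), "c"), ("d", "e"))
print(is_monophyletic(toy_tree, {"a", "b", "c"}))
print(is_monophyletic(toy_tree, {"c", "d"}))
```

A real analysis would read a Newick tree with a phylogenetics library and handle unrooted trees; the nested-tuple form is used here only to keep the example self-contained.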

Comparison of gene-by-gene and genome-wide short nucleotide sequence based approaches to define the global population structure of Streptococcus pneumoniae

