Comparative analysis of metagenomic classifiers for long-read sequencing datasets

Josip Marić, Krešimir Križanović, Sylvain Riondet, Niranjan Nagarajan, Mile Šikić
DOI: https://doi.org/10.1186/s12859-024-05634-8
IF: 3.307
2024-01-13
BMC Bioinformatics
Abstract: Long reads have gained popularity in the analysis of metagenomics data. Therefore, we comprehensively assessed metagenomics classification tools on the species taxonomic level. We analysed k-mer-based tools, mapping-based tools and two general-purpose long-read mappers. We evaluated more than 20 pipelines which use either nucleotide or protein databases and selected 13 for an extensive benchmark. We prepared seven synthetic datasets to test various scenarios, including the presence of a host, unknown species and related species. Moreover, we used available sequencing data from three well-defined mock communities, including a dataset with abundance varying from 0.0001 to 20%, and from six real gut microbiomes.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology
What problem does this paper attempt to address?
This paper addresses the performance evaluation of metagenomic classification tools on long-read sequencing datasets. Specifically, the researchers comprehensively evaluated k-mer-based tools, mapping-based tools, and two general-purpose long-read mappers at the species taxonomic level. They screened more than 20 pipelines, which classify metagenomic data using either nucleotide or protein databases, and selected 13 for detailed benchmarking. They prepared seven synthetic datasets to test different scenarios, including the presence of a host, unknown species, and related species. They also used sequencing data from three well-defined mock communities, including one with abundance varying from 0.0001 to 20%, and six real gut microbiome datasets. Through these tests, the researchers aim to understand how different types of classification tools perform on long-read sequencing data, in particular how accuracy and resource consumption are affected by factors such as read length, database integrity, and the definition of abundance used. The ultimate goal is to guide the selection of metagenomic classification tools for specific application scenarios, while pointing out the limitations of current tools and directions for future improvement.
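As a rough illustration of why the definition of abundance matters when comparing classifiers, the Python sketch below contrasts two common definitions of relative abundance (share of assigned reads versus genome-length-normalized coverage) and computes species-level precision, recall, and F1 for a predicted species set. It is a minimal sketch under stated assumptions: the function names, species labels, and numbers are hypothetical and are not taken from the paper, whose exact metrics may differ.

```python
def read_fraction_abundance(read_counts):
    """Relative abundance as the share of classified reads per species."""
    total = sum(read_counts.values())
    return {sp: n / total for sp, n in read_counts.items()}


def length_normalized_abundance(base_counts, genome_lengths):
    """Relative abundance as coverage (assigned bases / genome length),
    renormalized so that the values sum to 1."""
    coverage = {sp: base_counts[sp] / genome_lengths[sp] for sp in base_counts}
    total = sum(coverage.values())
    return {sp: c / total for sp, c in coverage.items()}


def detection_scores(predicted, truth):
    """Species-level precision, recall, and F1 for sets of species names."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: species_A has a genome three times larger than species_B.
read_counts = {"species_A": 60, "species_B": 40}
base_counts = {"species_A": 600_000, "species_B": 400_000}
genome_lengths = {"species_A": 6_000_000, "species_B": 2_000_000}

by_reads = read_fraction_abundance(read_counts)
by_coverage = length_normalized_abundance(base_counts, genome_lengths)
scores = detection_scores(
    {"species_A", "species_B", "species_C"},  # hypothetical classifier output
    {"species_A", "species_B", "species_D"},  # hypothetical ground truth
)

print(by_reads)
print(by_coverage)
print(scores)
```

In this toy case species_A dominates by read share but species_B dominates once genome length is taken into account, so the same classifier output can score differently depending on which definition the ground truth uses.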