Comparison between ribosomal assembly and machine learning tools for microbial identification of organisms with different characteristics

Stephanie Chau, Carlos Rojas, Jorjeta G. Jetcheva, Mary Markart, Sudha Vijayakumar, Sophia Yuan, Vincent Stowbunenko, Amanda N. Shelton, William B. Andreopoulos
DOI: https://doi.org/10.1101/2022.09.30.510284
2024-01-30
Abstract: Genome assembly tools are used to reconstruct genomic sequences from raw sequencing data, which are then used for identifying the organisms present in a metagenomic sample. More recently, machine learning approaches have been applied to a variety of bioinformatics problems, and in this paper, we explore their use for organism identification. We start out by evaluating several commonly used metagenomic assembly tools, including PhyloFlash, MEGAHIT, MetaSPAdes, Kraken2, Mothur, UniCycler, and PathRacer, and compare them against state-of-the-art deep learning-based classification approaches, represented by DNABERT and DeLUCS, in the context of two synthetic mock community datasets. Our analysis focuses on determining whether ensembling metagenome assembly tools with machine learning tools has the potential to improve identification performance relative to using the tools individually. We find that this is indeed the case, and analyze the level of effectiveness of potential tool ensembling for organisms with different characteristics (based on factors such as repetitiveness, genome size, and GC content).
Bioinformatics
What problem does this paper attempt to address?
This paper asks whether microbial identification, particularly at the species level, becomes more accurate when traditional genome assembly tools are combined with modern machine learning methods. The authors evaluate several commonly used metagenomic assembly tools (including PhyloFlash, MEGAHIT, MetaSPAdes, Kraken2, Mothur, UniCycler, and PathRacer) alongside deep learning-based classification methods (represented by DNABERT and DeLUCS) on two synthetic mock community datasets, and examine how the tools perform on organisms with different characteristics, based on factors such as repetitiveness, genome size, and GC content. The central question is whether ensembling metagenome assembly tools with machine learning tools improves identification performance relative to using each tool individually. The paper notes that existing metagenomic analysis methods perform reasonably well at the genus level and above but still face significant challenges in distinguishing closely related species and strains. It therefore explores whether combining different tools can yield more accurate identification at the species level.
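To make the idea of ensembling concrete, here is a minimal sketch of one way to combine the species calls of several tools and score the result against a known mock community. It is an illustration only: the vote-threshold rule, the function names `ensemble_calls` and `precision_recall`, the tool labels, and the species lists are all assumptions for this example, not the ensembling procedure or results reported in the paper.

```python
from collections import Counter


def ensemble_calls(tool_calls, min_votes=1):
    """Return the species reported by at least `min_votes` tools.

    tool_calls maps a tool label to the set of species it reported.
    min_votes=1 gives the union of all calls; higher values require agreement.
    """
    votes = Counter()
    for species in tool_calls.values():
        votes.update(set(species))  # each tool votes at most once per species
    return {name for name, n in votes.items() if n >= min_votes}


def precision_recall(predicted, truth):
    """Precision and recall of a predicted species set against the true set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical mock community and hypothetical per-tool outputs.
truth = {"Escherichia coli", "Bacillus subtilis", "Pseudomonas aeruginosa"}
tool_calls = {
    "assembler_a": {"Escherichia coli", "Bacillus subtilis"},
    "assembler_b": {"Escherichia coli", "Salmonella enterica"},
    "ml_classifier": {
        "Escherichia coli",
        "Bacillus subtilis",
        "Pseudomonas aeruginosa",
    },
}

union = ensemble_calls(tool_calls, min_votes=1)
majority = ensemble_calls(tool_calls, min_votes=2)

print("union:", precision_recall(union, truth))
print("majority:", precision_recall(majority, truth))
```

In this toy setup the union recovers every true species at the cost of one false positive, while the two-vote rule removes the false positive but misses a species that only one tool detected. That trade-off is the kind of behavior an ensembling analysis would need to weigh per organism.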