DSNetax: a deep learning species annotation method based on a deep-shallow parallel framework

Hongyuan Zhao, Suyi Zhang, Hui Qin, Xiaogang Liu, Dongna Ma, Xiao Han, Jian Mao, Shuangping Liu
DOI: https://doi.org/10.1093/bib/bbae157
IF: 9.5
2024-03-27
Briefings in Bioinformatics
Abstract: Microbial community analysis is an important field to study the composition and function of microbial communities. Microbial species annotation is crucial to revealing microorganisms’ complex ecological functions in environmental, ecological and host interactions. Currently, widely used methods can suffer from issues such as inaccurate species-level annotations and time and memory constraints, and as sequencing technology advances and sequencing costs decline, microbial species annotation methods with higher quality classification effectiveness become critical. Therefore, we processed 16S rRNA gene sequences into k-mers sets and then used a trained DNABERT model to generate word vectors. We also design a parallel network structure consisting of deep and shallow modules to extract the semantic and detailed features of 16S rRNA gene sequences. Our method can accurately and rapidly classify bacterial sequences at the SILVA database’s genus and species level. The database is characterized by long sequence length (1500 base pairs), multiple sequences (428,748 reads) and high similarity. The results show that our method has better performance. The technique is nearly 20% more accurate at the species level than the currently popular naive Bayes-dominated QIIME 2 annotation method, and the top-5 results at the species level differ from BLAST methods by <2%. In summary, our approach combines a multi-module deep learning approach that overcomes the limitations of existing methods, providing an efficient and accurate solution for microbial species labeling and more reliable data support for microbiology research and application.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
This paper aims to solve several key problems in microbial species annotation:

1. **Limitations of existing methods**:
   - Widely used methods are not accurate enough at species-level annotation.
   - Existing methods run into time and memory constraints on large-scale data, which makes high-throughput sequencing data hard to process efficiently.
2. **Improving classification accuracy**:
   - As sequencing technology advances and sequencing costs fall, higher-quality microbial species annotation methods are needed to improve classification.
3. **Handling long sequences and high-similarity data**:
   - The SILVA data used in the paper combine long sequences (about 1,500 base pairs), a large number of sequences (428,748 reads) and high similarity between sequences. The summary presents this combination as a challenge for traditional annotation methods.

### Specific solutions

To solve these problems, the paper proposes **DSNetax**, a deep learning species annotation method based on a deep-shallow parallel framework. Specifically:

- **Data pre-processing**: 16S rRNA gene sequences are converted into k-mer sets, and a trained DNABERT model generates word vectors from them.
- **Model structure**: A parallel network with deep and shallow modules extracts the semantic and detailed features of 16S rRNA gene sequences.
- **Performance improvement**: In the reported experiments, DSNetax is nearly 20% more accurate at the species level than the currently popular naive Bayes-based QIIME 2 annotation method, and its top-5 results at the species level differ from those of BLAST by less than 2%.
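
The k-mer pre-processing step can be illustrated with a minimal sketch. It assumes overlapping k-mers with a stride of 1, the convention used by DNABERT tokenizers, and takes `k = 6` as an illustrative default; the summary does not state which k DSNetax uses. The function name `kmer_tokens` is hypothetical and is not taken from the paper's code.

```python
def kmer_tokens(sequence: str, k: int = 6) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Assumptions (not from the paper): k defaults to 6, input is upper-cased,
    and RNA 'U' is mapped to DNA 'T' before tokenization.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.strip().upper().replace("U", "T")
    # A sequence shorter than k yields no k-mers (empty range).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_sentence(sequence: str, k: int = 6) -> str:
    """Join k-mers with spaces, the 'sentence' form a BERT-style tokenizer reads."""
    return " ".join(kmer_tokens(sequence, k))


if __name__ == "__main__":
    # Short toy fragment, not a real 16S rRNA sequence.
    print(kmer_sentence("ACGTACGT", k=3))
```

A sequence of length n produces n - k + 1 tokens, so a 1,500 bp 16S rRNA sequence gives 1,495 tokens at k = 6. How DSNetax fits sequences of that length to the language model's input limit is not described in this summary.
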
### Conclusion

DSNetax combines multi-module deep learning methods to overcome the limitations of existing methods, providing an efficient and accurate solution for microbial species annotation and more reliable data support for microbiology research and applications.