Deep learning for predicting 16S rRNA gene copy number

Jiazheng Miao, Tianlai Chen, Mustafa Misir, Yajuan Lin
DOI: https://doi.org/10.1038/s41598-024-64658-5
IF: 4.6
2024-06-22
Scientific Reports
Abstract: Culture-independent 16S rRNA gene metabarcoding is a commonly used method for microbiome profiling. To achieve more quantitative cell fraction estimates, it is important to account for the 16S rRNA gene copy number (hereafter 16S GCN) of different community members. Currently, there are several bioinformatic tools available to estimate the 16S GCN values, either based on taxonomy assignment or phylogeny. Here we present a novel approach ANNA16, Artificial Neural Network Approximator for 16S rRNA gene copy number, a deep learning-based method that estimates the 16S GCN values directly from the 16S gene sequence strings. Based on 27,579 16S rRNA gene sequences and gene copy number data from the rrnDB database, we show that ANNA16 outperforms the commonly used 16S GCN prediction algorithms. Interestingly, Shapley Additive exPlanations (SHAP) shows that ANNA16 can identify unexpected informative positions in 16S rRNA gene sequences without any prior phylogenetic knowledge, which suggests potential applications beyond 16S GCN prediction.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper addresses a bias in microbiome analysis: relative abundances derived from 16S rRNA gene sequencing do not accurately reflect actual cell proportions, because the 16S rRNA gene copy number (16S GCN) differs among microorganisms. The copy number per genome varies from 1 to 21 across microbial species, so a taxon with many copies contributes more reads per cell than a taxon with few, and read counts alone misrepresent the composition of the sample. Existing bioinformatics tools predict 16S GCN values, but they usually rely on taxonomic or phylogenetic information, which limits them. The paper proposes ANNA16 (Artificial Neural Network Approximator for 16S rRNA gene copy number), a deep learning method that predicts 16S GCN values directly from 16S rRNA gene sequences, with the aim of improving prediction accuracy and thereby the quantitative analysis of the microbiome. In comparisons with existing taxonomy- and phylogeny-based methods, the study reports that ANNA16 predicts 16S GCN with higher accuracy and robustness, especially when only partial regions of the 16S rRNA gene are available. The study also interprets the model with Shapley Additive exPlanations (SHAP), which reveals potentially informative sites in the 16S rRNA gene sequence that were not identified in previous studies.
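To make the copy-number bias concrete, here is a minimal sketch of the usual correction: divide each taxon's read count by its 16S GCN, then renormalize to get estimated cell fractions. It is not code from the paper. The function name `correct_by_copy_number`, the taxon labels, and the read counts and copy numbers are all hypothetical, chosen only for illustration. In practice the copy numbers would come from a predictor such as ANNA16 or from a database lookup.

```python
def correct_by_copy_number(read_counts, copy_numbers):
    """Convert 16S read counts into estimated cell fractions.

    read_counts:  dict mapping taxon -> number of 16S reads (hypothetical input)
    copy_numbers: dict mapping taxon -> 16S GCN per genome (hypothetical input)
    """
    # Reads divided by copies per genome approximates the number of cells.
    cell_estimates = {
        taxon: reads / copy_numbers[taxon]
        for taxon, reads in read_counts.items()
    }
    total = sum(cell_estimates.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    # Renormalize so the estimated cell fractions sum to 1.
    return {taxon: est / total for taxon, est in cell_estimates.items()}


# Illustrative values only: taxon_A carries 7 copies per genome, taxon_B carries 1.
read_counts = {"taxon_A": 700, "taxon_B": 300}
copy_numbers = {"taxon_A": 7, "taxon_B": 1}

fractions = correct_by_copy_number(read_counts, copy_numbers)
print(fractions)
```

In this made-up example taxon_A accounts for 70% of the reads but only 25% of the estimated cells, which is the kind of distortion that accurate 16S GCN prediction is meant to remove.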