Predicting the bacterial host range of plasmid genomes using the language model-based one-class SVM algorithm

Tao Feng, Xirao Chen, Shufang Wu, Hongwei Zhou, Zhencheng Fang
DOI: https://doi.org/10.1101/2024.08.27.609848
2024-08-28
Abstract: Predicting the plasmid host range is crucial for investigating the dissemination of plasmids and the plasmid-mediated transfer of resistance and virulence genes. Several machine learning-based tools have been developed to predict plasmid host ranges. These tools have been trained and tested on the bacterial host records of plasmids in related databases. Typically, a plasmid genome in databases such as NCBI is annotated with only one or a few bacterial hosts, which does not encompass all possible hosts. Consequently, existing methods may significantly underestimate the host ranges of mobilizable plasmids. In this work, we propose a novel method named HRPredict, which employs a word vector model to digitally represent the proteins encoded on plasmid genomes. Since it is difficult to confirm which hosts a particular plasmid definitely cannot enter, we frame the prediction of whether a plasmid can enter a specific bacterium as a learning task without negative samples. Using multiple one-class SVMs, which do not require negative samples for training, HRPredict predicts the host range of plasmids across 45 families, 56 genera, and 56 species. In the benchmark test set, we constructed reliable negative samples for each host taxonomic unit via two indirect methods, and we found that the AUC, F1-score, recall, precision, and accuracy of most taxonomic unit prediction models exceeded 0.9. Among the 13 broad-host-range plasmid types, HRPredict demonstrated greater coverage than HOTSPOT and PlasmidHostFinder, successfully predicting the majority of previously reported hosts. Through the feature importance calculation for each SVM model, we found that genes closely related to the plasmid host range are involved in functions such as bacterial adaptability, pathogenicity, and survival. These findings provide significant insight into the mechanisms by which bacteria adapt to diverse environments via plasmids.
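The one-class setup described in the abstract can be illustrated with a minimal sketch. This is not the HRPredict implementation: it assumes scikit-learn's `OneClassSVM`, uses synthetic vectors in place of real protein word-vector embeddings, and the taxon names, dimension `DIM`, hyperparameters, and the helper `predict_host_range` are all hypothetical. It only shows the general idea of training one model per host taxon on positive samples alone and reporting every taxon whose model accepts a plasmid.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical stand-in data: each row plays the role of one plasmid's
# embedding (e.g. an average of protein word vectors). Real features would
# come from a trained word vector model, which is not reproduced here.
DIM = 16
positives = {
    "TaxonA": rng.normal(loc=0.0, scale=0.5, size=(200, DIM)),
    "TaxonB": rng.normal(loc=3.0, scale=0.5, size=(200, DIM)),
}

# One model per host taxon, fitted only on plasmids recorded in that host.
# No negative samples are used during training.
models = {
    taxon: OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X)
    for taxon, X in positives.items()
}


def predict_host_range(embedding):
    """Return the taxa whose one-class model accepts the plasmid embedding."""
    x = np.asarray(embedding, dtype=float).reshape(1, -1)
    # OneClassSVM.predict returns +1 for inliers and -1 for outliers.
    return [taxon for taxon, model in models.items() if model.predict(x)[0] == 1]


typical_a = positives["TaxonA"].mean(axis=0)  # resembles TaxonA plasmids
far_away = np.full(DIM, 100.0)                # resembles no training plasmid

print(predict_host_range(typical_a))
print(predict_host_range(far_away))
```

Because each taxon has its own independent model, a plasmid can be assigned to zero, one, or several hosts, which is what allows a broad host range to be reported even when the database lists only a single host.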
Bioinformatics