mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species

Sergio Arredondo-Alonso, Malbert R. C. Rogers, Johanna C. Braat, T. D. Verschuuren, Janetta Top, Jukka Corander, Rob J.L. Willems, Anita C. Schürch
DOI: https://doi.org/10.1101/329045
2018-05-23
Abstract: Assembly of bacterial short-read whole genome sequencing (WGS) data frequently results in hundreds of contigs for which the origin, plasmid or chromosome, is unclear. Long-read sequencing has emerged as a solution to resolve plasmid structures and to obtain complete genomes for most bacterial species. This information can be used to generate and label datasets from short-read based contigs as plasmid- or chromosome-derived. We investigated the use of several popular machine learning methods to classify short-read contigs of known plasmid or chromosome origin from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. Based on the resulting F1-scores, we selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium = 0.94, F1-score K. pneumoniae = 0.90, F1-score E. coli = 0.76), which outperformed other existing plasmid tools on an independent set of isolates (precision E. faecium = 0.92, precision K. pneumoniae = 0.86, precision E. coli = 0.82). We demonstrated the scalability of our model by accurately predicting the plasmidome of a large collection of 1,644 E. faecium isolates, for which only short-read WGS data were available, using a standard laptop with a single core. The low number of false-positive predictions suggests that the models' assignment of a particular gene of interest as plasmid- or chromosome-encoded is plausible. The SVM classifiers are publicly available as a new R package called ‘mlplasmids’ at https://gitlab.com/sirarredondo/mlplasmids under the GNU General Public License v3.0. We additionally developed a graphical user interface using the Shiny package, which can be accessed at https://sarredondo.shinyapps.io/mlplasmids/ . Single genomes can easily be predicted by uploading genome assemblies.
We anticipate that this tool may significantly facilitate research on the dissemination of plasmids encoding antibiotic resistance and/or contributing to host adaptation.
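The abstract names pentamer frequencies as the features fed to the classifiers. As a rough illustration of what such a feature vector could look like, the following Python sketch counts overlapping 5-mers in a contig and normalizes them to relative frequencies. The function name `pentamer_frequencies`, the handling of ambiguous bases (windows containing anything other than A, C, G or T are skipped) and the normalization are assumptions made for this example; the abstract does not specify these details, and this is not the mlplasmids implementation.

```python
from collections import Counter
from itertools import product

K = 5  # pentamers
ALPHABET = "ACGT"
# All 4^5 = 1024 possible pentamers, in a fixed order.
ALL_PENTAMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]


def pentamer_frequencies(sequence: str) -> dict[str, float]:
    """Return the relative frequency of every pentamer in `sequence`.

    Hypothetical helper: windows containing characters other than
    A, C, G, T (for example N) are ignored, which is an assumption.
    """
    seq = sequence.upper()
    # Count every overlapping window of length K.
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    # Keep only valid pentamers, filling in zeros for unseen ones.
    valid = {kmer: counts.get(kmer, 0) for kmer in ALL_PENTAMERS}
    total = sum(valid.values())
    if total == 0:
        return {kmer: 0.0 for kmer in ALL_PENTAMERS}
    return {kmer: count / total for kmer, count in valid.items()}


if __name__ == "__main__":
    freqs = pentamer_frequencies("ACGTACGTAC")
    # Print only the pentamers that actually occur in the toy contig.
    for kmer, value in freqs.items():
        if value > 0:
            print(kmer, round(value, 3))
```

Each contig then maps to a fixed-length vector of 1,024 values, which is the kind of input a classifier such as an SVM can be trained on together with plasmid or chromosome labels.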