Abstract: Having the ability to predict the protein-encoding gene content of an incomplete genome or metagenome-assembled genome is important for a variety of bioinformatic tasks. In this study, as a proof of concept, we built machine learning classifiers for predicting variable gene content in Escherichia coli genomes using only the nucleotide k-mers from a set of 100 conserved genes as features. Protein families were used to define orthologs, and a single classifier was built for predicting the presence or absence of each protein family occurring in 10%-90% of all E. coli genomes. The resulting set of 3,259 extreme gradient boosting classifiers had a per-genome average macro F1 score of 0.944 [0.943-0.945, 95% CI]. We show that the F1 scores are stable across multi-locus sequence types and that the trend can be recapitulated by sampling a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including "hypothetical proteins," was accurately predicted (F1 = 0.902 [0.898-0.906, 95% CI]). Models for proteins with horizontal gene transfer-related functions had slightly lower F1 scores but were still accurate (F1s = 0.895, 0.872, 0.824, and 0.841 for transposon, phage, plasmid, and antimicrobial resistance-related functions, respectively). Finally, using a holdout set of 419 diverse E. coli genomes that were isolated from freshwater environmental sources, we observed an average per-genome F1 score of 0.880 [0.876-0.883, 95% CI], demonstrating the extensibility of the models. Overall, this study provides a framework for predicting variable gene content using a limited amount of input sequence data.

IMPORTANCE Having the ability to predict the protein-encoding gene content of a genome is important for assessing genome quality, binning genomes from shotgun metagenomic assemblies, and assessing risk due to the presence of antimicrobial resistance and other virulence genes. In this study, we built a set of binary classifiers for predicting the presence or absence of variable genes occurring in 10%-90% of all publicly available E. coli genomes. Overall, the results show that a large portion of the E. coli variable gene content can be predicted with high accuracy, including genes with functions relating to horizontal gene transfer. This study offers a strategy for predicting gene content using limited input sequence data.
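
As an illustration only, the two building blocks described above (nucleotide k-mer features drawn from conserved genes, and a macro F1 score computed per genome across protein families) might be sketched as follows. This is a minimal sketch and not the authors' pipeline: the k-mer length, the function names `kmer_counts` and `macro_f1`, and the reading of "macro F1" as the mean of the presence-class and absence-class F1 scores are all assumptions, and the gradient boosting step itself is omitted.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(sequences, k=3):
    """Return a fixed-order vector of overlapping k-mer counts.

    `sequences` stands in for one genome's conserved-gene sequences.
    The value of k is an assumption; the abstract does not state it.
    """
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set(ALPHABET):
                counts[kmer] += 1
    return [counts[kmer] for kmer in vocab]


def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for presence (1) and absence (0)."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2


if __name__ == "__main__":
    # Feature vector for a toy "genome" made of two short conserved-gene fragments.
    features = kmer_counts(["ACGTAC", "GGNTA"], k=2)
    print(len(features), sum(features))

    # Presence/absence of four hypothetical protein families in one genome.
    truth = [1, 1, 0, 0]
    predicted = [1, 0, 0, 0]
    print(round(macro_f1(truth, predicted), 3))
```

In a full workflow, one feature vector per genome would be fed to a separate binary classifier for each protein family, and the predicted presence/absence calls for a genome would then be scored together with `macro_f1`.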

Predicting variable gene content in Escherichia coli using conserved genes
