Abstract: Metagenomic binning is an essential task in analyzing metagenomic sequence datasets. To analyze structure or function of microbial communities from environmental samples, metagenomic sequence fragments are assigned to their taxonomic origins. Although sequence alignment algorithms can readily be used and usually provide high-resolution alignments and accurate binning results, the computational cost of such alignment-based methods becomes prohibitive as metagenomic datasets continue to grow. Alternative compositional-based methods, which exploit sequence composition by profiling local short k-mers in fragments, are often faster but less accurate than alignment-based methods. Inspired by the success of linear error correcting codes in noisy channel communication, we introduce Opal, a fast and accurate novel compositional-based binning method. It incorporates ideas from Gallager's low-density parity-check code to design a family of compact and discriminative locality-sensitive hashing functions that encode long-range compositional dependencies in long fragments. By incorporating the Gallager LSH functions as features in a simple linear SVM, Opal provides fast, accurate and robust binning for datasets consisting of a large number of species, even with mutations and sequencing errors. Opal not only performs up to two orders of magnitude faster than BWA, an alignment-based binning method, but also achieves improved binning accuracy and robustness to sequencing errors. Opal also outperforms models built on traditional k-mer profiles in terms of robustness and accuracy. Finally, we demonstrate that we can effectively use Opal in the "coarse search" stage of a compressive genomics pipeline to identify a much smaller candidate set of taxonomic origins for a subsequent alignment-based method to analyze, thus providing metagenomic binning with high scalability, high accuracy and high resolution.
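The contrast the abstract draws between contiguous k-mer profiles and low-density hashing can be illustrated with a minimal Python sketch. This is not Opal's implementation: the function names, the parameters `k`, `t` and `rounds`, and the shuffle-and-split way of choosing positions are all assumptions made for illustration. The only idea carried over from the abstract is that each hash reads a sparse subset of positions in a long window, with the subsets spread evenly over the window in the spirit of a low-density parity-check design.

```python
import random
from collections import Counter


def kmer_profile(seq, k):
    """Count every contiguous k-mer in seq (the traditional compositional feature)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def even_coverage_position_sets(k, t, rounds, seed=0):
    """Hypothetical helper: build sparse position subsets of a length-k window.

    In each round the k positions are shuffled and split into k // t groups
    of size t, so every position is sampled exactly once per round.
    """
    if k % t != 0:
        raise ValueError("k must be a multiple of t in this sketch")
    rng = random.Random(seed)
    position_sets = []
    for _ in range(rounds):
        perm = list(range(k))
        rng.shuffle(perm)
        for j in range(0, k, t):
            position_sets.append(tuple(sorted(perm[j:j + t])))
    return position_sets


def sparse_profile(seq, k, position_sets):
    """Count, for each position subset, the bases read at those positions
    in every length-k window of seq."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for h, positions in enumerate(position_sets):
            key = "".join(window[p] for p in positions)
            counts[(h, key)] += 1
    return counts
```

With `k=12`, `t=4` and `rounds=2`, a single substitution in a 12-base fragment changes the fragment's only 12-mer, but it alters just the one subset per round that samples the mutated position, so four of the six sparse features are unchanged. Feature vectors of this kind could then be passed to a linear classifier, which is the role the abstract assigns to the linear SVM.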

SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing

SemiBin: Incorporating Information from Reference Genomes with Semi-Supervised Deep Learning Leads to Better Metagenomic Assembled Genomes (MAGs)

A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments

SolidBin: Improving Metagenome Binning with Semi-Supervised Normalized Cut

Integrating chromatin conformation information in a self-supervised learning model improves metagenome binning

Effective binning of metagenomic contigs using contrastive multi-view representation learning

CLMB: deep contrastive learning for robust metagenomic binning

Binning Metagenomic Contigs Using Contig Embedding and Decomposed Tetranucleotide Frequency

SMeta, a binning tool using single-cell sequences to aid in reconstructing species from metagenome accurately

MetaBinner: a High-Performance and Stand-Alone Ensemble Binning Method to Recover Individual Genomes from Complex Microbial Communities

Binning meets taxonomy: TaxVAMB improves metagenome binning using bi-modal variational autoencoder

MetaBinG2: a fast and accurate metagenomic sequence classification system for samples with many unknown organisms

A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting.

A novel deep contrastive convolutional autoencoder based binning approach for taxonomic independent metagenomics data

CompostBin: A DNA composition-based algorithm for binning environmental shotgun reads

Binette: a fast and accurate bin refinement tool to construct high quality Metagenome Assembled Genomes

scSemiAE: a deep model with semi-supervised learning for single-cell transcriptomics

GenomeFace: a deep learning-based metagenome binner trained on 43,000 microbial genomes

Low-density locality-sensitive hashing boosts metagenomic binning

Exploiting Topic Modeling to Boost Metagenomic Reads Binning.

BinBencher: Fast, flexible and meaningful benchmarking suite for metagenomic binning