A Fast Word Embedding Based Classifier to Profile Target Gene Databases in Metagenomic Samples

Gustavo Arango Argoty, Lenwood S. Heath, Amy Pruden, Peter J. Vikesland, Liqing Zhang
DOI: https://doi.org/10.1007/978-3-030-79290-9_10
2021-01-01
Abstract: The functional profile of metagenomic samples allows the understanding of the role of the microbes in the environment. Sequence alignment of short reads against curated databases has been widely used to profile metagenomic samples. However, this method is time-consuming and requires high computing resources. Although several alignment-free methods based on k-mer composition have been developed in recent years, they still require a large amount of memory. In this paper, MetaMLP (Metagenomics Machine Learning Profiler) is proposed to profile functional categories; it is a machine learning method that represents sequences as numerical vectors (embeddings) and uses a simple one-hidden-layer neural network. Unlike other methods, MetaMLP enables partial matching through a reduced alphabet for sequence embeddings. MetaMLP identifies a larger number of reads than Diamond (one of the fastest sequence alignment methods) while maintaining high performance, with 0.99 precision and 0.99 recall. MetaMLP can process 100 million reads in around 10 min on a laptop computer, a 50x speedup compared to Diamond. MetaMLP is freely available at https://bitbucket.org/gaarangoa/metamlp/src/master/.
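The idea of partial matching through a reduced alphabet can be illustrated with a minimal Python sketch. It is not the MetaMLP implementation: the amino acid grouping, the k-mer size, the embedding dimension, and every name below (`GROUPS`, `reduce_sequence`, `kmers`, `TinyClassifier`) are assumptions made for illustration, and the weights are random rather than trained. The sketch collapses similar amino acids into one symbol, splits the reduced sequence into k-mers, averages the k-mer vectors into a single hidden representation, and scores each label with a softmax.

```python
import math
import random

# Hypothetical grouping by rough physicochemical similarity.
# This is NOT the reduced alphabet used by MetaMLP.
GROUPS = {
    "A": "AVLIMC",  # hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # small / special
}
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group representative ('X' if unknown)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def kmers(seq, k=3):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class TinyClassifier:
    """Averaged k-mer embeddings followed by a linear softmax layer.

    Weights are random and untrained; only the data flow is shown.
    """

    def __init__(self, labels, dim=8, k=3, seed=0):
        self.labels = list(labels)
        self.dim = dim
        self.k = k
        self.rng = random.Random(seed)
        self.emb = {}  # k-mer -> vector, created lazily
        self.out = [[self.rng.uniform(-1, 1) for _ in range(dim)]
                    for _ in self.labels]

    def _vector(self, kmer):
        if kmer not in self.emb:
            self.emb[kmer] = [self.rng.uniform(-1, 1) for _ in range(self.dim)]
        return self.emb[kmer]

    def embed(self, seq):
        """Hidden representation: mean of the reduced-alphabet k-mer vectors."""
        tokens = kmers(reduce_sequence(seq), self.k)
        if not tokens:
            return [0.0] * self.dim
        vecs = [self._vector(t) for t in tokens]
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    def predict_proba(self, seq):
        """Softmax over label scores computed from the hidden representation."""
        h = self.embed(seq)
        scores = [sum(w * x for w, x in zip(row, h)) for row in self.out]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return {lab: e / total for lab, e in zip(self.labels, exps)}
```

In this sketch, two peptides that differ only by substitutions within a group (for example `MKVD` and `LRIE`) reduce to the same string and therefore receive the same embedding, which is the sense in which a reduced alphabet tolerates inexact matches. The label names passed to `TinyClassifier` are placeholders.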