Deep-BGCpred: A unified deep learning genome-mining framework for biosynthetic gene cluster prediction

Ziyi Yang, Benben Liao, Changyu Hsieh, Chao Han, Liang Fang, Shengyu Zhang
DOI: https://doi.org/10.1101/2021.11.15.468547
2021-11-16
Abstract: Natural products produced by microorganisms are an important source of essential pharmaceuticals, including antimicrobial and anti-tumor drugs. These bioactive molecules are microbial secondary metabolites synthesized by co-localized genes termed Biosynthetic Gene Clusters (BGCs). The rapid growth of microbial genomic resources, driven by the availability of high-throughput sequencing technologies, has spurred the development of computational methods that mine microbial genomes for BGCs. Current machine learning methods, however, have had limited success in uncovering novel BGCs because their predictions contain an excessive number of false positives. To address this issue, we propose Deep-BGCpred, a framework that improves on a deep learning model termed DeepBGC. The new model embeds multi-source protein family domains and employs a stacked Bidirectional Long Short-Term Memory model to improve the accuracy of BGC identification. In particular, it integrates two customized strategies, a sliding window strategy and dual-model serial screening, to improve the stability of the model’s performance and reduce the number of false positives in BGC predictions. We compare the proposed model against other well-established methods on common benchmarks and achieve new state-of-the-art results. We expect that researchers working on genome mining for natural products will benefit greatly from our newly proposed method, Deep-BGCpred.
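The abstract names the sliding window strategy and dual-model serial screening without describing their mechanics. The sketch below is one plausible reading only, not the authors' implementation: it assumes a first model has already produced a per-domain BGC score, smooths those scores with a centered moving average (our stand-in for a sliding window), extracts contiguous candidate regions above a threshold, and then keeps only the candidates that a second model also accepts. All function names, thresholds, and the toy second model are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Region = Tuple[int, int]  # half-open interval [start, end) over domain positions


def moving_average(scores: Sequence[float], window: int) -> List[float]:
    """Smooth per-position scores with a centered window (truncated at the ends)."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def candidate_regions(scores: Sequence[float], threshold: float) -> List[Region]:
    """Return maximal runs of consecutive positions scoring at or above threshold."""
    regions: List[Region] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    return regions


def serial_screen(
    scores: Sequence[float],
    second_model: Callable[[Region], float],  # hypothetical second-stage scorer
    window: int = 3,
    first_threshold: float = 0.6,
    second_threshold: float = 0.5,
) -> List[Region]:
    """Stage 1: smooth and threshold. Stage 2: keep regions the second model accepts."""
    smoothed = moving_average(scores, window)
    candidates = candidate_regions(smoothed, first_threshold)
    return [r for r in candidates if second_model(r) >= second_threshold]


# Toy input: a three-domain cluster plus one isolated high-scoring domain.
raw_scores = [0.1, 0.9, 0.9, 0.9, 0.1, 0.1, 0.8, 0.1, 0.1]


def toy_second_model(region: Region) -> float:
    """Stand-in for a trained model: accepts regions spanning at least 3 domains."""
    return 1.0 if region[1] - region[0] >= 3 else 0.0


predicted = serial_screen(raw_scores, toy_second_model)
```

In this toy input, smoothing suppresses the isolated spike at position 6, so only the three-domain run survives the first stage; the second stage then acts as an independent check on each surviving candidate. Screening in series means a region is reported only when both stages agree, which trades some recall for fewer false positives.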