Recent development of antiSMASH and other computational approaches to mine secondary metabolite biosynthetic gene clusters

Kai Blin, Hyun Uk Kim, Marnix H Medema, Tilmann Weber
DOI: https://doi.org/10.1093/bib/bbx146
IF: 9.5
2017-11-03
Briefings in Bioinformatics
Abstract: Many drugs are derived from small molecules produced by microorganisms and plants, so-called natural products. Natural products have diverse chemical structures, but the biosynthetic pathways producing those compounds are often organized as biosynthetic gene clusters (BGCs) and follow a highly conserved biosynthetic logic. This allows for the identification of core biosynthetic enzymes using genome mining strategies that are based on the sequence similarity of the involved enzymes/genes. However, mining for a variety of BGCs quickly approaches a complexity level where manual analyses are no longer possible and require the use of automated genome mining pipelines, such as the antiSMASH software. In this review, we discuss the principles underlying the predictions of antiSMASH and other tools and provide practical advice for their application. Furthermore, we discuss important caveats such as rule-based BGC detection, sequence and annotation quality and cluster boundary prediction, which all have to be considered while planning for, performing and analyzing the results of genome mining studies.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
The problem this paper attempts to address is how to effectively mine secondary metabolite biosynthetic gene clusters (BGCs) in microbial and plant genomes using computational methods (such as antiSMASH and other tools) to discover new natural products.

### Background and Problem

Many drugs are derived from small molecules produced by microbes and plants, known as natural products. Natural products have diverse chemical structures, but their biosynthetic pathways are usually organized into biosynthetic gene clusters (BGCs) and follow a highly conserved biosynthetic logic. This allows core biosynthetic enzymes to be identified through genome mining strategies based on sequence similarity. However, as the variety of BGCs grows, manual analysis is no longer feasible, which makes automated genome mining pipelines such as antiSMASH necessary.

### Main Issues

1. **Complexity**: As the number of BGC types increases, manual analysis becomes too complex and automated tools are required.
2. **Prediction Accuracy**: How to improve the accuracy of BGC predictions, especially across different types of secondary metabolites.
3. **Data Quality**: The quality of the input data is crucial for the reliability of the results, particularly for short-read sequencing and metagenomic data.
4. **Discovery of New Pathways**: How to discover new pathways involving unknown or unrelated alternative enzymes, i.e., "biosynthetic dark matter."

### Solutions

1. **Rule-based Methods**: Use known key biosynthetic steps and principles to define BGCs by the presence of specific enzymes. For example, antiSMASH uses a series of rules to detect different types of BGCs.
2. **Probabilistic Methods**: Algorithms like ClusterFinder can identify BGCs that expert-generated rule sets fail to detect.
3. **Evolutionary Mining**: The EvoMining method identifies enzymes that may have been repurposed for secondary metabolite biosynthesis by detecting divergences of core metabolic enzymes in phylogenetic trees.
4. **Data Quality Control**: Ensure the quality of the input data, especially with short-read sequencing and metagenomic data, to avoid fragmented genes and BGCs whose genes are scattered across different contigs.

### Practical Applications

- **Genome Mining**: Conduct large-scale genome mining studies using tools like antiSMASH to discover new natural products.
- **Functional Annotation**: Provide detailed BGC annotations, including the functions of core enzymes, substrate specificity, etc.
- **Comparative Genomics**: Use modules like ClusterBlast to compare BGCs across different species and discover similar gene clusters.

### Conclusion

This paper discusses the application of current computational methods to the mining of secondary metabolite biosynthetic gene clusters and provides practical recommendations, emphasizing the importance of data quality and of new detection methods. These methods not only improve the accuracy of BGC predictions but also provide strong support for the discovery of new natural products.
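
To make the rule-based idea listed under Solutions more concrete, here is a minimal Python sketch. It is not antiSMASH's implementation or its actual rule set. The `Gene` record, the `RULES` table, the `max_gap` threshold and the function names are all assumptions made for illustration. The sketch assumes that each gene has already been annotated with protein domain hits (for example from a profile HMM search), and it reports a candidate region whenever neighbouring genes together carry every domain a rule requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record: coordinates plus pre-computed domain hits."""
    name: str
    start: int
    end: int
    domains: frozenset


# Illustrative rules only (NOT the antiSMASH rule set): each candidate BGC
# type maps to the set of domains that must all occur in one neighbourhood.
RULES = {
    "NRPS-like": frozenset({"Condensation", "AMP-binding"}),
    "PKS-like": frozenset({"PKS_KS", "PKS_AT"}),
}


def _evaluate(group, bgc_type, required):
    """Return one hit if the grouped genes jointly satisfy the rule."""
    found = frozenset().union(*(g.domains for g in group))
    if required <= found:
        region_end = max(g.end for g in group)
        return [(bgc_type, group[0].start, region_end, [g.name for g in group])]
    return []


def detect_clusters(genes, rules=RULES, max_gap=10_000):
    """Group core genes separated by at most max_gap bp and apply each rule."""
    hits = []
    for bgc_type, required in rules.items():
        # Core genes carry at least one of the domains this rule asks for.
        core = sorted((g for g in genes if g.domains & required),
                      key=lambda g: g.start)
        group = []
        for gene in core:
            if group and gene.start - group[-1].end > max_gap:
                hits.extend(_evaluate(group, bgc_type, required))
                group = []
            group.append(gene)
        if group:
            hits.extend(_evaluate(group, bgc_type, required))
    return hits


# Toy input: two adjacent NRPS-like genes and two PKS-like genes far apart.
genes = [
    Gene("geneA", 1_000, 4_000, frozenset({"Condensation"})),
    Gene("geneB", 4_500, 7_000, frozenset({"AMP-binding", "PP-binding"})),
    Gene("geneC", 60_000, 62_000, frozenset({"PKS_KS"})),
    Gene("geneD", 200_000, 203_000, frozenset({"PKS_AT"})),
]
print(detect_clusters(genes))
```

On this toy input only the NRPS-like rule fires, because the two PKS-like genes lie too far apart to be grouped. The sketch also illustrates two caveats named in the abstract: a rule can only find what it encodes, and the reported region spans only the core genes, so the true cluster boundaries remain unknown.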