Abstract: Many drugs are derived from small molecules produced by microorganisms and plants, so-called natural products. Natural products have diverse chemical structures, but the biosynthetic pathways producing those compounds are often organized as biosynthetic gene clusters (BGCs) and follow a highly conserved biosynthetic logic. This allows for the identification of core biosynthetic enzymes using genome mining strategies that are based on the sequence similarity of the involved enzymes/genes. However, mining for a variety of BGCs quickly approaches a complexity level where manual analyses are no longer possible and automated genome mining pipelines, such as the antiSMASH software, are required. In this review, we discuss the principles underlying the predictions of antiSMASH and other tools and provide practical advice for their application. Furthermore, we discuss important caveats such as rule-based BGC detection, sequence and annotation quality, and cluster boundary prediction, all of which have to be considered when planning, performing, and analyzing the results of genome mining studies.

What problem does this paper attempt to address?

The problem this paper attempts to address is: how to effectively mine secondary metabolite biosynthetic gene clusters (BGCs) in microbial and plant genomes using computational methods (such as antiSMASH and other tools) to discover new natural products.

### Background and Problem

Many drugs are derived from small molecules produced by microbes and plants, known as natural products. Natural products have diverse chemical structures, but their biosynthetic pathways are usually organized into biosynthetic gene clusters (BGCs) and follow highly conserved biosynthetic logic. This allows core biosynthetic enzymes to be identified through sequence similarity-based genome mining strategies. However, as the variety of BGCs under study grows, manual analysis is no longer feasible, necessitating automated genome mining pipelines such as the antiSMASH software.

### Main Issues

1. **Complexity**: As the number of BGC types increases, manual analysis becomes too complex and automated tools are required.
2. **Prediction Accuracy**: How to improve the accuracy of BGC predictions, especially across different types of secondary metabolites.
3. **Data Quality**: The quality of the input data is crucial for the reliability of results, particularly for data from short-read sequencing technologies and metagenomes.
4. **Discovery of New Pathways**: How to discover new pathways that involve unknown or unrelated alternative enzymes, i.e., "biosynthetic dark matter."

### Solutions

1. **Rule-based Methods**: Use known key biosynthetic steps and principles to define BGCs through the presence of specific enzymes. For example, antiSMASH uses a series of rules to detect different types of BGCs.
2. **Probabilistic Methods**: Algorithms like ClusterFinder can identify BGCs that expert-generated rule sets fail to detect.
3. **Evolutionary Mining**: The EvoMining method identifies enzymes that may have been repurposed for secondary metabolite biosynthesis by detecting divergences of core metabolic enzymes in phylogenetic trees.
4. **Data Quality Control**: Ensure the quality of the input data, especially when using short-read sequencing technologies and metagenomic data, to avoid fragmented genes and gene clusters that are split across different contigs.

### Practical Applications

- **Genome Mining**: Conduct large-scale genome mining studies using tools like antiSMASH to discover new natural products.
- **Functional Annotation**: Provide detailed BGC annotations, including the functions of core enzymes, substrate specificity, etc.
- **Comparative Genomics**: Use modules like ClusterBlast to compare BGCs across different species and discover similar gene clusters.

### Conclusion

This paper discusses the application of current computational methods to the mining of secondary metabolite biosynthetic gene clusters and provides practical recommendations, emphasizing the importance of data quality and new methods. These methods not only improve the accuracy of BGC predictions but also provide strong support for the discovery of new natural products.
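
The rule-based detection and data-quality points listed under Solutions can be illustrated with a small Python sketch. It is not antiSMASH's implementation: the rule names, domain labels, the `extension` value, and the `Gene` and `detect_regions` names are all hypothetical, and domain hits are assumed to have been computed beforehand (for example by a profile-HMM search). The sketch marks a gene as a core enzyme when it carries every domain a rule requires, extends the region by a fixed flank on both sides, and flags regions that run into a contig end, since such clusters may be truncated in fragmented assemblies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int           # 0-based start coordinate on the contig
    end: int             # end coordinate on the contig
    domains: frozenset   # precomputed domain hits (assumed input)


# Hypothetical rules: cluster type -> domains that must all occur in one gene.
# The labels are illustrative placeholders, not real antiSMASH profile names.
RULES = {
    "PKS-like": frozenset({"KS", "AT"}),
    "NRPS-like": frozenset({"C", "A", "PCP"}),
}


def detect_regions(genes, contig_length, rules=RULES, extension=20_000):
    """Return candidate cluster regions around genes that satisfy a rule.

    Each region is the core gene plus `extension` bp on either side,
    clipped to the contig. `on_contig_edge` is True when clipping occurred,
    which hints that the cluster may be incomplete.
    """
    regions = []
    for gene in genes:
        for label, required in rules.items():
            if required <= gene.domains:  # all required domains present
                raw_start = gene.start - extension
                raw_end = gene.end + extension
                regions.append({
                    "type": label,
                    "core_gene": gene.name,
                    "start": max(0, raw_start),
                    "end": min(contig_length, raw_end),
                    "on_contig_edge": raw_start < 0 or raw_end > contig_length,
                })
    return regions


# Toy input: three genes on a 100 kb contig (made-up coordinates).
genes = [
    Gene("geneA", 1_000, 7_000, frozenset({"KS", "AT", "ACP"})),
    Gene("geneB", 60_000, 66_000, frozenset({"C", "A", "PCP"})),
    Gene("geneC", 90_000, 91_000, frozenset({"ABC_transporter"})),
]
regions = detect_regions(genes, contig_length=100_000)
for region in regions:
    print(region)
```

In this toy input, `geneA` sits close to the contig start, so its region is clipped and flagged, while `geneC` matches no rule and yields no region. A fixed flank is a deliberately crude stand-in for cluster boundary prediction, which the abstract names as one of the caveats to consider when interpreting results.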

Recent development of antiSMASH and other computational approaches to mine secondary metabolite biosynthetic gene clusters

antiSMASH 3.0—a comprehensive resource for the genome mining of biosynthetic gene clusters

antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences

antiSMASH 4.0—improvements in chemistry prediction and gene cluster boundary identification

Mini review: Genome mining approaches for the identification of secondary metabolite biosynthetic gene clusters in Streptomyces

antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline

antiSMASH 7.0: new and improved predictions for detection, regulation, chemical structures and visualisation

Computational Methods for Identification of Novel Secondary Metabolite Biosynthetic Pathways by Genome Analysis

Exploring Newer Biosynthetic Gene Clusters in Marine Microbial Prospecting

Genome mining strategies for ribosomally synthesised and post-translationally modified peptides

Genome mining for the search and discovery of bioactive compounds: The Streptomyces paradigm

Genome mining of biosynthetic and chemotherapeutic gene clusters in Streptomyces bacteria

The antiSMASH database version 3: increased taxonomic coverage and new query features for modular enzymes

Engineering fungal secondary metabolism: a roadmap to novel compounds.

A deep learning genome-mining strategy for biosynthetic gene cluster prediction

Predicting biological activity from biosynthetic gene clusters using neural networks

Navigating and expanding the roadmap of natural product genome mining tools

High-Throughput Mining of Novel Compounds from Known Microbes: A Boost to Natural Product Screening

Computational Tools for Discovering and Engineering Natural Product Biosynthetic Pathways

Next-generation synthetic biology approaches for the accelerated discovery of microbial natural products

Synthetic Biology Tools for Novel Secondary Metabolite Discovery in Streptomyces