BiG-FAM: the biosynthetic gene cluster families database

Satria A Kautsar,Kai Blin,Simon Shaw,Tilmann Weber,Marnix H Medema
DOI: https://doi.org/10.1093/nar/gkaa812
IF: 14.9
2020-10-03
Nucleic Acids Research
Abstract: Computational analysis of biosynthetic gene clusters (BGCs) has revolutionized natural product discovery by enabling the rapid investigation of secondary metabolic potential within microbial genome sequences. Grouping homologous BGCs into Gene Cluster Families (GCFs) facilitates mapping their architectural and taxonomic diversity and provides insights into the novelty of putative BGCs, through dereplication with BGCs of known function. While multiple databases exist for exploring BGCs from publicly available data, no public resources exist that focus on GCF relationships. Here, we present BiG-FAM, a database of 29,955 GCFs capturing the global diversity of 1,225,071 BGCs predicted from 209,206 publicly available microbial genomes and metagenome-assembled genomes (MAGs). The database offers rich functionalities, such as multi-criterion GCF searches, direct links to BGC databases such as antiSMASH-DB, and rapid GCF annotation of user-supplied BGCs from antiSMASH results. BiG-FAM can be accessed online at https://bigfam.bioinformatics.nl.
biochemistry & molecular biology
What problem does this paper attempt to address?
The problem this paper addresses is **the lack of a public resource focused on the relationships between gene cluster families (GCFs) to support the discovery and study of microbial natural products (NPs), i.e., secondary metabolites**. Although multiple databases already exist for exploring biosynthetic gene clusters (BGCs), they mainly support analysis based on sequence similarity and do not effectively reveal global relationships across taxa. Existing tools and resources are also limited in handling large-scale BGC data, which makes comprehensive, cross-taxa analysis of BGC relationships difficult.

To address this, the authors developed the BiG-FAM database. By clustering homologous BGCs into GCFs, BiG-FAM offers the following:

1. **Global diversity analysis**: BiG-FAM captures the global diversity of 1,225,071 BGCs predicted from 209,206 publicly available microbial genomes and metagenome-assembled genomes (MAGs), grouped into 29,955 GCFs.
2. **Fast query and annotation**: Users can quickly query and annotate newly sequenced BGCs and compare them with BGCs of known function, and so assess their novelty and potential function.
3. **Rich functionality**: BiG-FAM provides multi-criterion GCF searches, direct links to other BGC databases (such as antiSMASH-DB), and rapid GCF annotation of user-supplied BGCs from antiSMASH results.

Through these functions, BiG-FAM aims to fill the gap left by existing resources and to support research on microbial secondary metabolites.

### Formula representation

This paper does not present specific mathematical or physical formulas, but it relies on several computational methods and algorithms, for example:

- **BiG-SLiCE** uses a near-linear clustering algorithm to process more than 1 million BGCs. As an idealization, its running time scales as \[ T(n)=O(n) \] where \(n\) is the number of BGCs and \(T(n)\) is the running time.
- **GCF model**: a feature vector in Euclidean space, constructed by summarizing the BGC features shared within each GCF: \[ M_{GCF}=\left[\begin{array}{cccc} f_1&f_2&\cdots&f_m\\ \end{array}\right] \] where \(f_i\) is the \(i\)-th feature and \(m\) is the number of features.

These methods enable BiG-FAM to process and analyze large-scale BGC data efficiently.
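To make the GCF model idea concrete, here is a minimal Python sketch of one way a query BGC could be matched against per-family feature vectors. It is an illustration under stated assumptions, not the BiG-SLiCE or BiG-FAM implementation: the column-wise mean used to summarize members, the toy three-feature vectors, the family names `GCF_A` and `GCF_B`, the helper names, and the `threshold` value are all hypothetical.

```python
import math


def centroid(vectors):
    """Summarize member BGC feature vectors as one vector (column-wise mean).

    Assumption: averaging is used here only as a simple stand-in for
    "summarizing the shared BGC features" of a family.
    """
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_gcf(query, gcf_models, threshold):
    """Return (gcf_id, distance) for the nearest model.

    If the nearest model is farther than `threshold`, gcf_id is None,
    which would flag the query as not fitting any known family.
    """
    best_id, best_dist = None, math.inf
    for gcf_id, model in gcf_models.items():
        d = euclidean(query, model)
        if d < best_dist:
            best_id, best_dist = gcf_id, d
    if best_dist > threshold:
        return None, best_dist
    return best_id, best_dist


# Hypothetical toy data: two families, each with two member BGCs
# described by three numeric features.
gcf_members = {
    "GCF_A": [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    "GCF_B": [[0.0, 5.0, 0.0], [0.0, 7.0, 0.0]],
}
gcf_models = {name: centroid(members) for name, members in gcf_members.items()}

close_hit = assign_to_gcf([2.0, 1.0, 2.0], gcf_models, threshold=2.0)
no_hit = assign_to_gcf([10.0, 10.0, 10.0], gcf_models, threshold=2.0)
print(close_hit)
print(no_hit)
```

The first query lies within the threshold of `GCF_A` and is assigned to it; the second is far from both models, so it comes back with `None`, the kind of result that would suggest a potentially novel BGC. Comparing each query against one vector per family, rather than against every member BGC, is what keeps this kind of lookup cheap as the number of BGCs grows.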