Abstract: Computational analysis of biosynthetic gene clusters (BGCs) has revolutionized natural product discovery by enabling the rapid investigation of secondary metabolic potential within microbial genome sequences. Grouping homologous BGCs into Gene Cluster Families (GCFs) facilitates mapping their architectural and taxonomic diversity and provides insights into the novelty of putative BGCs, through dereplication with BGCs of known function. While multiple databases exist for exploring BGCs from publicly available data, no public resources exist that focus on GCF relationships. Here, we present BiG-FAM, a database of 29,955 GCFs capturing the global diversity of 1,225,071 BGCs predicted from 209,206 publicly available microbial genomes and metagenome-assembled genomes (MAGs). The database offers rich functionalities, such as multi-criterion GCF searches, direct links to BGC databases such as antiSMASH-DB, and rapid GCF annotation of user-supplied BGCs from antiSMASH results. BiG-FAM can be accessed online at https://bigfam.bioinformatics.nl.

What problem does this paper attempt to address?

The problem this paper addresses is **the lack of a public resource focused on gene cluster family (GCF) relationships to support the discovery and study of microbial natural products (NPs, i.e. secondary metabolites)**. Multiple databases already exist for exploring biosynthetic gene clusters (BGCs), but none of them focuses on how BGCs group into families, which makes it hard to survey BGC relationships across taxa at scale.

To close this gap, the authors developed the BiG-FAM database. By clustering homologous BGCs into GCFs, BiG-FAM offers the following:

1. **Global diversity analysis**: BiG-FAM contains 29,955 GCFs that capture the global diversity of 1,225,071 BGCs predicted from 209,206 publicly available microbial genomes and metagenome-assembled genomes (MAGs).
2. **Fast query and annotation**: Users can rapidly annotate BGCs from their own antiSMASH results with GCFs and compare them with BGCs of known function (dereplication), which helps assess their novelty and likely function.
3. **Rich functionality**: BiG-FAM provides multi-criterion GCF searches, direct links to other BGC databases (such as antiSMASH-DB), and rapid GCF annotation of user-supplied BGCs.

Through these functions, BiG-FAM aims to fill the gap left by existing resources and to support research on microbial secondary metabolites.

### Formula representation

The paper does not present explicit mathematical formulas, but the methods behind it can be summarized as follows:

- **BiG-SLiCE** uses a near-linear clustering algorithm to process more than 1 million BGCs, so its running time can be idealized as \[ T(n)=O(n) \] where \(n\) is the number of BGCs and \(T(n)\) is the running time.
- **GCF model**: each GCF is represented by a feature vector in Euclidean space, built by summarizing the features shared by the BGCs of that GCF: \[ M_{GCF}=\left[\begin{array}{cccc} f_1&f_2&\cdots&f_m \end{array}\right] \] where \(f_i\) is the \(i\)-th feature and \(m\) is the number of features.

These methods enable BiG-FAM to process and analyze large-scale BGC data efficiently.
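The GCF-model idea above can be illustrated with a minimal Python sketch. It is not the BiG-SLiCE or BiG-FAM implementation: the feature values, the GCF names, the `threshold` value, and the helper functions `centroid` and `assign_to_gcf` are all hypothetical and chosen only for illustration. The sketch assumes that a GCF model is the mean of its members' feature vectors and that a query BGC is assigned to the nearest model if it lies within a distance threshold.

```python
import math


def centroid(vectors):
    """Mean of equally long feature vectors; a stand-in for a GCF model."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_to_gcf(query, gcf_models, threshold):
    """Return (gcf_id, distance) for the nearest GCF model.

    If the nearest model is farther away than `threshold`, the query is
    treated as potentially novel and (None, distance) is returned.
    """
    best_id, best_dist = None, math.inf
    for gcf_id, model in gcf_models.items():
        dist = math.dist(query, model)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_id, best_dist = gcf_id, dist
    if best_dist <= threshold:
        return best_id, best_dist
    return None, best_dist


# Toy feature vectors (hypothetical values, three features per BGC).
gcf_members = {
    "GCF_A": [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    "GCF_B": [[0.0, 5.0, 0.0], [0.0, 7.0, 0.0]],
}
gcf_models = {gcf_id: centroid(vs) for gcf_id, vs in gcf_members.items()}

# A query close to GCF_A is assigned to it; a distant query is flagged as novel.
hit = assign_to_gcf([2.0, 0.0, 3.0], gcf_models, threshold=1.5)
novel = assign_to_gcf([10.0, 10.0, 10.0], gcf_models, threshold=1.5)
print(hit, novel)
```

In this sketch, annotating one query BGC takes a single pass over the GCF models, so the cost per query grows linearly with the number of families rather than with the number of BGCs behind them.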

BiG-FAM: the biosynthetic gene cluster families database
