PyamilySeq: A Python Tool for Interpretable Gene (Re)Clustering and Pangenomic Inference Across Species and Genera

Nicholas J. Dimonaco
2024-07-28
Abstract: PyamilySeq is a Python-based tool designed for interpretable gene clustering and pangenomic inference, supporting analyses at both species and genus levels. It facilitates the clustering of gene sequences into families based on sequence similarity using CD-HIT, and can take the output of tried-and-tested sequence clustering tools such as CD-HIT, BLAST, DIAMOND, and MMseqs2. PyamilySeq is distinctive in its ability to integrate new sequences into existing clusters, providing a robust framework for iterative analysis while preserving the original clusters, which is useful when reannotating genomes. In addition to the standard Species mode, which, as with other tools, performs core-gene analysis across a species range, PyamilySeq can be run in Genus mode, where it detects the presence of gene families shared across multiple genera. These features enhance the tool's applicability for ongoing and past genomic studies and comparative analyses. PyamilySeq generates comprehensive outputs, including gene presence-absence matrices and aligned sequence data, enabling downstream analysis and interpretation of the identified gene groups and pangenomic data.
Genomics
What problem does this paper attempt to address?
This paper introduces PyamilySeq, a Python tool for gene (re)clustering and pangenome inference that supports analysis at both the species and genus levels. Specifically, the paper addresses the following issues:
1. **Challenges in gene clustering and pangenome analysis**: Existing tools have limitations in usability, applicability, and interoperability. PyamilySeq addresses these by providing a user-friendly platform that lets researchers cluster gene sequences by sequence similarity at both the species and genus levels.
2. **Integration of new sequences**: With existing gene-clustering tools, it is often difficult to integrate new sequences into existing clusters. PyamilySeq allows users to add new sequences to existing clusters, supporting iterative analysis while retaining the original clusters.
3. **Cross-genus analysis**: Most existing pangenome tools perform core-gene analysis only within a single species, whereas PyamilySeq can detect gene families shared across multiple genera, broadening the scope of research and giving a wider view of gene distribution among different genera.
4. **Flexible input options**: PyamilySeq accepts the outputs of multiple sequence-clustering tools (such as CD-HIT, BLAST, DIAMOND, and MMseqs2) and can handle DNA or amino-acid sequences or clusters as input. Users can also specify their own "core / soft-core / accessory" definitions to better match the distribution of the input data.
5. **Comprehensive outputs**: PyamilySeq generates comprehensive output files, including gene presence-absence matrices and aligned sequence data, which can be used for downstream analysis and interpretation of the identified gene groups and pangenomic data.
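To make the presence-absence matrix and the user-defined "core / soft-core / accessory" definitions more concrete, the following is a minimal sketch and not PyamilySeq's actual implementation. The function names, the `genome|gene` identifier format, and the threshold values are all assumptions chosen for illustration.

```python
from collections import defaultdict


def presence_absence(clusters):
    """Map each gene family to the set of genomes that contain a member.

    `clusters` maps a family name to a list of gene IDs. The
    "genome|gene" ID format is an assumption made for this sketch.
    """
    matrix = defaultdict(set)
    for family, gene_ids in clusters.items():
        for gene_id in gene_ids:
            genome = gene_id.split("|", 1)[0]
            matrix[family].add(genome)
    return dict(matrix)


def classify(matrix, n_genomes, core=0.99, soft_core=0.95):
    """Label each family by the fraction of genomes in which it occurs.

    The default thresholds are hypothetical; the point is that they are
    user-adjustable, not that these are the tool's own values.
    """
    labels = {}
    for family, genomes in matrix.items():
        fraction = len(genomes) / n_genomes
        if fraction >= core:
            labels[family] = "core"
        elif fraction >= soft_core:
            labels[family] = "soft-core"
        else:
            labels[family] = "accessory"
    return labels


# Toy input: three gene families across four genomes (gA..gD).
clusters = {
    "family_1": ["gA|g1", "gB|g7", "gC|g3", "gD|g2"],
    "family_2": ["gA|g2", "gB|g8", "gC|g9"],
    "family_3": ["gA|g3", "gA|g4"],  # paralogs in one genome count once
}

matrix = presence_absence(clusters)
# Thresholds relaxed to suit a four-genome toy dataset.
labels = classify(matrix, n_genomes=4, core=1.0, soft_core=0.75)
print(labels)
```

With only four genomes, a 99% core threshold would be equivalent to requiring presence in every genome, which is why adjustable definitions matter for small or unevenly sampled datasets.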
In summary, PyamilySeq is a flexible tool for gene clustering and pangenome analysis, particularly suited to research that requires comparative analysis across species and genera.