dgfr: an R package to assess sequence diversity of gene families

Laila Viana Almeida, João Luís Reis-Cunha, Daniella C. Bartholomeu
DOI: https://doi.org/10.1186/s12859-024-05826-2
IF: 3.307
2024-06-09
BMC Bioinformatics
Abstract: Gene families are groups of homologous genes that often have similar biological functions. These families are formed by gene duplication events throughout evolution, resulting in multiple copies of an ancestral gene. Over time, these copies can acquire mutations and structural variations, resulting in members that may vary in size, motif ordering and sequence. Multigene families have been described in a broad range of organisms, from single-celled bacteria to complex multicellular organisms, and have been linked to an array of phenomena, such as host–pathogen interactions, immune evasion and embryonic development. Despite the importance of gene families, few approaches have been developed for estimating and graphically visualizing their diversity patterns and expression profiles in genome-wide studies.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology
What problem does this paper attempt to address?
The paper attempts to address the issue of sequence diversity assessment and visualization of gene families. Specifically:

1. **Background**:
   - Gene families are groups of genes with a common evolutionary origin, often having similar biological functions.
   - Gene families are formed through gene duplication events during evolution, leading to multiple copies of ancestral genes. These copies accumulate mutations and structural variations over time, resulting in differences in size, motif order, and sequence among family members.
   - Gene families are widespread across various organisms, from bacteria to humans, and are associated with various phenomena such as host-pathogen interactions, immune evasion, and embryonic development.
   - Despite the evident importance of gene families, there are currently few methods available for estimating and visualizing their diversity patterns and expression profiles.
2. **Problem**:
   - Existing methods are inadequate in handling the diversity and expression data of multigene families, especially during multiple sequence alignment, as differences in size and motif order among family members make alignment challenging.
   - There is a need for a tool to effectively estimate and visualize the sequence diversity and expression patterns of gene families.
3. **Solution**:
   - The authors developed an R package named `dgfr`, which can estimate and visualize sequence diversity within gene families and visualize secondary data (such as gene expression).
   - The `dgfr` package achieves this through the following steps:
     - Input a multi-sequence FASTA file containing coding sequences (CDS) or amino acid sequences.
     - Perform pairwise alignment to estimate distances between sequences.
     - Conduct dimensionality reduction, determine the optimal number of clusters, and assign genes to each cluster.
     - Generate datasets that allow users to visualize sequence diversity and expression patterns within gene families.
4. **Conclusion**:
   - `dgfr` provides a method to estimate and study the diversity of gene families and visualize the dispersion of sequences and secondary features.
   - The tool has a user-friendly interface and is easy to operate, requiring only a FASTA sequence file of gene family members as input.
   - The `dgfr` package is freely available on GitHub under the GPL-3 license.
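
The steps listed under Solution (FASTA input, pairwise distances, dimensionality reduction, cluster selection and assignment) can be illustrated with a minimal, standard-library Python sketch. It is not the `dgfr` API and not necessarily the authors' method: every function name below is hypothetical, normalised edit distance stands in for an alignment-based distance, classical multidimensional scaling stands in for the dimensionality reduction (the summary does not name one), and choosing the number of clusters by mean silhouette width is an assumption.

```python
import math
import random
from itertools import combinations

def parse_fasta(text):
    """Parse multi-sequence FASTA text into {identifier: sequence}."""
    records, name = {}, None
    for line in (raw.strip() for raw in text.splitlines()):
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif line and name is not None:
            records[name] += line.upper()
    return records

def edit_distance(a, b):
    """Levenshtein distance, a simple stand-in for a pairwise alignment score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric matrix of length-normalised pairwise edit distances."""
    d = [[0.0] * len(seqs) for _ in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        longest = max(len(seqs[i]), len(seqs[j]), 1)
        d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j]) / longest
    return d

def classical_mds(d, dims=2, iters=500, seed=0):
    """Embed a distance matrix in `dims` axes (power iteration with deflation)."""
    n, rng = len(d), random.Random(seed)
    sq = [[x * x for x in row] for row in d]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centred Gram matrix: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand) for j in range(n)]
         for i in range(n)]
    def mat_vec(v):
        return [sum(bij * vj for bij, vj in zip(row, v)) for row in b]
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iters):
            w = mat_vec(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient recovers the signed eigenvalue for direction v
        lam = sum(x * y for x, y in zip(v, mat_vec(v))) / sum(x * x for x in v)
        if lam <= 1e-12:
            break  # no positive eigenvalue left; remaining axes stay at zero
        for i in range(n):
            coords[i][axis] = v[i] * math.sqrt(lam)
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]  # deflate the extracted axis
    return coords

def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters score 0."""
    def mean_dist(p, cluster, skip=None):
        ds = [math.dist(p, q) for j, q in enumerate(points)
              if labels[j] == cluster and j != skip]
        return sum(ds) / len(ds) if ds else None
    scores = []
    for i, p in enumerate(points):
        a = mean_dist(p, labels[i], skip=i)
        b = min((mean_dist(p, c) for c in set(labels) if c != labels[i]), default=None)
        ok = a is not None and b is not None and max(a, b) > 0
        scores.append((b - a) / max(a, b) if ok else 0.0)
    return sum(scores) / len(scores)

def cluster_family(fasta_text, max_k=4):
    """Return ({name: cluster}, {name: coordinates}); needs at least 3 sequences."""
    records = parse_fasta(fasta_text)
    names = list(records)
    coords = classical_mds(distance_matrix([records[n] for n in names]))
    candidates = [kmeans(coords, k) for k in range(2, min(max_k, len(names) - 1) + 1)]
    best = max(candidates, key=lambda labels: silhouette(coords, labels))
    return dict(zip(names, best)), dict(zip(names, coords))
```

Passing the text of a FASTA file to the hypothetical `cluster_family` returns a cluster label and a pair of coordinates per sequence, which is the kind of dataset that could then be plotted and coloured by a secondary feature such as expression. Pairwise comparison is used here because it does not require a multiple sequence alignment, which the summary identifies as the difficult step for families whose members differ in size and motif order.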