Abstract:Background: Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes our method first computes all pairwise gene similarities. Then it runs pairwise ILP comparisons to compute optimal gene matchings, which minimize, by taking the similarities into account, the weighted rearrangement distance between the analyzed genomes (a problem that is NP-hard). The gene matchings are then integrated into gene families in the final step. The mentioned ILP includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment in the other genome, producing an exponential increase of the search space. Results: In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering (based on their gene content intersections) the linear segments into [Formula: see text] subsets, whose ends are capped independently. Furthermore, in each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much bigger than one, and with the possible side effect of resulting in sub-optimal instead of optimal gene matchings, the heuristic works very well in practice, from both the speed performance and the quality of computed solutions. Our experiments on primate and fruit fly genomes show two positive results. First, for complete assemblies of five primates the version with heuristic capping reports orthologies that are very similar to the orthologies computed by the version of our tool with optimal capping. Second, we were able to efficiently analyze fruit fly genomes with incomplete assemblies distributed in hundreds or even thousands of contigs, obtaining gene families that are very similar to [Formula: see text] families. Indeed, our tool inferred a higher number of complete cliques, with a higher intersection with [Formula: see text], when compared to gene families computed by other inference tools. We added a post-processing for refining, with the aid of the [Formula: see text] algorithm, our ambiguous families (those with more than one gene per genome), improving even more the accuracy of our results. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities and the post-processing refinement of ambiguous families with [Formula: see text]. Both the original version with optimal capping and the new modified version with heuristic capping can be downloaded, together with their detailed documentations, at https://gitlab.ub.uni-bielefeld.de/gi/FFGC or as a Conda package at https://anaconda.org/bioconda/ffgc .

PyamilySeq: A Python Tool for Interpretable Gene (Re)Clustering and Pangenomic Inference Across Species and Genera

Klumpy: A Tool to Evaluate the Integrity of Long-Read Genome Assemblies and Illusive Sequence Motifs

pyGeno: A Python package for precision medicine and proteogenomics

Klumpy: A tool to evaluate the integrity of long‐read genome assemblies and illusive sequence motifs

diverse-seq: an application for alignment-free selecting and clustering biological sequences

Agalma: an automated phylogenomics workflow

Efficient gene orthology inference via large-scale rearrangements

GALEON: a comprehensive bioinformatic tool to analyse and visualize gene clusters in complete genomes

ProSeq4: A user‐friendly multiplatform program for preparation and analysis of large‐scale DNA polymorphism datasets

PhyloSuite: An integrated and scalable desktop platform for streamlined molecular sequence data management and evolutionary phylogenetics studies

A composite genome approach to identify phylogenetically informative data from next-generation sequencing

ParGenes: a tool for massively parallel model selection and phylogenetic tree inference on thousands of genes

Unbiased pangenome graphs

SPANDx: a genomics pipeline for comparative analysis of large haploid whole genome re-sequencing datasets

Cluster efficient pangenome graph construction with nf-core/pangenome

Sciphy: A Bayesian phylogenetic framework using sequential genetic lineage tracing data.

Reseqtools: an Integrated Toolkit for Large-Scale Next-Generation Sequencing Based Resequencing Analysis

PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes

Seqminer2: an efficient tool to query and retrieve genotypes for statistical genetics analyses from biobank scale sequence dataset

Integer programming framework for pangenome-based genome inference

PPanG: a precision pangenome browser enabling nucleotide-level analysis of genomic variations in individual genomes and their graph-based pangenome