VEHoP: A Versatile, Easy-to-use, and Homology-based Phylogenomic pipeline accommodating diverse sequences

Yunlong Li, Xu Liu, Chong Chen, Jian-Wen Qiu, Kevin Kocot, Jin Sun
DOI: https://doi.org/10.1101/2024.07.24.604968
2024-07-24
Abstract: Phylogenomics has become a prominent method in systematics, conservation biology, and biomedicine, as it can leverage hundreds to thousands of genes derived from genomic or transcriptomic data to infer evolutionary relationships. However, obtaining high-quality genomes and transcriptomes requires samples preserved with high-quality DNA and RNA and demands considerable sequencing costs and lofty bioinformatic efforts (e.g., genome/transcriptome assembly and annotation). Notably, only fragmented DNA reads are accessible in some rare species due to the difficulty in sample collection and preservation, such as those inhabiting the deep sea. To address this issue, we here introduce the VEHoP (Versatile, Easy-to-use Homology-based Phylogenomic) pipeline, designed to infer protein-coding regions from DNA assemblies and generate alignments of orthologous sequences, concatenated matrices, and phylogenetic trees. This pipeline aims to 1) expand taxonomic sampling by accommodating a wide range of input files, including draft genomes, transcriptomes, and well-annotated genomes, and 2) simplify the process of conducting phylogenomic analyses and thus make it more accessible to researchers from diverse backgrounds. We first evaluated the performance of VEHoP using datasets of Ostreida, yielding robust phylogenetic trees with strong bootstrap support. We then applied VEHoP to reconstruct the phylogenetic relationships within the enigmatic deep-sea gastropod order Neomphalida, obtaining a robust phylogenetic backbone for this group. VEHoP is freely available on GitHub (https://github.com/ylify/VEHoP), and its dependencies can be easily installed using Bioconda.
Bioinformatics
What problem does this paper attempt to address?
The main problem this paper addresses is how to simplify phylogenomic analysis workflows and extend them to a wider range of genomic data. Specifically, the study develops a new pipeline, VEHoP (Versatile, Easy-to-use Homology-based Phylogenomic pipeline), that targets three key issues:

1. **Diverse input files**: Existing phylogenomic tools typically require high-quality genome or transcriptome data, which is hard to obtain for species whose samples are difficult to collect (such as deep-sea organisms). VEHoP accepts several types of input, including draft genomes, transcriptomes, and well-annotated genomes, or any combination of these.
2. **Simplified analysis workflow**: Traditional phylogenomic workflows are cumbersome and time-consuming, requiring a series of complex steps such as genome assembly and annotation. VEHoP runs the entire process, from input data to phylogenetic tree construction, with a single command, which greatly lowers the barrier for researchers.
3. **Broader taxonomic sampling**: Because of difficulties in sample preservation or high sequencing costs, only fragmented DNA sequences are available for some rare species. VEHoP can reconstruct reliable phylogenetic relationships from such fragmented data, thereby expanding the range of species that can be sampled.

Through two case studies (oysters of the order Ostreida and the deep-sea gastropod order Neomphalida), the authors demonstrate the efficiency and accuracy of VEHoP on different types of data and compare it with other existing tools. The results show that VEHoP can combine data from multiple sources to construct stable phylogenetic trees, and that it performs particularly well on low-quality or fragmented data.
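
One step named in the abstract, building a concatenated matrix from per-gene alignments of orthologous sequences, can be illustrated with a minimal Python sketch. This is not VEHoP's actual code: the function name `concatenate_alignments`, the dictionary-based alignment representation, and the toy sequences are all assumptions made for illustration. The sketch pads taxa that are missing from a gene with gap characters, a common convention when assembling a supermatrix, and records where each gene starts and ends.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one alignment maps taxon name -> aligned sequence.
Alignment = Dict[str, str]


def concatenate_alignments(
    alignments: List[Alignment], missing: str = "-"
) -> Tuple[Dict[str, str], List[Tuple[int, int]]]:
    """Concatenate per-gene alignments into a single supermatrix.

    Returns the supermatrix (taxon -> concatenated sequence) and a list of
    1-based, inclusive (start, end) column ranges, one per input gene.
    """
    # Every taxon seen in any gene gets a row in the supermatrix.
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    pieces: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    partitions: List[Tuple[int, int]] = []
    start = 1

    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one alignment must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this gene are filled with the missing-data symbol.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((start, start + length - 1))
        start += length

    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, partitions


# Toy example: taxon "B" lacks the second gene.
gene1 = {"A": "MKT", "B": "MRT", "C": "M-T"}
gene2 = {"A": "LLVV", "C": "LIVV"}
matrix, partitions = concatenate_alignments([gene1, gene2])
for taxon, seq in matrix.items():
    print(taxon, seq)
print(partitions)
```

In this toy run, taxon `B` is kept in the matrix with gaps for the gene it lacks, which is how a fragmentary genome can still contribute the genes that were recovered from it. The returned column ranges are the kind of information a partitioned tree-inference analysis would need.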