APHIX: Analysis Pipeline for HIV-1 Isoform eXploration Using Long-read RNA Sequencing Data

Jessica L Albert, Christian M Gallardo, Bruce E Torbett
DOI: https://doi.org/10.1101/2024.12.09.627634
2024-12-15
Abstract: HIV-1 uses 4 major splice donors and 8 major splice acceptors, as well as dozens of minor, cryptic, and uncharacterized splice sites, to produce over one hundred distinct transcript isoforms from a single 9.2 kb genome. Because HIV alternative splicing is more complex than human mRNA splicing, existing bioinformatic pipelines struggle to analyze spliced HIV sequences accurately. Previous approaches to identifying HIV isoforms from long-read sequencing data used pipelines that are not publicly available, are convoluted to operate, or are locked into a specific HIV strain, which limits their adoption in other experimental designs or systems. To address this gap, we have developed a bioinformatic pipeline called APHIX that fully automates spliced isoform assignment, splice site usage quantification, and non-coding exon detection. APHIX takes a FASTQ or FASTA file of long-read transcripts and an HIV genome reference sequence as input. It calculates splice site usage counts and percentages for each donor and acceptor site and for their pairwise combinations, accurately assigns isoforms, and automatically identifies transcripts containing non-coding exons. APHIX is compatible with long reads generated from multiple platforms and library preps, including direct DNA and RNA sequencing. APHIX can also be adapted to multiple HIV-1 clades and strains by providing the appropriate reference sequence during bioinformatic processing. Overall, APHIX enables comprehensive processing of spliced sequences with reproducible results and is faster and easier to run than other methods.
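The splice site usage quantification described in the abstract can be illustrated with a minimal Python sketch. This is not APHIX's implementation: the function name `splice_site_usage`, the input layout (one list of donor/acceptor junction pairs per read), the site labels in the toy data, and the choice of total observed junctions as the percentage denominator are all assumptions made for illustration. APHIX may normalize differently.

```python
from collections import Counter


def splice_site_usage(reads):
    """Tally donor, acceptor, and donor-acceptor pair usage.

    `reads` is a hypothetical input format: one entry per read, each a
    list of (donor, acceptor) junctions already called for that read.
    Returns three dicts mapping a site (or pair) to (count, percent).
    """
    donor_counts = Counter()
    acceptor_counts = Counter()
    pair_counts = Counter()

    for junctions in reads:
        for donor, acceptor in junctions:
            donor_counts[donor] += 1
            acceptor_counts[acceptor] += 1
            pair_counts[(donor, acceptor)] += 1

    def with_percent(counts):
        # Assumption: percentages are relative to all observed junctions.
        total = sum(counts.values())
        if total == 0:
            return {}
        return {key: (n, 100.0 * n / total) for key, n in counts.items()}

    return (
        with_percent(donor_counts),
        with_percent(acceptor_counts),
        with_percent(pair_counts),
    )


# Toy data with illustrative site labels; not real measurements.
reads = [
    [("D1", "A1")],
    [("D1", "A5")],
    [("D1", "A2"), ("D3", "A5")],
    [("D1", "A5"), ("D4", "A7")],
]
donors, acceptors, pairs = splice_site_usage(reads)

for site, (count, pct) in sorted(donors.items()):
    print(f"{site}\t{count}\t{pct:.1f}%")
```

Keeping the three tallies separate mirrors the abstract's distinction between per-donor usage, per-acceptor usage, and their pairwise combinations.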
Biology