Augmenting Transcriptome Annotations through the Lens of Splicing Evolution

Xiaofei Carl Zang, Ke Chen, Irtesam Mahmud Khan, Mingfu Shao
DOI: https://doi.org/10.1101/2024.11.04.621892
2024-11-06
Abstract: Alternative splicing (AS) is a ubiquitous mechanism in eukaryotes. It is estimated that 90% of human genes are alternatively spliced. Despite enormous efforts, transcriptome annotations remain, nevertheless, incomplete. Conventional means of annotation were largely driven by experimental data such as RNA-seq and protein sequences, while little insight was shed on understanding transcriptomes and alternative splicings from the perspective of evolution. This study addresses this critical gap by presenting TENNIS (Transcript EvolutioN for New Isoform Splicing), an evolution-based model to predict unannotated isoforms and refine existing annotations without requiring additional data. The model of TENNIS is based on two minimal premises--AS isoforms evolve sequentially from existing isoforms, and each evolutionary step involves a single AS event. We formulate the identification of missing transcripts as an optimization problem and parsimoniously find the minimal number of novel transcripts. Our analysis showed approximately 80% of multi-transcript groups from six transcriptome annotations satisfy our evolutionary model. At a high confidence level, 40% of isoforms predicted by TENNIS were validated by deep long-read RNA-seq. In a simulated incomplete annotation scenario, TENNIS dramatically outperforms two randomized baseline approaches by a 2.25-3 fold-change in precision or a 3.5-3.9 fold-change in recall, after controlling the same level of recall or precision of the baseline methods. These results demonstrate that TENNIS effectively identifies missing transcripts by complying with minimal propositions, offering a powerful approach for transcriptome augmentations through the lens of alternative splicing evolutions. TENNIS is freely available at https://github.com/Shao-Group/tennis.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the incompleteness of existing transcriptome annotations. Despite extensive efforts to improve them, annotations remain incomplete, particularly with respect to alternative splicing (AS). Most annotation methods rely on experimental data such as RNA-seq and protein sequences, and few studies approach the transcriptome and alternative splicing from an evolutionary perspective. The paper proposes TENNIS (Transcript EvolutioN for New Isoform Splicing), an evolution-based model that predicts unannotated isoforms and refines existing annotations without requiring additional data. TENNIS rests on two minimal premises: 1. AS isoforms evolve sequentially from existing isoforms. 2. Each evolutionary step involves a single AS event. Under these premises, the authors formulate the identification of missing transcripts as an optimization problem: find the minimum number of novel transcripts such that all alternative splicing isoforms are linked into a single connected graph. Across six transcriptome annotations, approximately 80% of multi-transcript groups satisfy this evolutionary model. At a high confidence level, 40% of the isoforms predicted by TENNIS were validated by deep long-read RNA-seq, which supports the model's effectiveness in identifying missing transcripts.
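The connected-graph formulation above can be illustrated with a toy sketch. This is not the TENNIS implementation: it assumes each transcript is just a set of exon IDs and that "a single AS event" means exactly one exon is included or skipped, which is far simpler than real splicing events. The function names (`one_event_apart`, `count_components`, `min_novel_transcripts`) are hypothetical, and the search is brute force, suitable only for tiny examples.

```python
from itertools import combinations


def one_event_apart(a, b):
    """Toy assumption: two transcripts differ by one AS event
    if exactly one exon is present in one and absent in the other."""
    return len(a ^ b) == 1


def count_components(transcripts):
    """Count connected components of the graph whose nodes are transcripts
    and whose edges join transcripts that are one event apart."""
    transcripts = list(transcripts)
    seen = set()
    n_components = 0
    for start in range(len(transcripts)):
        if start in seen:
            continue
        n_components += 1
        stack = [start]
        seen.add(start)
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            for j in range(len(transcripts)):
                if j not in seen and one_event_apart(transcripts[i], transcripts[j]):
                    seen.add(j)
                    stack.append(j)
    return n_components


def min_novel_transcripts(annotated, max_added=3):
    """Brute-force search for a smallest set of novel transcripts that
    makes the graph connected. Returns a list, or None if no solution
    with at most max_added transcripts exists."""
    annotated = set(annotated)
    exons = sorted(set().union(*annotated))
    # Candidate novel transcripts: every non-empty exon subset not yet annotated.
    candidates = [
        frozenset(c)
        for r in range(1, len(exons) + 1)
        for c in combinations(exons, r)
        if frozenset(c) not in annotated
    ]
    for k in range(max_added + 1):  # try smaller solutions first
        for added in combinations(candidates, k):
            if count_components(annotated | set(added)) == 1:
                return list(added)
    return None


# A group that already satisfies the toy model (a chain of single-exon changes).
chain = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 4}), frozenset({1, 4})]
print(count_components(chain))

# A group with a gap: two exons differ, so one intermediate isoform is missing.
gap = [frozenset({1, 2, 3, 4}), frozenset({1, 4})]
print(min_novel_transcripts(gap))
```

In this toy setting, a multi-transcript group "satisfies the model" when `count_components` returns 1, and a predicted missing isoform is any transcript returned by `min_novel_transcripts`. Several equally small solutions may exist; the sketch returns only the first one it finds.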