RNA-seq data science: From raw data to effective interpretation

Dhrithi Deshpande, Karishma Chhugani, Yutong Chang, Aaron Karlsberg, Caitlin Loeffler, Jinyang Zhang, Agata Muszynska, Jeremy Rotman, Laura Tao, Brunilda Balliu, Elizabeth Tseng, Eleazar Eskin, Fangqing Zhao, Pejman Mohammadi, Pawel P Labaj, Serghei Mangul
DOI: https://doi.org/10.48550/arXiv.2010.02391
2021-02-17
Abstract: RNA-sequencing (RNA-seq) has become an exemplar technology in modern biology and clinical applications over the past decade. It has gained immense popularity in recent years, driven by continuous efforts of the bioinformatics community to develop accurate and scalable computational tools. RNA-seq is a method of analyzing the RNA content of a sample using modern sequencing platforms. It generates enormous amounts of transcriptomic data in the form of nucleotide sequences, known as reads. RNA-seq analysis enables the probing of genes and corresponding transcripts, which is essential for answering important biological questions, such as detecting novel exons and transcripts, measuring gene expression, and studying alternative splicing structure. However, obtaining meaningful biological signals from raw data using computational methods is challenging due to the limitations of modern sequencing technologies. The need to overcome these technological challenges has pushed the rapid development of many novel computational tools, which have evolved and diversified in accordance with technological advancements, leading to the current myriad of RNA-seq tools. Our review provides a systematic overview of RNA-seq technology and 235 available RNA-seq tools across various domains published from 2008 to 2020, discussing the interdisciplinary nature of bioinformatics involved in RNA sequencing, analysis, and software development.
Genomics
What problem does this paper attempt to address?
This paper addresses the challenges of RNA-sequencing (RNA-seq) data analysis, in particular the path from raw data to effective interpretation. Although RNA-seq has become a key technology in modern biology and clinical applications over the past decade, obtaining meaningful biological signals from raw data remains challenging because of the limitations of modern sequencing technologies, including sequencing errors, length bias, and fragmentation. To overcome these limitations, researchers have developed many new computational tools, which have evolved alongside the technology and now form a diverse RNA-seq tool ecosystem. Specifically, the paper covers the following aspects:

1. **Providing a systematic overview**: The paper systematically reviews RNA-seq technology and its related computational tools, covering 235 RNA-seq tools released between 2008 and 2020, and discusses the interdisciplinary nature of bioinformatics in RNA sequencing, analysis, and software development.
2. **Exploring the development of computational tools**: The paper evaluates the average annual growth rate of computational tools for RNA-seq analysis, discusses the most popular tools in various fields, and analyzes the usability and archival stability of these tools.
3. **Introducing the data analysis process**: From generating raw data to effectively interpreting and visualizing data, the paper details each step of RNA-seq analysis, including data quality control, read alignment, and quantification techniques.
4. **Addressing specific technical challenges**: The paper discusses the respective advantages and limitations of short and long reads, and how hybrid methods that combine the two can improve the accuracy and efficiency of analysis.
5. **Gene expression quantification analysis**: The paper explores how computational methods estimate the expression levels of transcripts and genes, especially how they handle short reads that cannot be uniquely assigned to a specific transcript.
6. **Differential gene expression analysis**: The paper introduces statistical methods for detecting significant differences in gene and transcript expression levels between experimental groups, including how to reduce noise and false-positive results.
7. **Allele-specific expression measurement**: The paper also discusses how RNA-seq data can be used to measure allele-specific expression (ASE) in order to study the cis-regulatory effects of genetic variation.

In summary, through a comprehensive review of RNA-seq technology and its computational tools, this paper aims to give researchers a systematic framework for using existing computational tools more effectively in RNA-seq data analysis.