riboCleaner: a pipeline to identify and quantify rRNA read contamination from RNA-seq data in plants

Pu Huang, Erin Davis, Xia Cao, Hunter J Cameron
DOI: https://doi.org/10.1093/bioinformatics/btac402
IF: 5.8
2022-06-24
Bioinformatics
Abstract: Analysis of gene expression data can be crucial for elucidating biological relationships within living organisms. However, accurate quantification of gene expression relies directly upon the accuracy of the reference genome or transcriptome to which the expression data are mapped. Errors in gene annotation can lead to errors in quantification of gene expression. One source of gene annotation error in eukaryotes arises from incorrect predictions of mRNA gene models within ribosomal DNA (rDNA) regions. Here, we provide examples of how the presence of false gene models in rDNA regions can result in a handful of genes appearing to contribute >50% of the total transcripts per million (TPM) values of entire RNA-seq datasets. To address this problem, we have created riboCleaner, a bioinformatics pipeline designed to identify misannotated gene models in rDNA regions and quantify rRNA-derived reads in RNA-seq data. We also show the applicability of riboCleaner in several plant genome assemblies. We have implemented riboCleaner as a containerized Snakemake workflow. The workflow, instructions for building the container, and other documentation are available at https://github.com/basf. For convenience, a prebuilt Docker image containing riboCleaner is available at https://hub.docker.com/u/basfcontainers. Supplementary data are available at Bioinformatics online.
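To illustrate why a few false gene models in rDNA regions can dominate a dataset, the following is a minimal Python sketch of the standard TPM calculation and of the share of total TPM taken by a flagged set of genes. It is not riboCleaner's implementation; the function names, gene identifiers, counts, and lengths are all hypothetical and chosen only for illustration.

```python
def tpm(counts, lengths):
    """Compute TPM from raw read counts and transcript lengths (in bp).

    Standard definition: reads per base for each gene, scaled so that
    all values sum to one million.
    """
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads assigned to any gene")
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def flagged_tpm_fraction(tpm_values, flagged):
    """Fraction of total TPM contributed by the flagged gene models."""
    flagged_sum = sum(tpm_values[g] for g in flagged if g in tpm_values)
    return flagged_sum / sum(tpm_values.values())


# Hypothetical toy data: one misannotated gene model overlapping rDNA
# absorbs most of the reads because rRNA is highly abundant.
counts = {"geneA": 500, "geneB": 300, "geneC": 200, "rDNA_model1": 90000}
lengths = {"geneA": 2000, "geneB": 1500, "geneC": 1000, "rDNA_model1": 1800}

values = tpm(counts, lengths)
fraction = flagged_tpm_fraction(values, {"rDNA_model1"})
print(f"Flagged gene models account for {fraction:.1%} of total TPM")
```

Because TPM values are normalized to a fixed total, any reads absorbed by a false gene model also lower the TPM of every genuine gene, which is why such models need to be identified before expression values are compared.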
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology