Abstract:

Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base-pair-level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spuriously expressed genes (false positives) and missing genuinely expressed genes (false negatives). Such errors affect subsequent analyses, such as differential expression analysis. In our study, we observe that approximately 3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, approximately 10.0% of reported transcripts are not annotated in the Ensembl database. These unannotated transcripts could be either novel expressed genes or false discoveries.

Results: We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulation study, GeneScissors predicts spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measure. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts than the TopHat/Cufflinks pipeline. In addition, among the 10.0% of unannotated transcripts reported by TopHat/Cufflinks, GeneScissors finds that >16.3% are false positives.
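The precision and F-measure mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions; the function name `precision_recall_f` and the toy transcript identifiers are hypothetical, and the abstract does not say exactly how GeneScissors computes these quantities.

```python
def precision_recall_f(reported, truly_expressed):
    """Compare a pipeline's reported transcripts against a known truth set.

    Both arguments are iterables of transcript identifiers. Returns
    (precision, recall, F-measure) under the standard definitions.
    """
    reported = set(reported)
    truly_expressed = set(truly_expressed)

    tp = len(reported & truly_expressed)   # correctly reported transcripts
    fp = len(reported - truly_expressed)   # spurious calls, e.g. pseudogenes
    fn = len(truly_expressed - reported)   # expressed transcripts that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy data (hypothetical identifiers): one spurious pseudogene call, one missed gene.
truth = {"g1", "g2", "g3", "g4"}
before = {"g1", "g2", "g3", "pseudo1"}   # calls before filtering
after = before - {"pseudo1"}             # calls after removing the spurious one

stats_before = precision_recall_f(before, truth)
stats_after = precision_recall_f(after, truth)
```

In this toy case, removing the spurious call raises precision and the F-measure while leaving recall unchanged, which is the kind of effect the abstract describes for filtering misalignment-driven false positives.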

Instance-based Error Correction for Short Reads of Disease-Associated Genes

Turn ‘noise’ to signal: accurately rectify millions of erroneous short reads through graph learning on edit distances

Identification Of Sequence Variants In Genetic Disease-Causing Genes Using Targeted Next-Generation Sequencing

How Error Correction Affects PCR Deduplication: A Survey Based on UMI Datasets of Short Reads

Bi-Level Error Correction for PacBio Long Reads

Error filtering, pair assembly and error correction for next-generation sequencing reads

High efficiency error suppression for accurate detection of low-frequency variants

Enhanced Error Suppression for Accurate Detection of Low‐Frequency Variants

ReadsClean: a new approach to error correction of sequencing reads based on alignments clustering

Analysis of error profiles in deep next-generation sequencing data

Repeat and haplotype aware error correction in nanopore sequencing reads with DeChat

Improving sequence-based genotype calls with linkage disequilibrium and pedigree information

Comprehensive assessment of error correction methods for high-throughput sequencing data

Biases and errors on allele frequency estimation and disease association tests of next-generation sequencing of pooled samples

Integration of Hybrid and Self-Correction Method Improves the Quality of Long-Read Sequencing Data

One Size Doesn't Fit All - RefEditor: Building Personalized Diploid Reference Genome to Improve Read Mapping and Genotype Calling in Next Generation Sequencing Studies

GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

Single-sample SNP Detection by Empirical Bayesian Method Using Next Generation Sequencing Data

Analysis of Mutational Genotyping Using Correctable Decoding Sequencing with Superior Specificity

Efficient Frequency-Based De Novo Short-Read Clustering for Error Trimming in Next-Generation Sequencing

A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms