Abstract: The T-cell receptor (TCR) population in humans comprises highly diversified heterodimers that govern recognition of antigen-major histocompatibility complexes. Tremendous TCR sequence diversity is produced by somatic recombination of several TCR gene loci, each consisting of multiple gene segments. Next-generation sequencing (NGS) has enabled comprehensive profiling of the TCR repertoire under different physiological and disease conditions, generating much interest in using TCR-seq to assess T-cell diversity. However, during NGS library construction and sequencing, errors and enzymatic inefficiencies can compromise the accuracy of the final data, particularly in calling the VD and VDJ recombined regions and in subsequent clonotype assignment. To increase the accuracy of NGS, unique molecular identifiers (UMIs), which consist of short random nucleotide sequences, can be used to tag the original molecules in an NGS library, allowing correction of errors and biases. There are two well-studied technical limitations in applying UMIs: (1) UMI sequences tend to collide when the number of input molecules is large, and (2) UMI sequences are not insulated from PCR and sequencing errors. To address these limitations, many computational approaches have been published. Among them, very few can resolve UMI collision errors, and oversimplified error models have been implemented for handling UMI sequencing errors. Here we report a novel strategy and UMI structure that uses more complex UMIs, which are longer and of varying lengths. This minimizes UMI collisions while maximizing sequencing quality. Our UMI analysis pipeline, "UMI-nea", handles not only substitution errors but also indel errors and UMIs of different lengths. We developed a novel computational framework that processes sequence comparisons in parallel to mitigate the elevated computational burden.
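As a rough illustration of the two limitations named above, the following Python sketch estimates how many distinct UMIs are expected when a given number of molecules are tagged with uniformly random UMIs, and computes an edit distance that counts insertions and deletions as well as substitutions. This is not the UMI-nea implementation: the function names `expected_distinct_umis` and `levenshtein` are hypothetical, the collision estimate assumes every UMI of a fixed length is equally likely, and the distance shown is the standard Levenshtein distance, used here only as one plausible way to compare UMIs of different lengths.

```python
import math


def expected_distinct_umis(n_molecules: int, umi_length: int) -> float:
    """Expected number of distinct UMIs observed among n_molecules.

    Assumes UMIs are drawn uniformly at random from all 4**umi_length
    sequences (umi_length >= 1). Molecules that share a UMI "collide"
    and would be counted as one.
    """
    n_codes = 4 ** umi_length
    # N * (1 - (1 - 1/N)**n), written with log1p/expm1 for numerical stability.
    return -n_codes * math.expm1(n_molecules * math.log1p(-1.0 / n_codes))


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions, and deletions."""
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Compare collision loss for a short and a longer UMI at the same input.
    for length in (8, 12):
        distinct = expected_distinct_umis(100_000, length)
        print(f"{length} nt UMI: about {distinct:.0f} distinct UMIs for 100000 molecules")
    # A single deleted base is one edit, which a Hamming distance cannot express.
    print(levenshtein("ACGTACGT", "ACGACGT"))
```

Under these assumptions, lengthening the UMI enlarges the pool of possible sequences and so reduces the expected number of collisions, while an edit distance of this kind allows error-bearing UMIs with a missing or extra base to be compared with their likely parent UMI.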
To account for the varied dispersion of PCR efficiency across molecules and for error-bearing UMIs from libraries with different inputs and sequencing depths, we also developed a statistical framework that leverages a negative binomial model and the single-cell knee plot to set a dynamic threshold for estimating the number of original molecules. We verified UMI-nea on several simulated datasets and demonstrated that it achieves >99% completeness and homogeneity in recovering the original molecule count across various error rates and UMI lengths, outperforming the existing tools and methods compared. We applied UMI-nea to profile the TCR repertoire of 8 PBMC samples sequenced on different Illumina platforms at different sequencing depths and observed >85% reproducibility of clonotype calls in all samples. To test the sensitivity and specificity of UMI-nea, we sequenced pure cell line samples and cell line spike-in samples at different ratios and observed very high recall and precision.

Citation Format: Jixin Deng, Jingxiao Zhang, Song Tian, Samuel J. Rulli, Hong Xu, John DiCarlo, Eric Lader. UMI-nea: A fast and robust UMI analysis approach to accurately identify and quantify TCR repertoire from targeted RNA sequencing with wide range of input molecules [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 7425.
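The abstract reports completeness and homogeneity without defining them. A minimal sketch follows, assuming they refer to the standard entropy-based clustering metrics, where each read has a true molecule label and a predicted UMI-cluster label. That interpretation is an assumption, and the function name `homogeneity_completeness` is hypothetical rather than part of UMI-nea.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(labels, given):
    """H(labels | given) in nats."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    marginal = Counter(given)
    return -sum((c / n) * math.log(c / marginal[g]) for (g, _), c in joint.items())


def homogeneity_completeness(truth, predicted):
    """Return (homogeneity, completeness) for two equal-length label lists.

    truth[i] is the true molecule of read i; predicted[i] is the cluster
    the read was assigned to. Both scores lie between 0 and 1.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        return 1.0, 1.0
    h_truth = _entropy(truth)
    h_pred = _entropy(predicted)
    # Homogeneity: each cluster contains reads from a single true molecule.
    homogeneity = 1.0 if h_truth == 0 else 1.0 - _conditional_entropy(truth, predicted) / h_truth
    # Completeness: all reads from one true molecule fall in the same cluster.
    completeness = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(predicted, truth) / h_pred
    return homogeneity, completeness


if __name__ == "__main__":
    truth = ["m1", "m1", "m2", "m2"]
    print(homogeneity_completeness(truth, ["a", "a", "b", "b"]))  # clusters match molecules
    print(homogeneity_completeness(truth, ["a", "b", "c", "d"]))  # every read split off
    print(homogeneity_completeness(truth, ["a", "a", "a", "a"]))  # everything merged
```

Under this reading, error-bearing UMIs that are not merged back into their parent lower completeness, while collided or over-merged UMIs lower homogeneity, so a method needs to score high on both to recover the original molecule count.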
