Explicit Scale Simulation for analysis of RNA-sequencing with ALDEx2

Gregory B Gloor, Michelle Pistner Nixon, Justin D. Silverman
DOI: https://doi.org/10.1101/2023.10.21.563431
2024-11-24
Abstract: In high-throughput sequencing (HTS) studies, sample-to-sample variation in sequencing depth is driven by technical factors and not by variation in the scale (e.g., total size, microbial load, or total mRNA expression) of the underlying biological systems. Typically, a statistical normalization is used to remove unwanted technical variation in the data or the parameters of the model, enabling analyses that rely on scale, such as differential abundance and differential expression analyses. We recently showed that all normalizations make implicit assumptions about the unmeasured system scale and that errors in these assumptions can dramatically increase false positive and false negative rates. We demonstrated that these errors can be mitigated by accounting for uncertainty about scale using a scale model, which we integrated into the ALDEx2 R package. This article provides new insights into those methods, focusing on their application to transcriptomic analysis. Here we provide transcriptomic case studies demonstrating how scale models, rather than traditional normalizations, can reduce false positive and false negative rates in practice while enhancing the transparency and reproducibility of analyses. We show that these scale models remove the need for the dual cutoff approaches often used to address the disconnect between practical and statistical significance. We demonstrate the utility of scale models built from known housekeeping genes in complex metatranscriptomic datasets. Thus, this work provides examples and practical guidance on how to incorporate scale into transcriptomic analysis.
Biology