Abstract: In gene set analysis, a researcher is typically confronted with many uncertainties in choosing the "right" method, the corresponding parameter setting, and the data preprocessing approach. We aim to offer guidance through these uncertainties by giving an overview of valid approaches and illustrations of their implementations in R. The result images in the graphical abstract were generated using the web‐based application Gene Set Enrichment Analysis (Subramanian et al., 2005, Proceedings of the National Academy of Sciences, 102(43):15545–15550; Mootha et al., 2003, Nature Genetics, 34(3):267–273). Gene set analysis (GSA), a popular approach for analyzing high‐throughput gene expression data, aims to identify sets of related genes that show significantly enriched or depleted expression patterns between different conditions. In recent years, a multitude of methods have been developed for this task. However, clear guidance is lacking: choosing the right method is the first hurdle a researcher faces. No less challenging than this so‐called method uncertainty is preprocessing: the researcher must know which steps are required and then select, from the plethora of valid options, an approach that produces the input object the method accepts (data preprocessing uncertainty). Here, too, clear guidance is scarce. We provide a practical guide through all steps required to conduct GSA, beginning with a concise overview of a selection of established methods, including Gene Set Enrichment Analysis and the Database for Annotation, Visualization, and Integrated Discovery (DAVID). We place a special focus on reviewing and explaining the necessary preprocessing steps for each method under consideration (e.g., the necessity of a transformation of the RNA sequencing data), an essential aspect that typically receives only limited attention in both existing reviews and applications.
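As a rough illustration of what "a transformation of the RNA sequencing data" can look like, the following minimal sketch converts raw counts to log2 counts per million (log-CPM). It is written in Python for illustration only and is not taken from the article's R code. The function name `log2_cpm`, the toy count table, and the offsets (0.5 added to each count, 1 added to each library size) are assumptions of this sketch; which transformation is appropriate depends on the GSA method being used.

```python
import math

def log2_cpm(counts, prior_count=0.5):
    """Transform raw counts to log2 counts per million.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    prior_count: small offset (an assumption of this sketch) that keeps
                 the logarithm finite for genes with zero counts.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    # Library size = total number of counts observed in each sample.
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    return {
        g: [
            math.log2((counts[g][j] + prior_count) / (lib_sizes[j] + 1.0) * 1e6)
            for j in range(n_samples)
        ]
        for g in genes
    }

# Hypothetical toy data: two genes measured in two samples.
toy_counts = {"geneA": [10, 0], "geneB": [90, 100]}
toy_logcpm = log2_cpm(toy_counts)
```

Methods that expect approximately continuous, normally distributed expression values would take something like `toy_logcpm` as input, whereas count-based approaches would work on `toy_counts` directly.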
To raise awareness of the spectrum of uncertainties, our review is accompanied by an extensive overview of the literature on valid approaches for each step and illustrative R code demonstrating the complex analysis pipelines. It ends with a discussion and recommendations to both users and developers to ensure that the results of GSA are, despite the above‐mentioned uncertainties, replicable and transparent.

This article is categorized under: Statistical and Graphical Methods of Data Analysis > Analysis of High Dimensional Data; Data: Types and Structure > Data Preparation and Processing; Applications of Computational Statistics > Genomics/Proteomics/Genetics

Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles

Gene Set linkage analysis: a tool for interpreting the overall functional impacts of observed transcriptomic changes

Gene Set Enrichment Analysis (GSEA) for Interpreting Gene Expression Profiles

Assessment of Gene Set Enrichment Analysis using curated RNA-seq-based benchmarks

GSVA: gene set variation analysis for microarray and RNA-Seq data

GAGE: generally applicable gene set enrichment for pathway analysis

Easy and efficient ensemble gene set testing with EGSEA

Human interactome resource and gene set linkage analysis for the functional interpretation of biologically meaningful gene sets

Integration of Differential Gene-combination Search and Gene Set Enrichment Analysis: A General Approach

Abstract 4281: Data driven refinement of gene signatures for enrichment analysis and cell state characterization

Abstract 2343: Adapting gene set enrichment analysis to single cell data

Data driven refinement of gene expression signatures for enrichment analysis

GS-TCGA: Gene Set-Based Analysis of The Cancer Genome Atlas

Toward a gold standard for benchmarking gene set enrichment analysis

From RNA sequencing measurements to the final results: A practical guide to navigating the choices and uncertainties of gene set analysis

OncoEnrichR: cancer-dedicated gene set interpretation

DeepGSEA: explainable deep gene set enrichment analysis for single-cell transcriptomic data

GSCA: an integrated platform for gene set cancer analysis at genomic, pharmacogenomic and immunogenomic levels

Bayesian Gene Set Analysis

GIGSEA: Genotype Imputed Gene Set Enrichment Analysis Using GWAS Summary Level Data

Improving Detection of Differentially Expressed Gene Sets by Applying Cluster Enrichment Analysis to Gene Ontology