Abstract: Currently, there is an urgent demand for scalable multivariate and high-dimensional false discovery rate (FDR)-controlling variable selection methods to ensure the reproducibility of discoveries. However, among existing methods, only the recently proposed Terminating-Random Experiments (T-Rex) selector scales to problems with millions of variables, as encountered in, e.g., genomics research. The T-Rex selector is a new learning framework based on early terminated random experiments with computer-generated dummy variables. In this work, we propose the Big T-Rex, a new implementation of T-Rex that drastically reduces its Random Access Memory (RAM) consumption to enable solving FDR-controlled sparse regression problems with millions of variables on a laptop. We incorporate advanced memory-mapping techniques to work with matrices that reside on solid-state drive and two new dummy generation strategies based on permutations of a reference matrix. Our numerical experiments demonstrate a drastic reduction in memory demand and computation time. We showcase that the Big T-Rex can efficiently solve FDR-controlled Lasso-type problems with five million variables on a laptop in thirty minutes. Our work empowers researchers without access to high-performance clusters to make reproducible discoveries in large-scale high-dimensional data.
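
For reference, the FDR named in the abstract is conventionally defined as the expected fraction of false discoveries among all selected variables. The symbols $R$, $V$, and $\alpha$ below are notation introduced here for illustration and do not appear in the abstract:

$$
\mathrm{FDR} = \mathbb{E}\left[\frac{V}{\max(R, 1)}\right],
$$

where $R$ is the number of selected variables and $V$ is the number of selected variables that have no true effect on the response. A selector controls the FDR at a target level $\alpha \in [0, 1]$ if $\mathrm{FDR} \le \alpha$.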

What problem does this paper attempt to address?

This paper addresses the problem of controlling the false discovery rate (FDR) when performing variable selection in high-dimensional data. Specifically, its goal is a method that can handle high-dimensional regression problems with millions of variables on an ordinary laptop while ensuring the reproducibility of the selected variables.

### Main problems

1. **FDR control in high-dimensional data**:
   - In high-dimensional data, the number of variables far exceeds the number of observations ($p \gg n$), which makes traditional FDR control methods difficult to apply.
   - The paper points out that, among existing FDR-controlling methods, only the T-Rex selector scales to problems with millions of variables, and that even it is limited by its RAM consumption.
2. **Limitations of computing resources**:
   - Processing data sets with millions of variables requires a large amount of memory and computing power, and many researchers do not have access to high-performance computing (HPC) clusters.
   - Therefore, a method that runs efficiently on an ordinary laptop is needed.

### Solutions

To solve the above problems, the paper proposes **Big T-Rex**, a new implementation of the existing T-Rex selector. The main contributions include:

1. **Reducing memory consumption**:
   - By using memory mapping, matrices are stored on a solid-state drive (SSD) instead of being held entirely in RAM.
   - This significantly reduces memory requirements, making it possible to process large-scale data on an ordinary laptop.
2. **Efficient dummy variable generation strategies**:
   - Two new dummy variable generation strategies, based on permutations of a reference matrix, are proposed, thereby reducing the need to store multiple dummy matrices.
   - These strategies reduce memory consumption and also improve computational efficiency.
3. **Experimental verification**:
   - The effectiveness of Big T-Rex is verified through numerical experiments on FDR-controlled sparse regression problems with 5 million variables.
   - The experimental results show that Big T-Rex completes such a problem on a laptop in thirty minutes, with greatly reduced memory demand and computation time.

### Conclusions

Through these improvements, Big T-Rex enables researchers without high-performance computing resources to perform FDR-controlled high-dimensional variable selection on an ordinary laptop. This matters for fields such as genomics, where it can help researchers analyze large-scale high-dimensional data more efficiently and make reproducible discoveries.
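
The two ideas summarized above, memory mapping and permutation-based dummy generation, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the file layout, and the choice of row permutations are assumptions made here for illustration, and the paper's two actual permutation strategies may differ. The sketch writes one standard-normal reference matrix to disk with `numpy.memmap`, then derives the dummies for each random experiment by permuting the rows of that single on-disk matrix, loading only the requested block of columns into RAM.

```python
import os
import tempfile

import numpy as np


def write_reference_dummies(path, n, num_dummies, seed=0, chunk=256):
    """Write an n x num_dummies standard-normal reference matrix to disk.

    The matrix is filled chunk by chunk, so at most `chunk` columns are
    generated in RAM at a time. Column-major order keeps each column
    block contiguous in the file.
    """
    rng = np.random.default_rng(seed)
    ref = np.memmap(path, dtype=np.float32, mode="w+",
                    shape=(n, num_dummies), order="F")
    for start in range(0, num_dummies, chunk):
        stop = min(start + chunk, num_dummies)
        ref[:, start:stop] = rng.standard_normal((n, stop - start),
                                                 dtype=np.float32)
    ref.flush()
    del ref  # release the mapping


def permuted_dummy_block(path, n, num_dummies, col_start, col_stop,
                         experiment_seed):
    """Return dummy columns [col_start, col_stop) for one random experiment.

    Assumption (hypothetical strategy): every experiment reuses the same
    on-disk reference matrix, with its rows shuffled by a permutation
    derived from the experiment's seed.
    """
    ref = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(n, num_dummies), order="F")
    perm = np.random.default_rng(experiment_seed).permutation(n)
    block = np.array(ref[:, col_start:col_stop])  # copy only this block
    return block[perm, :]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "reference_dummies.dat")
    write_reference_dummies(path, n=100, num_dummies=1000)
    block = permuted_dummy_block(path, 100, 1000, 0, 50, experiment_seed=7)
    print(block.shape)
```

Under these assumptions, only one dummy matrix is ever stored, and each experiment needs to keep just a seed (or a permutation of length $n$) rather than its own full matrix, which is the kind of saving the summary attributes to the permutation-based strategies.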

Solving FDR-Controlled Sparse Regression Problems with Five Million Variables on a Laptop
