Solving FDR-Controlled Sparse Regression Problems with Five Million Variables on a Laptop

Fabian Scheidt, Jasin Machkour, Michael Muma
DOI: https://doi.org/10.1109/CAMSAP58249.2023.10403478
2024-09-28
Abstract: Currently, there is an urgent demand for scalable multivariate and high-dimensional false discovery rate (FDR)-controlling variable selection methods to ensure the reproducibility of discoveries. However, among existing methods, only the recently proposed Terminating-Random Experiments (T-Rex) selector scales to problems with millions of variables, as encountered in, e.g., genomics research. The T-Rex selector is a new learning framework based on early terminated random experiments with computer-generated dummy variables. In this work, we propose the Big T-Rex, a new implementation of T-Rex that drastically reduces its Random Access Memory (RAM) consumption to enable solving FDR-controlled sparse regression problems with millions of variables on a laptop. We incorporate advanced memory-mapping techniques to work with matrices that reside on solid-state drive and two new dummy generation strategies based on permutations of a reference matrix. Our numerical experiments demonstrate a drastic reduction in memory demand and computation time. We showcase that the Big T-Rex can efficiently solve FDR-controlled Lasso-type problems with five million variables on a laptop in thirty minutes. Our work empowers researchers without access to high-performance clusters to make reproducible discoveries in large-scale high-dimensional data.
Signal Processing, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of controlling the false discovery rate (FDR) during variable selection in high-dimensional data. Specifically, it aims to provide a method that can solve high-dimensional regression problems with millions of variables on an ordinary laptop while ensuring that the selected variables are reproducible.

### Main problems

1. **FDR control in high-dimensional data**:
   - In high-dimensional data, the number of variables far exceeds the number of observations ($p \gg n$), which makes traditional FDR-controlling methods difficult to apply.
   - According to the paper, among existing FDR-controlling methods only the T-Rex selector scales to problems with millions of variables, and even it is constrained when computing resources are limited.
2. **Limitations of computing resources**:
   - Processing data sets with millions of variables requires a large amount of memory and computing power, and many researchers do not have access to high-performance computing (HPC) clusters.
   - A method that runs efficiently on an ordinary laptop is therefore needed.

### Solutions

To solve these problems, the paper proposes **Big T-Rex**, a new implementation of the existing T-Rex selector. The main contributions are:

1. **Reducing memory consumption**:
   - Memory mapping lets the algorithm work with matrices that reside on a solid-state drive (SSD) instead of holding them entirely in RAM.
   - This significantly reduces memory requirements, making it possible to process large-scale data on an ordinary laptop.
2. **Efficient dummy variable generation strategies**:
   - Two new dummy variable generation strategies, both based on permutations of a reference matrix, reduce the need to store multiple dummy matrices.
   - These strategies reduce memory consumption and also improve computational efficiency.
3. **Experimental verification**:
   - Numerical experiments verify the effectiveness of Big T-Rex on sparse regression problems with five million variables.
   - The results show that Big T-Rex solves such a problem on a laptop in thirty minutes, with drastically lower memory demand and computation time.

### Conclusions

With these improvements, Big T-Rex enables researchers without high-performance computing resources to perform FDR-controlled high-dimensional variable selection on an ordinary laptop. This matters for fields such as genomics and proteomics, where it can help researchers analyze large-scale high-dimensional data more efficiently and make reproducible discoveries.
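
The two ideas described above, keeping matrices on disk via memory mapping and deriving new dummy matrices by permuting a single reference matrix, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `create_reference_dummies` and `permuted_dummy_block` are hypothetical, and using row permutations of a standard-normal reference matrix is an assumption made for illustration, since this summary does not spell out the two permutation strategies.

```python
import os
import tempfile

import numpy as np


def create_reference_dummies(path, n, num_dummies, seed=0, chunk=1024):
    """Write an n x num_dummies standard-normal matrix to disk chunk by chunk.

    Only `chunk` columns are held in RAM at any time; the full matrix
    lives in the memory-mapped file at `path`.
    """
    rng = np.random.default_rng(seed)
    # Column-major layout so that each block of columns is contiguous on disk.
    ref = np.memmap(path, dtype=np.float64, mode="w+",
                    shape=(n, num_dummies), order="F")
    for start in range(0, num_dummies, chunk):
        stop = min(start + chunk, num_dummies)
        ref[:, start:stop] = rng.standard_normal((n, stop - start))
    ref.flush()
    return ref


def permuted_dummy_block(ref, cols, row_perm):
    """Load the requested columns and return them with rows permuted.

    Assumption for this sketch: a new dummy matrix is obtained by permuting
    the rows of the reference matrix, so no second matrix is ever stored.
    """
    return np.asarray(ref[:, cols])[row_perm, :]


def demo():
    """Generate three permuted dummy blocks from one on-disk reference matrix."""
    n, num_dummies = 50, 2000
    shapes = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reference_dummies.bin")
        ref = create_reference_dummies(path, n, num_dummies)
        rng = np.random.default_rng(1)
        for _ in range(3):  # e.g. one permutation per random experiment
            perm = rng.permutation(n)
            block = permuted_dummy_block(ref, slice(0, 100), perm)
            shapes.append(block.shape)
        del ref  # release the memory map before the directory is removed
    return shapes


if __name__ == "__main__":
    print(demo())
```

In this sketch the only large object is the file on disk. Each random experiment needs just a length-`n` permutation and the block of columns currently in use, which is the kind of saving the summary attributes to storing one reference matrix instead of several dummy matrices.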