The Conditional Prediction Function: A Novel Technique to Control False Discovery Rate for Complex Models

Yushu Shi, Michael Martens
2023-10-08
Abstract: In modern scientific research, the objective is often to identify which variables are associated with an outcome among a large class of potential predictors. This goal can be achieved by selecting variables in a manner that controls the false discovery rate (FDR), the proportion of irrelevant predictors among the selections. Knockoff filtering is a cutting-edge approach to variable selection that provides FDR control. Existing knockoff statistics frequently employ linear models to assess relationships between features and the response, but the linearity assumption is often violated in real-world applications. This may result in poor power to detect truly prognostic variables. We introduce a knockoff statistic based on the conditional prediction function (CPF), which can pair with state-of-the-art machine learning predictive models, such as deep neural networks. The CPF statistics can capture the nonlinear relationships between predictors and outcomes while also accounting for correlation between features. We illustrate the capability of the CPF statistics to provide superior power over common knockoff statistics with continuous, categorical, and survival outcomes using repeated simulations. Knockoff filtering with the CPF statistics is demonstrated using (1) a residential building dataset to select predictors for the actual sales prices and (2) the TCGA dataset to select genes that are correlated with disease staging in lung cancer patients.
Methodology, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to control the false discovery rate (FDR) when selecting variables in high-dimensional data. When researchers need to identify which of a large number of potential predictors are related to an outcome, knockoff filtering provides FDR control, but existing knockoff statistics usually rely on linear models. The linearity assumption is often violated in real-world applications, which results in low power to detect truly prognostic variables. To solve this problem, the authors propose a new knockoff statistic based on the conditional prediction function (CPF). It can be paired with advanced machine learning prediction models, such as deep neural networks, to capture nonlinear relationships between predictors and outcomes while accounting for correlation between features. With this method, the paper aims to improve the power to detect truly prognostic variables for continuous, categorical, and survival outcomes while maintaining control of the FDR.
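The selection step that gives knockoff filtering its FDR control can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the feature statistics W_j have already been computed (for example from a CPF-based statistic, whose construction is not shown here) and applies the standard knockoff+ threshold from the knockoff filtering literature. The function names `knockoff_plus_threshold` and `knockoff_select` and the example values are hypothetical.

```python
import numpy as np


def knockoff_plus_threshold(W, q):
    """Return the knockoff+ threshold for statistics W at target FDR q.

    A large positive W_j is evidence that feature j matters; negative values
    are used to estimate how many selections are false. Returns np.inf when
    no threshold meets the target.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Estimated false discovery proportion at threshold t (knockoff+).
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf


def knockoff_select(W, q):
    """Indices of the features selected at target FDR q."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_plus_threshold(W, q))


# Made-up statistics for ten features, for illustration only.
W_example = [5, 4, 3, 2.5, 2, 1.5, -1, 0.5, -0.2, 0.1]
selected = knockoff_select(W_example, q=0.2)
```

With these made-up values the first six features are selected at q = 0.2, and lowering the target to q = 0.1 selects nothing: with so few features, the added 1 in the numerator makes a strict target unreachable.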