Abstract: Variable selection has been widely used in data analysis for decades, and it has become increasingly important in the Big Data era, when a dataset often contains hundreds of variables. To enhance the interpretability of a model, identifying potentially relevant features is often a step taken before fitting all the features into a regression model. A good variable selection method should effectively control the fraction of false discoveries while ensuring sufficient power of its selection set. In many contemporary data applications, a large portion of features are coded as binary variables. Binary features are widespread in many fields, from online controlled experiments to genome science to physical statistics. Although a handful of recent papers provide provable false discovery rate (FDR) control in variable selection, most of the theoretical analyses rely on strong dependency assumptions or Gaussian assumptions on the features. In this paper, we propose a variable selection method in the regression framework for selecting binary features. Under mild conditions, we show that the FDR is controlled exactly under a target level in finite samples if the underlying distribution of the binary features is known. We show in simulations that FDR control is still attained when the feature distribution is estimated from data. We also provide theoretical results on the power of our variable selection method in a linear regression model or a logistic regression model. In the restricted settings where competitors exist, we show in simulations and in a real data application on an HIV antiretroviral therapy dataset that our method has higher power than the competitor.

Assessing quality of selection procedures: Lower bound of false positive rate as a function of inter-rater reliability

Fairness in Risk Assessment Instruments: Post-Processing to Achieve Counterfactual Equalized Odds

Student's t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce

Interrater agreement statistics under the two-rater dichotomous-response case with correlated decisions

Evaluating inter-rater reliability in the context of "Sysmex UN2000 detection of protein/creatinine ratio and of renal tubular epithelial cells can be used for screening lupus nephritis": a statistical examination

Interrater reliability for multilevel data: A generalizability theory approach.

Kappa statistic considerations in evaluating inter-rater reliability between two raters: which, when and context matters

Intrinsic Fairness-Accuracy Tradeoffs under Equalized Odds

Confidence Intervals for Error Rates in 1:1 Matching Tasks: Critical Statistical Analysis and Recommendations

Probabilities of true and false decisions in conformity assessment of a finite sample of items

Selection by Prediction with Conformal p-values

Local False Discovery Rate Estimation with Competition-Based Procedures for Variable Selection

Rethinking the Funding Line at the Swiss National Science Foundation: Bayesian Ranking and Lottery

Optimal ROC-Based Classification and Performance Analysis under Bayesian Uncertainty Models

Nonparametric Assessment of Variable Selection and Ranking Algorithms

Fair prediction with disparate impact: A study of bias in recidivism prediction instruments

Reranking individuals: The effect of fair classification within-groups

Controlling the False Discovery Rate for Binary Feature Selection via Knockoff

Reverse Information Projections and Optimal E-statistics