CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference

Anirban Mukherjee, Hannah Hanwen Chang
2024-04-12
Abstract: Social science research often hinges on the relationship between categorical variables and outcomes. We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold. Our theoretical and numerical analyses outline challenges posed by such categorical variables in causal inference. Specifically, dynamically varying and sparse levels can lead to violations of the Donsker conditions and a failure of the estimation functionals to converge to a tight Gaussian process. Traditional approaches, including the exclusion of rare categorical levels and principled variable selection models like LASSO, fall short. CAVIAR embeds the data into a lower-dimensional global coordinate system. The mapping can be derived from both structured and unstructured data, and ensures stable and robust estimates through dimensionality reduction. In a dataset of direct-to-consumer apparel sales, we illustrate how high-dimensional categorical variables, such as zip codes, can be succinctly represented, facilitating inference and analysis.
Econometrics, Machine Learning
What problem does this paper attempt to address?
The paper addresses the difficulty of causal inference in social science research when categorical variables are high-dimensional and sparse. Traditional remedies, such as excluding rare levels or applying variable selection with LASSO, fall short when the levels of a categorical variable are sparse and vary dynamically. The paper proposes CAVIAR, which embeds the data into a lower-dimensional global coordinate system so that estimates for high-dimensional categorical variables remain stable and robust. Its illustration is a dataset of direct-to-consumer apparel sales, in which zip codes are given a succinct representation that facilitates inference and analysis.
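To make the idea of replacing a high-dimensional categorical variable with low-dimensional coordinates concrete, the following is a minimal sketch and not the CAVIAR algorithm itself, which this summary does not specify. It assumes each level (for example, a zip code) comes with a vector of side-information features, and it uses plain PCA as a linear stand-in for the manifold embedding. The function name `embed_levels`, the synthetic data, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np

def embed_levels(level_features, n_components):
    """Map each categorical level to low-dimensional coordinates.

    Assumption: PCA (computed via SVD) is used here as a simple linear
    stand-in; the paper's actual embedding procedure may differ.
    """
    X = np.asarray(level_features, dtype=float)
    X_centered = X - X.mean(axis=0)               # center each feature
    U, S, _ = np.linalg.svd(X_centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]  # principal-component scores

# Synthetic example (hypothetical): 500 levels whose 20 observed features
# are generated from a 2-dimensional latent structure plus small noise.
rng = np.random.default_rng(0)
n_levels, n_features, latent_dim = 500, 20, 2
latent = rng.normal(size=(n_levels, latent_dim))
loading = rng.normal(size=(latent_dim, n_features))
noise = 0.01 * rng.normal(size=(n_levels, n_features))
level_features = latent @ loading + noise

# One row of coordinates per level.
coords = embed_levels(level_features, n_components=2)

# Each observation carries a level id; look up its coordinates instead of
# building a 500-column one-hot design matrix.
level_ids = rng.integers(0, n_levels, size=2000)
X_embedded = coords[level_ids]

print(coords.shape, X_embedded.shape)
```

In this sketch the downstream regression would use the two columns of `X_embedded` rather than 500 dummy variables, so rarely observed levels borrow strength from levels with similar features instead of each requiring its own coefficient.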