Model-free feature screening based on Hellinger distance for ultrahigh dimensional data

Wu, Jiujing
DOI: https://doi.org/10.1007/s00362-024-01615-4
2024-11-03
Statistical Papers
Abstract: With the explosive development of data acquisition and processing technology, feature dimensions increase exponentially with sample size, posing significant challenges for data analysis. It is crucial to accurately identify the useful features among the thousands available. In this paper, we develop an omnibus model-free feature screening procedure based on the Hellinger distance, which offers several appealing merits. First, we define the Hellinger distance index for discrete response variables in discriminant analysis. Second, the procedure works consistently for continuous response variables, where the responses are discretized using a slice-and-fused technique. Third, it is robust against potential outliers and model misspecification. Theoretically, the procedure for both discrete and continuous response variables exhibits the sure screening and ranking consistency properties under mild conditions. Numerical studies show that the procedure is highly competitive for heavy-tailed and skewed data and remains comparable to existing approaches for light-tailed data, indicating robust performance across various data types. Analyses of two real data sets, one with a discrete response variable and the other with a continuous response variable, demonstrate the effectiveness of the proposed method.
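To make the screening idea concrete, the following is a minimal sketch of marginal screening with a Hellinger-type utility. It is an illustration under stated assumptions, not the paper's estimator: the abstract does not give the exact index, so the sketch assumes class-conditional distributions are approximated by histograms on quantile bins and that a feature's score is the largest pairwise Hellinger distance between classes. The names `hellinger`, `hellinger_screen`, and `slice_response` are hypothetical, and `slice_response` performs plain quantile slicing only, without the fusion step of the slice-and-fused technique.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions p and q."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hellinger_screen(X, y, n_bins=10):
    """Score each column of X by the largest pairwise Hellinger distance
    between its class-conditional histograms.

    Assumptions (not from the paper): continuous features, at least two
    classes in y, and quantile-based bins shared across classes.
    """
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        # Quantile bin edges keep the bins populated under heavy tails.
        edges = np.quantile(X[:, j], np.linspace(0.0, 1.0, n_bins + 1))
        hists = []
        for c in classes:
            counts, _ = np.histogram(X[y == c, j], bins=edges)
            hists.append(counts / counts.sum())
        scores[j] = max(
            hellinger(hists[a], hists[b])
            for a in range(len(classes))
            for b in range(a + 1, len(classes))
        )
    return scores

def slice_response(y, n_slices=4):
    """Discretize a continuous response into quantile slices (labels 0..n_slices-1)."""
    cuts = np.quantile(y, np.linspace(0.0, 1.0, n_slices + 1)[1:-1])
    return np.digitize(y, cuts)

# Toy usage: heavy-tailed noise features, with feature 0 shifted by class.
rng = np.random.default_rng(0)
n, p = 200, 50
y = rng.integers(0, 2, size=n)
X = rng.standard_t(df=2, size=(n, p))
X[:, 0] += 3.0 * y
scores = hellinger_screen(X, y)
ranking = np.argsort(scores)[::-1]  # feature indices, most relevant first
```

For a continuous response, the same scoring function could be applied to `slice_response(y)` in place of `y`. In practice a screening procedure of this kind keeps only the top-ranked features, with the cutoff chosen by the analyst.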
statistics & probability