Identification of Signal, Noise, and Indistinguishable Subsets in High-Dimensional Data Analysis

X. Jessie Jeng
DOI: https://doi.org/10.48550/arXiv.1305.0220
2013-05-02
Abstract: Motivated by applications in high-dimensional data analysis where strong signals often stand out easily and weak ones may be indistinguishable from the noise, we develop a statistical framework that provides a novel categorization of the data into signal, noise, and indistinguishable subsets. The three-subset categorization is especially relevant in high dimensions, where a large proportion of signals can be obscured by the large amount of noise. Understanding the three-subset phenomenon helps researchers in real applications design efficient follow-up studies. For example, candidates belonging to the signal subset may have priority for more focused study, while those in the noise subset can be removed; for candidates in the indistinguishable subset, additional data may be collected to further separate weak signals from the noise. We develop an efficient data-driven procedure to identify the three subsets. Theoretical study shows that, under certain conditions and with high probability, only signals are included in the identified signal subset, while the remaining signals are included in the identified indistinguishable subset. Moreover, the proposed procedure adapts to the unknown signal intensity, so that the identified indistinguishable subset shrinks with the true indistinguishable subset as signals become stronger. The procedure is examined and compared with methods based on FDR control using Monte Carlo simulation. Further, it is applied successfully in a real-data application to identify genomic variants with different signal intensities.
Methodology