Abstract: Many important tasks of large-scale recommender systems can be naturally cast as testing multiple linear forms for noisy matrix completion. These problems, however, present unique challenges because of the subtle bias-and-variance tradeoff and the intricate dependence among the estimated entries induced by the low-rank structure. In this paper, we develop a general approach to overcome these difficulties by introducing new statistics for individual tests with sharp asymptotics both marginally and jointly, and utilizing them to control the false discovery rate (FDR) via a data splitting and symmetric aggregation scheme. We show that valid FDR control can be achieved with guaranteed power under nearly optimal sample size requirements using the proposed methodology. Extensive numerical simulations and real data examples are also presented to further illustrate its practical merits.
What problem does this paper attempt to address?
This paper addresses the problem of testing multiple linear forms in large-scale recommender systems. When the data matrix is noisy and only partially observed, the question is how to control the false discovery rate (FDR) while retaining power. Two features make this hard: the estimated entries are dependent on one another because of the low-rank structure, and the estimators involve a subtle bias-variance tradeoff. The paper develops a general method that handles both.
### Main problems in the paper
1. **Bias-variance tradeoff and dependence in multiple testing**:
- A main challenge in testing multiple linear forms for recommender systems is that the low-rank structure induces intricate dependence among the estimated entries, while the estimators themselves require balancing bias against variance.
2. **Controlling the False Discovery Rate (FDR)**:
- The paper proposes new statistics for the individual tests and controls the FDR through a data splitting and symmetric aggregation scheme. The method achieves valid FDR control with guaranteed power under nearly optimal sample size requirements.
3. **Improving power**:
- To improve power, the paper introduces a more accurate variance estimation method. With it, the new statistic converges to the normal distribution more quickly, both marginally and jointly, which makes it better suited to multiple testing.
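For reference, the two quantities discussed above can be written in standard multiple testing notation. The symbols here are notational assumptions for illustration and are not taken from the paper: $\mathcal{H}_0$ is the set of true null hypotheses, $\mathcal{H}_1$ the set of non-nulls, and $\widehat{\mathcal{S}}$ the set of rejected hypotheses.

$$
\mathrm{FDR} = \mathbb{E}\left[\frac{|\widehat{\mathcal{S}} \cap \mathcal{H}_0|}{\max(|\widehat{\mathcal{S}}|, 1)}\right],
\qquad
\mathrm{Power} = \mathbb{E}\left[\frac{|\widehat{\mathcal{S}} \cap \mathcal{H}_1|}{\max(|\mathcal{H}_1|, 1)}\right].
$$

Controlling the FDR at level $\alpha$ means guaranteeing $\mathrm{FDR} \le \alpha$, at least asymptotically, while keeping the power as large as possible.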
### Solutions
- **Introduction of a new statistic**:
- The paper proposes a new statistic that builds on recent developments in single-entry inference. A more accurate variance characterization makes the statistic converge to the normal distribution at a faster rate, both marginally and jointly.
- **Data splitting and symmetric aggregation**:
- The paper shows how to use these new statistics to control the FDR through data splitting and symmetric aggregation. The data are divided into two subsamples, each of which yields one of two independent, symmetric statistics for every hypothesis. The two statistics are then aggregated through their product, which further improves power.
- **Explicit correlation characterization**:
- The paper also analyzes the dependence between different test statistics in detail. By characterizing these correlations explicitly, it proposes "whitening" and "screening" methods that relax the conditions required for FDR control.
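The data splitting and product-based aggregation step can be illustrated with a minimal sketch. It assumes that two independent statistics per hypothesis (`t1`, `t2`, one from each subsample) are already available and are approximately symmetric around zero under the null. The function name `sda_select` and the thresholding rule, a generic symmetric-aggregation rule that uses the count of large negative products to estimate the number of false positives, are illustrative assumptions and not necessarily the paper's exact procedure.

```python
import numpy as np

def sda_select(t1, t2, alpha=0.1):
    """Select hypotheses from two independent statistics per hypothesis.

    Hypothetical helper: a generic symmetric-aggregation rule, not the
    paper's exact procedure.
    """
    t1 = np.asarray(t1, dtype=float)
    t2 = np.asarray(t2, dtype=float)
    # Product aggregation: large and positive when both halves agree on a
    # nonzero signal; symmetric around zero under the null.
    w = t1 * t2
    # Candidate thresholds are the observed nonzero magnitudes, ascending.
    candidates = np.unique(np.abs(w[w != 0]))
    for t in candidates:
        # By symmetry, the count of w <= -t estimates the false positives
        # among the w >= t; the +1 makes the estimate conservative.
        false_est = 1 + np.sum(w <= -t)
        discoveries = max(np.sum(w >= t), 1)
        if false_est / discoveries <= alpha:
            return np.flatnonzero(w >= t), float(t)
    # No threshold meets the target level: reject nothing.
    return np.array([], dtype=int), float("inf")

# Toy input: 20 strong signals followed by 4 null-like entries.
t1 = [3.0] * 20 + [1.0, -1.0, 1.0, -1.0]
t2 = [3.0] * 20 + [1.0, 1.0, 1.0, 1.0]
selected, threshold = sda_select(t1, t2, alpha=0.1)
```

The smallest threshold that meets the target level is returned, so the rule rejects as many hypotheses as the estimated false discovery proportion allows.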
### Theoretical guarantees
- **FDR control**:
- Under certain conditions, the paper proves that the proposed multiple testing method controls the FDR and that, when the signals are strong, its power is close to 1.
- **Sample size and signal - to - noise ratio requirements**:
- The paper points out that the sample size and signal-to-noise ratio requirements of its method are comparable to those of existing estimation methods. Under weak correlation, this means the FDR can be controlled whenever the underlying matrix can be consistently recovered.
In summary, by introducing new statistics and an improved multiple testing procedure, the paper addresses the key difficulties of testing multiple linear forms in large-scale recommender systems, with its main contributions being FDR control and improved power.