A Study of Per-Topic Variance on System Comparison.

Meng Yang, Peng Zhang, Dawei Song
DOI: https://doi.org/10.1145/3209978.3210122
2018-01-01
Abstract: Under the notion that the document collection is a sample from a population, the observed per-topic metric value (e.g., AP) varies across samples, which gives rise to per-topic variance. The results of a system comparison, such as ranking systems by a summary metric (e.g., MAP) or testing whether two systems differ significantly, are affected by this variability in per-topic metric values. In this paper, we study the effect of per-topic variance on system comparison. To measure this effect, we employ two ranking-based methods, Error Rate (ER) and the Kendall Rank Correlation Coefficient (KRCC), as well as two significance-test-based methods, Achieved Significance Level (ASL) and Estimated Difference (ED). We conduct an empirical comparison of systems that participated in the TREC Robust and Adhoc tracks. The results show that the effect of per-topic variance on the ranking of systems is not obvious, while the significance-test-based comparisons are susceptible to per-topic variance.
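To make the ranking-based comparison concrete, the following is a minimal sketch of how a Kendall rank correlation between two MAP-based system rankings could be computed. It is an illustration only, not the authors' procedure: the system names, the per-topic AP values, and the helper names `mean` and `kendall_tau` are all hypothetical, and the sketch uses the plain tau-a form without tie correction.

```python
from itertools import combinations


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by the same systems.

    Counts concordant and discordant system pairs; ties count as neither.
    """
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-topic AP values for three systems, observed on two
# different samples of the document collection (assumed data).
sample_1 = {
    "sysA": [0.30, 0.50, 0.40],
    "sysB": [0.20, 0.40, 0.30],
    "sysC": [0.10, 0.20, 0.30],
}
sample_2 = {
    "sysA": [0.25, 0.45, 0.35],
    "sysB": [0.30, 0.50, 0.40],
    "sysC": [0.10, 0.25, 0.25],
}

# MAP is the mean of the per-topic AP values for each system.
map_1 = {system: mean(ap) for system, ap in sample_1.items()}
map_2 = {system: mean(ap) for system, ap in sample_2.items()}

# Agreement between the two MAP-induced rankings.
tau = kendall_tau(map_1, map_2)
print(f"Kendall tau between the two rankings: {tau:.3f}")
```

In this toy data only the pair (sysA, sysB) swaps order between the two samples, so two of the three pairs are concordant and one is discordant. A value near 1 would indicate that the system ranking is largely stable across samples.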