Marc Lanctot, Kate Larson, Michael Kaisers, Quentin Berthet, Ian Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop, Doina Precup
Abstract: A common way to drive progress of AI models and agents is to compare their performance on standardized benchmarks. Comparing the performance of general agents requires aggregating their individual performances across a potentially wide variety of different tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, a solution to Condorcet's original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO achieves accurate approximations to the ground truth ranking and the best among several baselines when 59% or more of the preference data is missing. Finally, SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.
What problem does this paper attempt to address?
The paper addresses how to rank general agents whose performance is measured across a wide variety of tasks. The authors propose a new ranking scheme, Soft Condorcet Optimization (SCO), which computes the optimal ranking: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data.
### Problem Background
In artificial intelligence, agent performance is usually evaluated on standardized benchmarks. For general agents, performance must be aggregated across many different tasks. Traditional rating systems such as Elo are widely used for pairwise comparisons, but they have limitations in multi-task, multi-agent settings, especially when the data is incomplete or unevenly distributed.
### Main Contributions of the Paper
1. **Introduction of the SCO Ranking Scheme**:
- Three optimization methods are proposed to find ratings and the corresponding rankings: gradient descent on a soft Kendall-tau distance (the "sigmoid loss"), a Fenchel-Young loss (perturbed optimization), and a branch-and-bound method for solving the sigmoidal program.
- An online form is provided that updates ratings from a single result, so rankings can be adjusted in real time.
- It is proved that when a Condorcet winner exists, the top-ranked agent under the sigmoid loss is the Condorcet winner.
2. **Empirical Evaluation**:
- SCO is shown to handle failure modes of the classical Elo rating system: even when a Condorcet winner exists, Elo may not rank it first.
- In a simulated noisy tournament setting, SCO approximates the ground-truth ranking accurately and is the best among several baselines when 59% or more of the preference data is missing.
- On the Diplomacy dataset, SCO rankings are closer to the optimal ranking, measured on held-out test sets, than those of Elo and the other voting-based methods evaluated.
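Several of the points above depend on the notion of a Condorcet winner: an agent that beats every other agent in a head-to-head majority comparison. The following is a minimal sketch of how such a winner could be detected from a list of votes. The function name `condorcet_winner` and the vote encoding (lists ordered from best to worst) are assumptions made for illustration, not taken from the paper.

```python
from itertools import combinations


def condorcet_winner(votes):
    """Return the Condorcet winner of a list of rankings, or None.

    Each vote is a list of agents ordered from most to least preferred.
    Votes may be partial: a pair is counted only in votes that rank both agents.
    """
    agents = sorted({a for vote in votes for a in vote})
    # wins[(a, b)] = number of votes that rank a above b
    wins = {(a, b): 0 for a in agents for b in agents if a != b}
    for vote in votes:
        # combinations() preserves list order, so `above` precedes `below`
        for above, below in combinations(vote, 2):
            wins[(above, below)] += 1
    for a in agents:
        # a must win a strict majority against every other agent
        if all(wins[(a, b)] > wins[(b, a)] for b in agents if b != a):
            return a
    return None


votes = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
winner = condorcet_winner(votes)

# A preference cycle (A beats B, B beats C, C beats A) has no Condorcet winner.
cycle = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
no_winner = condorcet_winner(cycle)
```

In the first profile `A` wins both of its pairwise contests, so it is the Condorcet winner. The cyclic profile shows why the guarantee is conditional on such a winner existing.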
### Mathematical Formula Representation
- **Kendall-tau Distance**:
\[
K_d(\pi_1, \pi_2) = \sum_{\{i,j\} \in C_2(S_1)} \bar{K}_{i,j}(\pi_1, \pi_2)
\]
where \(C_2(S_1)\) is the set of unordered pairs of elements of \(S_1\), the set of elements being ranked, and \(\bar{K}_{i,j}(\pi_1, \pi_2)\) equals 1 when \(\pi_1\) and \(\pi_2\) place \(i\) and \(j\) in opposite relative order and 0 otherwise, so \(K_d\) counts the pairs on which the two rankings disagree.
- **Normalized Kendall-tau Distance**:
\[
K_n(\pi_1, \pi_2) = \frac{2K_d(\pi_1, \pi_2)}{|S_1|(|S_1|-1)}
\]
- **SCO Loss Function**:
\[
\tilde{L}([\succeq], A, V, \theta) = \sum_{v \in [\succeq]} \sum_{(i,j) \in I_2(v)} \tilde{D}_v(\theta_v[i], \theta_v[j])
\]
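where \(\tilde{D}_v(\theta_a, \theta_b) = \sigma(\theta_b - \theta_a) = \frac{1}{1 + e^{(\theta_a - \theta_b)/\tau}}\). Here \(\sigma\) is the logistic sigmoid with temperature \(\tau\), and the outer sum ranges over the votes \(v\) in the preference profile \([\succeq]\). Each term approaches 0 when \(\theta_a\) is much larger than \(\theta_b\) and 1 in the opposite case, which makes the loss a smooth surrogate for counting mis-ordered pairs.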
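
The formulas above can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes that each vote is a list ordered from best to worst, that \(I_2(v)\) enumerates pairs in which the first agent is ranked above the second, and that plain full-batch gradient descent is used with arbitrarily chosen `steps`, `lr`, and `tau`. All function names are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau_distance(pi1, pi2):
    """Count the pairs that two rankings of the same items order differently."""
    pos1 = {x: r for r, x in enumerate(pi1)}
    pos2 = {x: r for r, x in enumerate(pi2)}
    return sum(
        1
        for i, j in combinations(pi1, 2)
        if (pos1[i] - pos1[j]) * (pos2[i] - pos2[j]) < 0
    )


def normalized_kendall_tau(pi1, pi2):
    """Divide by the number of pairs, n(n-1)/2; assumes at least two items."""
    n = len(pi1)
    return 2 * kendall_tau_distance(pi1, pi2) / (n * (n - 1))


def sigmoid(x, tau=1.0):
    return 1.0 / (1.0 + math.exp(-x / tau))


def sco_loss(votes, theta, tau=1.0):
    """Sigmoid loss: sum of sigma(theta_b - theta_a) over pairs with a above b."""
    return sum(
        sigmoid(theta[b] - theta[a], tau)
        for vote in votes
        for a, b in combinations(vote, 2)  # a is ranked above b in this vote
    )


def fit_ratings(votes, steps=500, lr=0.5, tau=1.0):
    """Full-batch gradient descent on the sigmoid loss (illustrative settings)."""
    theta = {a: 0.0 for vote in votes for a in vote}
    for _ in range(steps):
        grad = {a: 0.0 for a in theta}
        for vote in votes:
            for a, b in combinations(vote, 2):
                s = sigmoid(theta[b] - theta[a], tau)
                g = s * (1.0 - s) / tau  # derivative of the sigmoid term
                grad[b] += g  # raising theta_b increases the loss
                grad[a] -= g  # raising theta_a decreases the loss
        for a in theta:
            theta[a] -= lr * grad[a]
    return theta


votes = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
theta = fit_ratings(votes)
ranking = sorted(theta, key=theta.get, reverse=True)
# Normalized distance from the fitted ranking to each individual vote
distances = [normalized_kendall_tau(ranking, v) for v in votes]
```

On this small profile the fitted ranking places `A` first, in line with the Condorcet-winner property described earlier. The step count and learning rate would need tuning on real data.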
### Summary
By introducing Soft Condorcet Optimization (SCO), the paper addresses the problem of ranking general agents in a multi-task environment. SCO handles incomplete datasets and, under the sigmoid loss, ranks the Condorcet winner first whenever one exists, which improves the accuracy and robustness of the evaluation.