Abstract: Identifying the most suitable metric to validate the merits of a newly proposed model is anything but straightforward. Comparing rankings introduces formidable challenges of its own, and a universal metric applicable to all scenarios likely does not exist. Furthermore, metrics designed for specific contexts, such as Recommender Systems, are sometimes carried over to other domains without a thorough grasp of their underlying mechanisms, resulting in unforeseen outcomes and potential misuse. Complicating matters further, distinct metrics may emphasize different aspects of rankings, frequently leading to seemingly contradictory comparisons of model results and undermining the trustworthiness of evaluations. We examine these issues in the domain of ranking evaluation metrics. First, we present instances of inconsistent evaluations, which are sources of potential mistrust in commonly used metrics; by quantifying the frequency of such disagreements, we show that they are common in rankings. Next, we conceptualize rankings using the mathematical formalism of symmetric groups, detaching them from the domains in which the metrics were created; this approach lets us rigorously establish mathematical properties of ranking evaluation metrics that are essential for a deeper understanding of the source of inconsistent evaluations. We then connect our theoretical analysis to practical applications, highlighting which properties are important in each domain where rankings are commonly evaluated. In conclusion, our analysis sheds light on ranking evaluation metrics, highlighting that inconsistent evaluations should not be seen as a source of mistrust but as a sign that we need to choose carefully how to evaluate our models in the future.
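
To make the notion of an inconsistent evaluation concrete, the following minimal sketch treats rankings as permutations of n items (elements of the symmetric group S_n) and compares two predicted rankings against the identity ranking. The choice of metrics (Kendall tau distance and NDCG with linear gains), the function names, and the example permutations are assumptions made for illustration only; they are not taken from the paper.

```python
from itertools import combinations, permutations
from math import log2


def kendall_tau_distance(ranking):
    """Count item pairs placed in the opposite order to the identity ranking."""
    return sum(1 for i, j in combinations(range(len(ranking)), 2)
               if ranking[i] > ranking[j])


def ndcg(ranking):
    """NDCG against the identity ranking, with assumed linear gains n-1-item."""
    n = len(ranking)
    dcg = sum((n - 1 - item) / log2(pos + 2) for pos, item in enumerate(ranking))
    ideal = sum((n - 1 - pos) / log2(pos + 2) for pos in range(n))
    return dcg / ideal


def disagreement_rate(n):
    """Fraction of unordered pairs in S_n that the two metrics order oppositely."""
    perms = list(permutations(range(n)))
    scores = [(kendall_tau_distance(p), ndcg(p)) for p in perms]
    pairs = list(combinations(scores, 2))
    # Lower Kendall distance is better; higher NDCG is better.
    flips = sum(1 for (ka, na), (kb, nb) in pairs
                if (ka < kb and na < nb) or (ka > kb and na > nb))
    return flips / len(pairs)


# ranking[pos] is the item shown at position pos; items are named by true rank.
a = (1, 0, 2, 3, 4)  # a single swap at the top of the list
b = (0, 1, 4, 3, 2)  # the bottom three items reversed
print(kendall_tau_distance(a), kendall_tau_distance(b))
print(ndcg(a), ndcg(b))
print(disagreement_rate(5))
```

Under these assumptions, Kendall tau distance prefers `a` (1 inverted pair against 3 for `b`), while NDCG prefers `b`, because its position discount penalizes the swap at the top more heavily than the reversal at the bottom. `disagreement_rate` enumerates every pair of permutations, so it is only practical for small n.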

Pairwise, Magnitude, or Stars: What's the Best Way for Crowds to Rate?

A Rating-Ranking Method for Crowdsourced Top-k Computation

Crowdsourcing subjective annotations using pairwise comparisons reduces bias and error compared to the majority-vote method

Crowdsourcing with Difficulty: A Bayesian Rating Model for Heterogeneous Items

IMDB-WIKI-SbS: An Evaluation Dataset for Crowdsourced Pairwise Comparisons

Do I Follow My Friends or the Crowd? Information Cascades in Online Movie Ratings

Scaling preferences using probabilistic choice models: is there a ratio-scale representation of subjective liking?

Simple Surveys: Response Retrieval Inspired by Recommendation Systems

Performance Comparison of Algorithms for Movie Rating Estimation

Approximating Metric Magnitude of Point Sets

Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction

How Many Ratings per Item are Necessary for Reliable Significance Testing?

A practical guide and software for analysing pairwise comparison experiments

Listwise Approach For Rank Aggregation In Crowdsourcing

Ranking a Set of Objects using Heterogeneous Workers: QUITE an Easy Problem

On the equivalence of two mixture models for rating data

Ranking evaluation metrics from a group-theoretic perspective

Spot Check Equivalence: an Interpretable Metric for Information Elicitation Mechanisms

Ranking Consistent Rate: New Evaluation Criterion on Pairwise Subjective Experiments

A recommender network perspective on the informational value of critics and crowds

User tendency-based rating scaling in online trading networks