Abstract: Following the success of Cranfield-like approaches to evaluation in web search, web image search has also been evaluated with absolute judgments of (graded) relevance. However, recent research has found that collecting absolute relevance judgments may be difficult in image search scenarios due to the multi-dimensional nature of relevance for image results. Moreover, existing evaluation metrics based on absolute relevance judgments do not correlate well with users' perceived satisfaction in web image search. Unlike absolute relevance judgments, preference judgments do not require that relevance grades be pre-defined, i.e., how many levels to use and what those levels mean. Instead of considering each document in isolation, preference judgments consider a pair of documents and require judges to state their relative preference. Such preference judgments are usually more reliable than absolute judgments since the presence of (at least) two items establishes a certain context. While preference judgments have been studied extensively for general web search, there exists no thorough investigation of how preference judgments and preference-based evaluation metrics can be used to evaluate web image search systems. Compared to general web search, web image search may be an even better fit for preference-based evaluation because of its grid-based presentation style. The limited need for fresh results in web image search also makes preference judgments more reusable than in general web search. In this paper, we provide a thorough comparison of variants of preference judgments for web image search. We find that, compared to strict preference judgments, weak preference judgments require less time and have better inter-assessor agreement. We also study how the absolute relevance levels of two given images affect preference judgments between them. Furthermore, we propose a preference-based evaluation metric named Preference-Winning-Penalty (PWP) to evaluate and compare two different image search systems. The proposed PWP metric outperforms existing evaluation metrics based on absolute relevance judgments in terms of agreement with the system-level preferences of actual users.
