Abstract: The unbiased learning to rank (ULTR) problem has been greatly advanced by recent deep learning techniques and well-designed debiasing algorithms. However, promising results on the existing benchmark datasets may not extend to practical scenarios because of the following shortcomings of those popular benchmark datasets: (1) outdated semantic feature extraction, where state-of-the-art large-scale pre-trained language models like BERT cannot be exploited because the original text is missing; (2) incomplete display features for in-depth study of ULTR, e.g., the displayed abstract of documents, needed for analyzing the click necessary bias, is missing; (3) a lack of real-world user feedback, leading to the prevalence of synthetic datasets in empirical studies. To overcome these shortcomings, we introduce the Baidu-ULTR dataset. It comprises 1.2 billion randomly sampled search sessions and 7,008 expert-annotated queries, which is orders of magnitude larger than existing datasets. Baidu-ULTR provides: (1) the original semantic features and a pre-trained language model for easy usage; (2) sufficient display information such as position, displayed height, and displayed abstract, enabling the comprehensive study of different biases with advanced techniques such as causal discovery and meta-learning; and (3) rich user feedback on search result pages (SERPs), such as dwell time, allowing for user engagement optimization and promoting the exploration of multi-task learning in ULTR. In this paper, we present the design principle of Baidu-ULTR and the performance of benchmark ULTR algorithms on this new data resource, favoring the exploration of ranking for long-tail queries and pre-training tasks for ranking. The Baidu-ULTR dataset and the corresponding baseline implementation are available at https://github.com/ChuXiaokai/baidu_ultr_dataset.

Sogou-QCL

Domain-specific Cross-Language Relevant Question Retrieval

Training Deep Ranking Model with Weak Relevance Labels

SogouQ: The First Large-Scale Test Collection with Click Streams Used in a Shared-Task Evaluation

Investigating Weak Supervision in Deep Ranking

Sogou-ST: A New Dataset with Large-scale Refined Real-world Web Search Sessions

Quda: Natural Language Queries for Visual Data Analytics

CoSQA: 20,000+ Web Queries for Code Search and Question Answering

Query clustering for learning to rank models on web search

A Large Scale Search Dataset for Unbiased Learning to Rank

Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking

Relevance Estimation with Multiple Information Sources on Search Engine Result Pages

Query Performance Prediction using Relevance Judgments Generated by Large Language Models

Response Enhanced Semi-supervised Dialogue Query Generation

Automatic Search Engine Performance Evaluation With The Wisdom Of Crowds

$Q_{bias}$ -- A Dataset on Media Bias in Search Queries and Query Suggestions

Pretrained Language Model based Web Search Ranking: From Relevance to Satisfaction

Learning to Rank Collections

Explicit and Implicit Semantic Ranking Framework

SogouT-16