Abstract: While Learning to Rank (LTR) models built on transformers have been widely adopted and achieve decent performance, it remains challenging to train them with sufficient data: only an extremely small number of query-webpage pairs can be annotated, compared with the trillions of webpages available online and the billions of web search queries issued every day. Meanwhile, industry research communities have released a number of well-annotated open-source LTR datasets, which however adopt different designs of LTR features and labels (i.e., heterogeneous domains). In this work, inspired by recent progress in pre-training transformers for performance gains, we study the problem of pre-training LTR models using both labeled and unlabeled samples, focusing in particular on the use of well-annotated samples from heterogeneous open-source LTR datasets to improve pre-training. We propose S2phere (Semi-Supervised Pre-training with Heterogeneous LTR data), a set of strategies for pre-training LTR models using both unlabeled and labeled query-webpage pairs across heterogeneous LTR datasets. S2phere consists of a three-step approach: (1) Semi-supervised Feature Extraction Pre-training via Perturbed Contrastive Loss, (2) Cross-domain Ranker Pre-training over Heterogeneous LTR Datasets and (3) End-to-end LTR Fine-tuning via Modular Network Composition.
Specifically, given an LTR model composed of a backbone (the feature extractor), a neck (the module that reasons about the orders) and a head (the predictor of ranking scores), S2phere uses unlabeled and labeled data from the search engine to pre-train the backbone in Step (1) via semi-supervised learning; Step (2) then incorporates multiple heterogeneous open-source LTR datasets to improve pre-training of the neck module, whose parameters are shared across domains in cross-domain learning; finally, Step (3) composes the backbone and neck with a randomly initialized head into a complete LTR model and fine-tunes it on search engine data with various learning strategies. We conducted extensive offline experiments and an online A/B test on the Baidu search engine. Comparisons against a number of baseline algorithms confirm the advantages of S2phere in producing high-performance LTR models for web-scale search.
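The modular composition in Step (3) can be illustrated with a minimal sketch. This is not the authors' implementation. The `Linear` layer, the `LTRModel` class, the `compose_for_finetuning` helper, the layer sizes and the toy feature vectors are all hypothetical stand-ins, chosen only to show a pre-trained backbone and neck being reused while the head starts from random weights.

```python
import random


class Linear:
    """Tiny dense layer on plain Python lists (hypothetical stand-in for a real module)."""

    def __init__(self, n_in, n_out, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]


def relu(v):
    return [max(0.0, a) for a in v]


class LTRModel:
    """Backbone (feature extractor) -> neck (order reasoning) -> head (score predictor)."""

    def __init__(self, backbone, neck, head):
        self.backbone = backbone
        self.neck = neck
        self.head = head

    def score(self, features):
        hidden = relu(self.backbone(features))
        reasoned = relu(self.neck(hidden))
        return self.head(reasoned)[0]


def compose_for_finetuning(pretrained_backbone, pretrained_neck, hidden_dim, seed=0):
    """Reuse the pre-trained backbone and neck; attach a randomly initialized head."""
    head = Linear(hidden_dim, 1, random.Random(seed))
    return LTRModel(pretrained_backbone, pretrained_neck, head)


# Assumption: these two modules stand in for the outputs of Steps (1) and (2).
rng = random.Random(42)
backbone = Linear(4, 8, rng)
neck = Linear(8, 8, rng)

model = compose_for_finetuning(backbone, neck, hidden_dim=8)

# Toy feature vectors for three candidate webpages of one query (made-up values).
docs = [[0.2, 0.1, 0.9, 0.3], [0.5, 0.4, 0.1, 0.0], [0.9, 0.8, 0.7, 0.6]]
scores = [model.score(d) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this sketch, fine-tuning would then update the composed model on search engine data. The training loop is omitted because the abstract does not specify the learning strategies used.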

S2phere: Semi-Supervised Pre-training for Web Search over Heterogeneous Learning to Rank Data
