Abstract: Semantic Embedding Model (SEM), a neural network-based Siamese architecture, is gaining momentum in information retrieval and natural language processing. In order to train SEM in a supervised fashion for Web search, the search engine query log is typically utilized to automatically formulate pairwise judgments as training data. Despite the growing application of semantic embeddings in the search engine industry, little work has been done on formulating effective pairwise judgments for training SEM. In this paper, we make the first in-depth investigation of a wide range of strategies for generating pairwise judgments for SEM. An interesting (perhaps surprising) discovery reveals that the conventional pairwise judgment formulation strategy widely used in the field of pairwise Learning-to-Rank (LTR) is not necessarily effective for training SEM. Through a large-scale empirical study based on query logs and click-through activities from a major commercial search engine, we demonstrate the effective strategies for SEM and highlight the advantages of a hybrid heuristic (i.e., Clicked > Non-Clicked) in comparison to the atomic heuristics (e.g., Clicked > Skipped) in LTR. We conclude with best practices for training SEM and offer promising insights for future research.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to automatically formulate high-quality pairwise judgments for the Semantic Embedding Model (SEM) in Web search. Specifically, the researchers aim to generate effective pairwise training examples from information such as user click behavior in search engine query logs, so that SEM can be trained in a supervised manner.

### Problem Background

1. **Applications of the Semantic Embedding Model (SEM)**:
   - SEM is a Siamese architecture based on neural networks and has attracted increasing attention in information retrieval and natural language processing.
   - In Web search, SEM improves the relevance of search results by learning the semantic similarity between queries and web page titles.
2. **Importance of Pairwise Judgments**:
   - A pairwise judgment compares two documents or titles according to users' implicit preferences (behaviors such as clicks and skips) and records which of the two is "better" as a training example.
   - In traditional Learning-to-Rank (LTR) methods, strategies for generating pairwise judgments have been widely studied, but their applicability to SEM has not been fully verified.

### Research Motivation

- Although the LTR field has mature strategies for generating pairwise judgments, it is not clear whether these strategies are applicable to SEM.
- The researchers explore pairwise judgment generation strategies suitable for SEM through large-scale experiments, with the goal of improving its training effectiveness.

### Main Contributions

1. **First In-Depth Study**: This is the first in-depth study of pairwise judgment generation strategies for SEM.
2. **Evidence from Large-Scale Experiments**: Large-scale experiments were carried out on the query logs and click data of a major commercial search engine.
3. **Proposing Effective Strategies**: Some new pairwise judgment generation strategies were discovered and verified, especially the "Clicked > Non-Examined" strategy, which is significantly better than traditional strategies.
4. **Advantages of Hybrid Strategies**: A hybrid strategy (such as Clicked > Non-Clicked) was proposed, which further improves the training effect.

### Conclusion

- Traditional LTR pairwise judgment generation strategies are not necessarily applicable to SEM, and dedicated strategies are required to generate effective training data.
- The "Clicked > Non-Examined" strategy performs well in SEM training, and the hybrid strategy (such as Clicked > Non-Clicked) has practical advantages.

Through this research, the authors provide valuable insights for future research and best-practice suggestions for practitioners who want to improve the performance of embedding-based models.
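To make the heuristics named above concrete, here is a minimal Python sketch of how pairwise judgments might be derived from a single logged result list. It is an illustration under stated assumptions, not the paper's implementation: the function name `pairwise_judgments`, the strategy strings, and the input layout (titles in rank order with a parallel list of click flags) are all hypothetical. It also assumes that "skipped" means an unclicked result ranked above a clicked one and that "non-examined" means an unclicked result ranked below the last click; the summary does not spell out these definitions.

```python
from typing import List, Tuple

# (preferred_title, less_preferred_title)
Pair = Tuple[str, str]


def pairwise_judgments(titles: List[str], clicked: List[bool], strategy: str) -> List[Pair]:
    """Build preference pairs from one ranked result list (hypothetical helper).

    titles   -- result titles in the order they were shown
    clicked  -- clicked[i] is True if the result at rank i was clicked
    strategy -- "clicked>skipped", "clicked>non_examined", or "clicked>non_clicked"
    """
    if len(titles) != len(clicked):
        raise ValueError("titles and clicked must have the same length")
    if strategy not in ("clicked>skipped", "clicked>non_examined", "clicked>non_clicked"):
        raise ValueError(f"unknown strategy: {strategy}")

    clicked_ranks = [i for i, c in enumerate(clicked) if c]
    if not clicked_ranks:
        return []  # no click, no preference signal
    last_click = clicked_ranks[-1]

    pairs: List[Pair] = []
    for i in clicked_ranks:
        for j, was_clicked in enumerate(clicked):
            if was_clicked:
                continue
            if strategy == "clicked>skipped":
                # Assumption: skipped = unclicked and ranked above the click.
                keep = j < i
            elif strategy == "clicked>non_examined":
                # Assumption: non-examined = unclicked and below the last click.
                keep = j > last_click
            else:
                # Hybrid: a clicked result beats every unclicked result.
                keep = True
            if keep:
                pairs.append((titles[i], titles[j]))
    return pairs


if __name__ == "__main__":
    shown = ["title A", "title B", "title C", "title D"]
    clicks = [False, True, False, False]
    for name in ("clicked>skipped", "clicked>non_examined", "clicked>non_clicked"):
        print(name, pairwise_judgments(shown, clicks, name))
```

Under these assumed definitions, the hybrid Clicked > Non-Clicked strategy yields every pair produced by the two atomic strategies, plus any pairs involving unclicked results that sit between two clicks.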

Pairwise Judgment Formulation for Semantic Embedding Model in Web Search

Improving Multi-Faceted Book Search by Incorporating Sparse Latent Semantic Analysis of Click-Through Logs

A Multi-View Semi-Supervised Approach for Task-Level Web Search Success Evaluation.

Search Task Success Evaluation by Exploiting Multi-View Active Semi-Supervised Learning.

Sebpr: Semantics Enhanced Bayesian Personalized Ranking With Comparable Item Pairs

Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach

Learning Multi-Stage Multi-Grained Semantic Embeddings for E-Commerce Search

Improving semantic video retrieval models by training with a relevance-aware online mining strategy

Semantic-enhanced discriminative embedding learning for cross-modal retrieval

From Semantic Retrieval to Pairwise Ranking: Applying Deep Learning in E-commerce Search

Large Language Models are Built-in Autoregressive Search Engines

Learning a Semantic Space of Web Search via Session Data

Semantic snippet construction for search engine results based on segment evaluation

Learning a Unified Embedding Space of Web Search from Large-Scale Query Log

Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising

SEMINAR: Search Enhanced Multi-modal Interest Network and Approximate Retrieval for Lifelong Sequential Recommendation

Learning to Rank Query Recommendations by Semantic Similarities

Cross-media semantic representation via bi-directional learning to rank.

Clickage: towards bridging semantic and intent gaps via mining click logs of search engines.

Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies

D2LLM: Decomposed and Distilled Large Language Models for Semantic Search