Pairwise Judgment Formulation for Semantic Embedding Model in Web Search

Mengze Hong, Chen Jason Zhang
2024-08-08
Abstract: Semantic Embedding Model (SEM), a neural network-based Siamese architecture, is gaining momentum in information retrieval and natural language processing. In order to train SEM in a supervised fashion for Web search, the search engine query log is typically utilized to automatically formulate pairwise judgments as training data. Despite the growing application of semantic embeddings in the search engine industry, little work has been done on formulating effective pairwise judgments for training SEM. In this paper, we make the first in-depth investigation of a wide range of strategies for generating pairwise judgments for SEM. An interesting (perhaps surprising) discovery reveals that the conventional pairwise judgment formulation strategy widely used in the field of pairwise Learning-to-Rank (LTR) is not necessarily effective for training SEM. Through a large-scale empirical study based on query logs and click-through activities from a major commercial search engine, we demonstrate the effective strategies for SEM and highlight the advantages of a hybrid heuristic (i.e., Clicked > Non-Clicked) in comparison to the atomic heuristics (e.g., Clicked > Skipped) in LTR. We conclude with best practices for training SEM and offer promising insights for future research.
Information Retrieval, Artificial Intelligence, Databases
What problem does this paper attempt to address?
This paper addresses how to automatically formulate high-quality pairwise judgments for the Semantic Embedding Model (SEM) in Web search. Specifically, the researchers use signals such as user click behavior in search engine query logs to generate effective pairwise training examples, so that SEM can be trained in a supervised manner.

### Problem Background

1. **Applications of the Semantic Embedding Model (SEM)**:
   - SEM is a neural network-based Siamese architecture that has attracted increasing attention in information retrieval and natural language processing.
   - In Web search, SEM improves the relevance of search results by learning the semantic similarity between queries and web page titles.
2. **Importance of Pairwise Judgments**:
   - A pairwise judgment orders two documents (here, web page titles) by the user's implicit preference, inferred from behaviors such as clicks and skips. Each judgment yields one "title A is better than title B" training example.
   - Strategies for generating pairwise judgments have been widely studied for traditional Learning-to-Rank (LTR) methods, but their applicability to SEM has not been fully verified.

### Research Motivation

- Although the LTR field has mature strategies for generating pairwise judgments, it is unclear whether they carry over to SEM.
- The researchers use large-scale experiments to identify pairwise judgment generation strategies suited to SEM and thereby improve its training.

### Main Contributions

1. **First In-Depth Study**: This is the first in-depth study of pairwise judgment generation strategies for SEM.
2. **Evidence from Large-Scale Experiments**: The experiments are based on the query logs and click data of a major commercial search engine.
3. **Proposing Effective Strategies**: New pairwise judgment generation strategies were identified and verified, in particular "Clicked > Non-Examined", which is significantly better than traditional strategies.
4. **Advantages of Hybrid Strategies**: A hybrid strategy (such as Clicked > Non-Clicked) was proposed, which further improves training.

### Conclusion

- Traditional LTR pairwise judgment generation strategies are not necessarily applicable to SEM; strategies tailored to SEM are needed to generate effective training data.
- The "Clicked > Non-Examined" strategy performs well in SEM training, and the hybrid strategy (such as Clicked > Non-Clicked) has practical advantages.

Through this research, the authors provide insights for future research and best-practice suggestions for practitioners seeking to improve the performance of embedding-based models.
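
The three heuristics named in this summary can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the definitions it uses are assumptions, since the summary does not spell them out: a title is treated as *skipped* if it was not clicked and is ranked above a clicked title, as *non-examined* if it is ranked below the last click, and as *non-clicked* if it was not clicked at all. The function name `pairwise_judgments` and the strategy labels are hypothetical.

```python
from typing import List, Set, Tuple

# One judgment: (preferred_title, less_preferred_title)
Pair = Tuple[str, str]


def pairwise_judgments(titles: List[str], clicked: Set[str], strategy: str) -> List[Pair]:
    """Build pairwise judgments from one ranked result list and its clicks.

    Assumed definitions (not taken from the paper):
      "clicked>skipped":      clicked title vs. non-clicked titles ranked above it
      "clicked>non_examined": clicked title vs. titles ranked below the last click
      "clicked>non_clicked":  clicked title vs. every non-clicked title
    """
    clicked_pos = [i for i, t in enumerate(titles) if t in clicked]
    if not clicked_pos:
        return []  # a session without clicks yields no preference signal
    last_click = clicked_pos[-1]

    pairs: List[Pair] = []
    for i in clicked_pos:
        for j, other in enumerate(titles):
            if other in clicked:
                continue  # never pair two clicked titles
            if strategy == "clicked>skipped":
                keep = j < i
            elif strategy == "clicked>non_examined":
                keep = j > last_click
            elif strategy == "clicked>non_clicked":
                keep = True
            else:
                raise ValueError(f"unknown strategy: {strategy}")
            if keep:
                pairs.append((titles[i], other))
    return pairs


# Hypothetical session: five ranked titles, the user clicked the 2nd and 4th.
titles = ["t1", "t2", "t3", "t4", "t5"]
clicked = {"t2", "t4"}
for name in ("clicked>skipped", "clicked>non_examined", "clicked>non_clicked"):
    print(name, pairwise_judgments(titles, clicked, name))
```

Under these assumed definitions the hybrid heuristic draws negatives from both above and below the clicks, so it produces more pairs per session than either atomic heuristic alone.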