Abstract: Relevance modeling between queries and items stands as a pivotal component in commercial search engines, directly affecting the user experience. Given the remarkable achievements of large language models (LLMs) in various natural language processing (NLP) tasks, LLM-based relevance modeling is gradually being adopted within industrial search systems. Nevertheless, foundational LLMs lack domain-specific knowledge and do not fully exploit the potential of in-context learning. Furthermore, structured item text remains underutilized, and there is a shortage in the supply of corresponding queries and background knowledge. We thereby propose CPRM (Continual Pre-training for Relevance Modeling), a framework designed for the continual pre-training of LLMs to address these issues. Our CPRM framework includes three modules: 1) employing both queries and multi-field items to jointly pre-train for enhancing domain knowledge, 2) applying in-context pre-training, a novel approach where LLMs are pre-trained on a sequence of related queries or items, and 3) conducting reading comprehension on items to produce associated domain knowledge and background information (e.g., generating summaries and corresponding queries) to further strengthen LLMs. Results on offline experiments and online A/B testing demonstrate that our model achieves convincing performance compared to strong baselines.

What problem does this paper attempt to address?

This paper addresses query-item relevance modeling in commercial search engines. Although large language models (LLMs) have achieved remarkable success across natural language processing tasks, the paper identifies four challenges in applying them to relevance modeling in commercial search:

1. **Lack of domain-specific knowledge**: Foundational LLMs are pre-trained on broad, general-purpose data sources with no particular focus on any one domain, so they lack domain-specific knowledge.
2. **Semantic gap**: Queries are usually short and colloquial, while items are usually described in longer, more formal text, which creates a "semantic gap" between their representations.
3. **Task-agnostic pre-training**: LLM pre-training is "task-agnostic", which weakens its connection to downstream tasks and rules out in-context pre-training enhancements tailored to those tasks.
4. **Underutilization of structured data**: Item data is highly structured and hard to exploit, which keeps LLMs from realizing their full potential.

To address these problems, the authors propose CPRM (Continual Pre-training for Relevance Modeling), a framework for the continual pre-training of LLMs to improve relevance modeling. It consists of three modules:

1. **Domain Knowledge Enhancement (DKE)**: Jointly pre-train on queries and multi-field items to enhance domain knowledge.
2. **In-Context Pre-training (ICP)**: Pre-train LLMs on sequences of related queries or items, an approach the authors present as novel.
3. **Reading Comprehension Distillation (RCD)**: Perform reading comprehension on items to generate associated domain knowledge and background information (for example, summaries and corresponding queries), further strengthening the LLM.

Together, these modules aim to improve LLM performance on relevance modeling in commercial search. According to the paper, CPRM performs well in both offline experiments and online A/B tests, with convincing results compared to strong baselines.
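To make the data side of the first two modules more concrete, here is a minimal Python sketch of how pre-training text might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the field names (`title`, `category`, `description`), the `field: value` template, the separators, and the rule that items are "related" when they share a query are all hypothetical choices, since the summary does not specify any of them.

```python
from collections import defaultdict

# Hypothetical item fields and order; the paper's real schema is not given here.
ITEM_FIELDS = ("title", "category", "description")


def serialize_item(item: dict) -> str:
    """Flatten a multi-field item into one string, skipping empty or missing fields."""
    parts = [f"{name}: {item[name]}" for name in ITEM_FIELDS if item.get(name)]
    return " | ".join(parts)


def dke_sample(query: str, item: dict) -> str:
    """Build one joint query + item text, in the spirit of Domain Knowledge Enhancement."""
    return f"query: {query}\nitem: {serialize_item(item)}"


def icp_sequences(pairs, max_items=3):
    """Build one sequence per query from items that share it (In-Context Pre-training idea).

    Assumption: "related" means "paired with the same query"; other notions of
    relatedness (for example, behavioral signals) would need a different grouping.
    """
    by_query = defaultdict(list)
    for query, item in pairs:
        by_query[query].append(item)

    sequences = []
    for query, items in by_query.items():
        lines = [f"query: {query}"]
        lines += [f"item: {serialize_item(it)}" for it in items[:max_items]]
        sequences.append("\n".join(lines))
    return sequences
```

For example, calling `icp_sequences` on a list of `(query, item)` pairs returns one text sequence per distinct query, each listing up to `max_items` serialized items. The third module (RCD) is not sketched here because it depends on a generator model producing summaries and queries, whose prompts and outputs the summary does not describe.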

CPRM: A LLM-based Continual Pre-training Framework for Relevance Modeling in Commercial Search

SPM: Structured Pretraining and Matching Architectures for Relevance Modeling in Meituan Search

Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting

Large Language Models for Relevance Judgment in Product Search

Multi-Grained Topological Pre-Training of Language Models in Sponsored Search

Pretrained Language Model based Web Search Ranking: From Relevance to Satisfaction

Towards More Relevant Product Search Ranking Via Large Language Models: An Empirical Study

ILCR: Item-based Latent Factors for Sparse Collaborative Retrieval

Explainable LLM-driven Multi-dimensional Distillation for E-Commerce Relevance Learning

QUERT: Continual Pre-training of Language Model for Query Understanding in Travel Domain Search

EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data

Learning to Expand: Reinforced Pseudo-relevance Feedback Selection for Information-seeking Conversations

RLPS: A Reinforcement Learning–Based Framework for Personalized Search

Improving Text Matching in E-Commerce Search with A Rationalizable, Intervenable and Fast Entity-Based Relevance Model

Weakly Supervised Co-Training of Query Rewriting and Semantic Matching for E-Commerce

Aligning Query Representation with Rewritten Query and Relevance Judgments in Conversational Search

Learning a Product Relevance Model from Click-Through Data in E-Commerce

Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher

Deep Bag-of-Words Model: An Efficient and Interpretable Relevance Architecture for Chinese E-Commerce

Tuning Query Reformulator with Fine-Grained Relevance Feedback