Event-driven Real-time Retrieval in Web Search

Nan Yang, Shusen Zhang, Yannan Zhang, Xiaoling Bai, Hualong Deng, Tianhua Zhou, Jin Ma
2023-12-04
Abstract: Information retrieval in real-time search presents unique challenges distinct from those encountered in classical web search. These challenges are particularly pronounced due to the rapid change of user search intent, which is influenced by the occurrence and evolution of breaking news events, such as earthquakes, elections, and wars. Previous dense retrieval methods, which primarily focused on static semantic representation, lack the capacity to capture immediate search intent, leading to inferior performance in retrieving the most recent event-related documents in time-sensitive scenarios. To address this issue, this paper expands the query with event information that represents real-time search intent. The event information is then integrated with the query through a cross-attention mechanism, resulting in a time-context query representation. We further enhance the model's capacity for event representation through multi-task training. Since publicly available datasets such as MS-MARCO do not contain any event information on the query side and have few time-sensitive queries, we design an automatic data collection and annotation pipeline to address this issue, which includes ModelZoo-based Coarse Annotation and LLM-driven Fine Annotation processes. In addition, we share training tricks such as two-stage training and hard negative sampling. Finally, we conduct a set of offline experiments on a million-scale production dataset to evaluate our approach and run an A/B test in a real online system to verify its performance. Extensive experimental results demonstrate that our proposed approach significantly outperforms existing state-of-the-art baseline methods.
Information Retrieval, Computation and Language
What problem does this paper attempt to address?
The paper addresses the challenges specific to information retrieval in real-time search, in particular the rapid change of user search intent as breaking news events occur and develop. Previous dense retrieval methods focus mainly on static semantic representations, so they fail to capture immediate search intent and perform poorly at retrieving documents about the latest events in time-sensitive scenarios. To tackle this problem, the paper proposes a new approach, Event-driven Real-time Retrieval (ERR). ERR addresses the problem in three ways:

1. **Query Expansion and Event Integration**: The query is expanded with event information, which is integrated with the query through a cross-attention mechanism to produce a time-context query representation. This describes the latest query intent more accurately, especially for short-tail and long-tail queries.
2. **Multi-task Training**: Multi-task training strengthens the model's capacity for event representation, so that the model pays more attention to event information.
3. **Data Collection and Annotation**: Because public datasets such as MS-MARCO contain no event information on the query side and few time-sensitive queries, the paper designs an automatic data collection and annotation pipeline consisting of ModelZoo-based Coarse Annotation and LLM-driven Fine Annotation. The pipeline yields annotated data quickly and at low cost, which suits time-sensitive search scenarios.

With these methods, the paper aims to improve real-time retrieval, especially for queries related to breaking news events. Offline experiments on a million-scale production dataset and an online A/B test show that ERR significantly outperforms existing state-of-the-art baseline methods.
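
To make the query-event integration concrete, here is a minimal sketch of cross-attention between query tokens and event tokens, written in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`cross_attention`, `encode_query`, `score`), the absence of learned projection matrices, the residual connection, mean pooling, and the toy 3-dimensional embeddings are all hypothetical choices made for this example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, event_tokens):
    """Each query token attends over the event tokens.

    Assumption: single head, no learned Q/K/V projections. The event
    context is added back to the query token as a residual.
    """
    dim = len(query_tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for q in query_tokens:
        weights = softmax([dot(q, e) / scale for e in event_tokens])
        context = [
            sum(w * e[i] for w, e in zip(weights, event_tokens))
            for i in range(dim)
        ]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


def mean_pool(tokens):
    """Average token vectors into one fixed-size vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def encode_query(query_tokens, event_tokens):
    """Event-aware query vector (hypothetical stand-in for the query tower)."""
    return mean_pool(cross_attention(query_tokens, event_tokens))


def score(query_vec, doc_vec):
    """Dense retrieval relevance as an inner product."""
    return dot(query_vec, doc_vec)


# Toy data: the query alone is orthogonal to both documents, so a static
# representation cannot rank them. The event points toward the recent document.
query_tokens = [[1.0, 0.0, 0.0]]
event_tokens = [[0.0, 1.0, 0.0]]
doc_recent_event = [0.0, 1.0, 0.0]
doc_unrelated = [0.0, 0.0, 1.0]

static_vec = mean_pool(query_tokens)
event_vec = encode_query(query_tokens, event_tokens)
```

In this toy setup the static query vector scores both documents equally, while the event-aware vector ranks the event-related document higher. A production model would presumably use learned projections and a trained encoder; the sketch only shows how event information can shift a query representation.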