Abstract: Despite the recent advancements in information retrieval (IR), zero-shot IR remains a significant challenge, especially when dealing with new domains, languages, and newly-released use cases that lack historical query traffic from existing users. For such cases, it is common to use query augmentations followed by fine-tuning pre-trained models on the document data paired with synthetic queries. In this work, we propose a novel Universal Document Linking (UDL) algorithm, which links similar documents to enhance synthetic query generation across multiple datasets with different characteristics. UDL leverages entropy for the choice of similarity models and named entity recognition (NER) for the link decision of documents using similarity scores. Our empirical studies demonstrate the effectiveness and universality of the UDL across diverse datasets and IR models, surpassing state-of-the-art methods in zero-shot cases. The developed code for reproducibility is included in https://github.com/eoduself/UDL.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to effectively generate synthetic queries in zero-shot information retrieval (IR) to improve retrieval performance in new domains, new languages, and new use cases. Specifically, traditional information retrieval methods perform poorly when facing new domains or languages without historical query data, and directly using pre-trained dense retrieval models will also lead to a significant decline in performance. To solve this problem, the author proposes a new algorithm named Universal Document Linking (UDL).

### Main Goals of UDL

1. **Link Similar Documents**: Enhance the generation of synthetic queries by linking similar documents, so that these queries can cover the content of multiple documents.
2. **Utilize Entropy and Named Entity Recognition (NER)**: Select an appropriate similarity model and determine the linking relationship between documents based on the results of named entity recognition.
3. **Adapt to Multiple Datasets**: Ensure that the UDL algorithm can be generalized in datasets with different characteristics, thereby improving its performance in zero-shot situations.

### Specific Problem Description

- **New Domains and Languages**: When dealing with new languages or domains, the lack of relevant query data makes it difficult to apply traditional information retrieval methods.
- **Performance Degradation**: Directly applying pre-trained dense retrieval models to zero-shot scenarios will lead to a significant performance degradation and requires special fine-tuning.
- **Query Expansion**: Existing query expansion methods usually rely on existing queries or documents, and in zero-shot scenarios, this dependence becomes infeasible.

### Solution

The UDL algorithm solves the above problems through the following steps:

1. **Select a Similarity Model**: Based on term frequency-inverse document frequency (TF-IDF) and pre-trained language models (LM), select an appropriate similarity model by calculating entropy values.
2. **Determine Similarity Scores**: According to the results of named entity recognition (NER), combined with the size of the vocabulary, decide whether to link candidate documents.
3. **Link Documents**: Calculate the cosine similarity between documents, and when the similarity exceeds a set threshold, link these documents.

Through these steps, UDL can generate more relevant and high-quality synthetic queries, thereby improving the performance of zero-shot information retrieval.

### Experimental Verification

The author conducted experiments with multiple datasets and information retrieval models to verify the effectiveness and generality of UDL. The results show that UDL outperforms existing methods in zero-shot situations, especially in multilingual and cross-domain tasks.

In summary, this paper aims to solve the challenges in zero-shot information retrieval, especially the query generation problem in new domains, new languages, and new use cases, and provides an effective solution by proposing the UDL algorithm.
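
To make the document-linking step in the solution above more concrete, here is a minimal sketch of linking documents whose TF-IDF cosine similarity exceeds a threshold. It is an illustration only and not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the function names (`tfidf_vectors`, `cosine`, `link_documents`), the example documents, and the threshold value of 0.5 are all assumptions made for this sketch. The entropy-based choice of similarity model and the NER-based link decision are not reproduced here, because this summary does not specify their rules.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant) keeps every weight positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_documents(docs, threshold=0.5):
    """Return index pairs (i, j), i < j, whose similarity exceeds the threshold."""
    vectors = tfidf_vectors(docs)
    links = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) > threshold:
                links.append((i, j))
    return links


# Hypothetical toy corpus: the first two documents share most of their terms.
docs = [
    "zero shot retrieval with synthetic queries",
    "synthetic queries for zero shot retrieval",
    "named entity recognition for biomedical text",
]
links = link_documents(docs, threshold=0.5)
print(links)
```

In this toy corpus only the first two documents are linked, since they share five of their six terms, while the third shares at most one common word with the others. Linked pairs could then be passed together to a query generator so that one synthetic query covers several documents, which is the goal the summary attributes to UDL.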

Link, Synthesize, Retrieve: Universal Document Linking for Zero-Shot Information Retrieval
