Abstract: Despite the recent advancements in information retrieval (IR), zero-shot IR remains a significant challenge, especially when dealing with new domains, languages, and newly-released use cases that lack historical query traffic from existing users. For such cases, it is common to use query augmentations followed by fine-tuning pre-trained models on the document data paired with synthetic queries. In this work, we propose a novel Universal Document Linking (UDL) algorithm, which links similar documents to enhance synthetic query generation across multiple datasets with different characteristics. UDL leverages entropy for the choice of similarity models and named entity recognition (NER) for the link decision of documents using similarity scores. Our empirical studies demonstrate the effectiveness and universality of the UDL across diverse datasets and IR models, surpassing state-of-the-art methods in zero-shot cases. The developed code for reproducibility is included in https://github.com/eoduself/UDL
What problem does this paper attempt to address?
This paper addresses how to generate effective synthetic queries for zero-shot information retrieval (IR), so that retrieval performance improves in new domains, new languages, and new use cases. Traditional IR methods perform poorly when no historical query data exists for a domain or language, and applying pre-trained dense retrieval models directly also causes a significant drop in performance. To address this, the authors propose a new algorithm named Universal Document Linking (UDL).
### Main Goals of UDL
1. **Link Similar Documents**: Enhance the generation of synthetic queries by linking similar documents, so that these queries can cover the content of multiple documents.
2. **Utilize Entropy and Named Entity Recognition (NER)**: Use entropy to select an appropriate similarity model, and use the results of named entity recognition to decide which documents to link.
3. **Adapt to Multiple Datasets**: Ensure that the UDL algorithm generalizes across datasets with different characteristics, thereby improving performance in zero-shot settings.
### Specific Problem Description
- **New Domains and Languages**: When dealing with new languages or domains, the lack of relevant query data makes it difficult to apply traditional information retrieval methods.
- **Performance Degradation**: Directly applying pre-trained dense retrieval models to zero-shot scenarios leads to significant performance degradation and requires special fine-tuning.
- **Query Expansion**: Existing query expansion methods usually rely on existing queries or documents, and in zero-shot scenarios this dependence becomes infeasible.
### Solution
The UDL algorithm solves the above problems through the following steps:
1. **Select a Similarity Model**: Choose between term frequency-inverse document frequency (TF-IDF) and a pre-trained language model (LM) as the similarity model by calculating entropy values.
2. **Determine Similarity Scores**: Based on the named entity recognition (NER) results and the vocabulary size, decide whether candidate documents should be linked.
3. **Link Documents**: Calculate the cosine similarity between documents, and when the similarity exceeds a set threshold, link these documents.
Through these steps, UDL can generate more relevant and higher-quality synthetic queries, thereby improving the performance of zero-shot information retrieval.
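The linking step can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository linked in the abstract); it only shows the general idea of linking documents whose TF-IDF cosine similarity exceeds a threshold. The function names, the whitespace tokenizer, the smoothed IDF formula, and the threshold of 0.3 are all assumptions made for illustration. The `term_entropy` helper computes a plain Shannon entropy over corpus terms; how UDL actually turns entropy into a model choice, and how NER sets the link decision, is not specified in this summary and is left out.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Naive whitespace tokenizer (assumption); a real pipeline would use a proper one.
    return text.lower().split()


def term_entropy(docs):
    """Shannon entropy (in bits) of the corpus-level term distribution."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors as dicts, using a smoothed IDF (assumption)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            tok: (count / len(toks))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def link_documents(docs, threshold=0.3):
    """Return index pairs (i, j) whose similarity exceeds the threshold.

    The threshold value is hypothetical, not taken from the paper.
    """
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if cosine(vectors[i], vectors[j]) > threshold
    ]


docs = [
    "solar panels convert sunlight into electricity",
    "solar panels turn sunlight into electricity for homes",
    "the recipe needs flour sugar and butter",
]
links = link_documents(docs, threshold=0.3)
print(links)  # the two solar-panel documents are linked: [(0, 1)]
```

In a query generation pipeline, each linked pair could then be passed to the query generator together, so that a single synthetic query can be relevant to both documents.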
### Experimental Verification
The authors conducted experiments with multiple datasets and information retrieval models to verify the effectiveness and generality of UDL. The results show that UDL outperforms existing methods in zero-shot settings, especially in multilingual and cross-domain tasks.
In summary, this paper addresses the challenges of zero-shot information retrieval, in particular query generation for new domains, new languages, and new use cases, and proposes the UDL algorithm as a solution.