Abstract: Generative retrieval has recently emerged as a promising alternative to traditional retrieval paradigms. It assigns each document a unique identifier, known as the DocID, and employs a generative model to directly generate the relevant DocID for the input query. A common choice for the DocID is one or several natural language sequences, e.g., the title, synthetic queries, or n-grams, so that the pre-trained knowledge of the generative model can be effectively utilized. However, a sequence is generated token by token, and at each decoding step only the most likely candidates are kept while the rest are pruned; retrieval therefore fails if any token within the relevant DocID is falsely pruned. Worse, during decoding the model can only perceive the preceding tokens of the DocID and is blind to the subsequent ones, which makes it prone to such errors. To address this problem, we present a novel framework for generative retrieval, dubbed Term-Set Generation (TSGen). Instead of sequences, we use a set of terms as the DocID. The terms are selected based on weights learned from relevance signals, so that they concisely summarize the document's semantics and distinguish it from others. On top of the term-set DocID, we propose a permutation-invariant decoding algorithm, with which the term set can be generated in any permutation yet always leads to the corresponding document. Remarkably, TSGen perceives all valid terms, rather than only the preceding ones, at each decoding step. Given the constant decoding space, this broader perspective lets it make more reliable decisions. TSGen is also resilient to errors: the relevant DocID will not be falsely pruned as long as the decoded term belongs to it. Moreover, TSGen can explore the optimal decoding permutation of the term set on its own, which further improves the likelihood of generating the relevant DocID. Lastly, we design an iterative optimization procedure to incentivize the model to generate the relevant term set in its favorable permutation. We conduct extensive experiments on popular benchmarks of generative retrieval, which validate the effectiveness, generalizability, scalability, and efficiency of TSGen.
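To make the permutation-invariant decoding idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy corpus in which every DocID is a set of exactly three terms, and it replaces the generative model with a hypothetical per-term scoring function. The names `DOCIDS`, `QUERY_WEIGHTS`, `toy_score`, `valid_next_terms`, and `decode` are all illustrative. At each step, a term is valid if it extends the already-decoded terms toward at least one document's term set, regardless of the order in which those terms were produced.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy corpus: each document is identified by a set of terms.
DOCIDS: Dict[str, FrozenSet[str]] = {
    "d1": frozenset({"neural", "retrieval", "index"}),
    "d2": frozenset({"neural", "translation", "attention"}),
    "d3": frozenset({"sparse", "retrieval", "bm25"}),
}

# Hypothetical stand-in for the model's term likelihoods given one query.
QUERY_WEIGHTS: Dict[str, float] = {
    "retrieval": 2.0, "bm25": 1.5, "sparse": 1.0, "neural": 0.5,
}


def toy_score(decoded: Tuple[str, ...], term: str) -> float:
    """Assumed scorer: ignores the decoded prefix and looks up a fixed weight."""
    return QUERY_WEIGHTS.get(term, 0.0)


def valid_next_terms(prefix: FrozenSet[str]) -> Set[str]:
    """Terms that extend `prefix` toward at least one document's term set."""
    valid: Set[str] = set()
    for terms in DOCIDS.values():
        if prefix <= terms:          # document is still consistent with the prefix
            valid |= terms - prefix  # any of its remaining terms may come next
    return valid


def decode(
    score: Callable[[Tuple[str, ...], str], float],
    beam_size: int = 2,
    length: int = 3,
) -> List[str]:
    """Beam search over terms, constrained to valid term sets in any order."""
    beams: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]
    for _ in range(length):
        candidates: List[Tuple[float, Tuple[str, ...]]] = []
        for total, seq in beams:
            for term in valid_next_terms(frozenset(seq)):
                candidates.append((total + score(seq, term), seq + (term,)))
        # Highest score first; ties broken by term order for determinism.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        beams = candidates[:beam_size]
    # Map each completed term set back to its document, ignoring permutation.
    results: List[str] = []
    for _, seq in beams:
        for doc_id, terms in DOCIDS.items():
            if terms == frozenset(seq) and doc_id not in results:
                results.append(doc_id)
    return results


print(decode(toy_score))  # prints ['d3']
```

In this sketch, both surviving beams are different permutations of the same term set and resolve to the same document, which illustrates why decoding order does not affect the retrieved DocID. A real system would score terms with the generative model conditioned on the query and the decoded terms, and would need an efficient index in place of the linear scan over `DOCIDS`.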
