WebUltron: an Ultimate Retriever on Webpages under the Model-Centric Paradigm
Yujia Zhou, Jing Yao, Ledell Wu, Zhicheng Dou, Ji-Rong Wen
DOI: https://doi.org/10.1109/tkde.2023.3332858
IF: 9.235
2024-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Document retrieval has been studied for decades within the index-retrieve framework, which has withstood the test of time. However, this framework separates indexing from retrieval, which prevents the two from being optimized end to end. To bridge this divide, we introduce WebUltron, a model-centric indexer for document retrieval that embeds the knowledge of all documents within the model, so that retrieval can be performed end to end. Two primary challenges for such an indexer are how to represent document identifiers (docids) and how to train the model. Existing methods rely on docids that carry little semantic information and are constrained by limited supervised data, which makes them hard to scale to larger datasets. To address this, we design two new types of docids that carry richer semantics and also simplify model inference. We further develop a three-stage training procedure that exploits more knowledge from the corpus and strengthens the association between queries and docids. Experiments on two public datasets show that WebUltron outperforms advanced baselines for document retrieval.
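To make the model-centric idea concrete, the following is a minimal sketch of how a retriever might emit a docid token by token while being restricted to identifiers that exist in the corpus. It is an illustration only, not WebUltron's implementation: the abstract does not specify the docid formats or the decoding procedure, so the prefix trie, the `END` marker, the toy docids, and the word-overlap `toy_score` function (a stand-in for a trained sequence-to-sequence model) are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical marker that closes a docid


def build_trie(docids: Sequence[Sequence[str]]) -> Dict[str, dict]:
    """Build a nested-dict prefix trie over docid token sequences."""
    root: Dict[str, dict] = {}
    for tokens in docids:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(
    query: str,
    trie: Dict[str, dict],
    score: Callable[[str, List[str], str], float],
) -> List[str]:
    """Greedily pick the best-scoring next token among those the trie allows."""
    if not trie:
        raise ValueError("the docid trie is empty")
    node, prefix = trie, []
    while True:
        allowed = sorted(node)  # sorted so that ties break deterministically
        best = max(allowed, key=lambda tok: score(query, prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]


def toy_score(query: str, prefix: List[str], token: str) -> float:
    """Stand-in for a trained model: 1.0 if the token occurs in the query."""
    return 1.0 if token in query.lower().split() else 0.0


# Invented docids, written as token sequences purely for illustration.
DOCIDS = [
    ["example", "com", "retrieval", "dense"],
    ["example", "com", "retrieval", "generative"],
    ["example", "org", "news"],
]

trie = build_trie(DOCIDS)
result = constrained_greedy_decode(
    "generative retrieval at example com", trie, toy_score
)
print(result)
```

Because every step chooses only among children of the current trie node, the decoder can never produce an identifier that is absent from the corpus. In a real system the scoring function would be the model's next-token distribution, and beam search would typically replace the greedy choice to return a ranked list of docids.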