Description-Based Text Similarity

Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg
2024-07-24
Abstract: Identifying texts with a given semantics is central for many information seeking scenarios. Similarity search over vector embeddings appears to be central to this ability, yet the similarity reflected in current text embeddings is corpus-driven, and is inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text? We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of description-based similarity. We demonstrate the inadequacy of current text embeddings and propose an alternative model that significantly improves when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced through prompting an LLM, demonstrating how data from LLMs can be used for creating new capabilities not immediately possible using the original model.
Computation and Language, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
The paper aims to address the inconsistency and sub-optimality of current text embeddings for information retrieval, particularly when searching for texts based on abstract descriptions of their content. The authors define a new concept called "description-based similarity," which captures the relationship between abstract descriptions and the concrete texts that instantiate these descriptions.

The main problem addressed by the paper is that existing text encoders and retrieval systems struggle to retrieve texts that are specific instantiations of abstract descriptions. Current approaches, which rely on dense encoders and similarity search over vector embeddings, are driven by corpora and often mix various types of similarity, making them sub-optimal for targeted information-seeking scenarios.

The authors propose a novel model that is specifically designed to improve retrieval based on abstract descriptions. They generate a dataset of <description, text> pairs using a large language model (LLM) and then train an encoder that learns to represent items so that abstract descriptions and the texts they describe are close in the embedding space.

Key contributions include:

1. **Definition of Description-Based Similarity:** The authors define the abstract description relation and differentiate it from other semantic relations like paraphrasing, entailment, and summarization.
2. **Dataset Generation:** They use an LLM to generate valid and misleading descriptions