Description-Based Text Similarity

Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg

2024-07-24

Abstract: Identifying texts with a given semantics is central for many information seeking scenarios. Similarity search over vector embeddings appears to be central to this ability, yet the similarity reflected in current text embeddings is corpus-driven, and is inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text? We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of *description based similarity*. We demonstrate the inadequacy of current text embeddings and propose an alternative model that significantly improves when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced through prompting an LLM, demonstrating how data from LLMs can be used for creating new capabilities not immediately possible using the original model.

Computation and Language, Information Retrieval, Machine Learning

What problem does this paper attempt to address?

The paper aims to address the inconsistency and sub-optimality of current text embeddings for information retrieval, particularly when searching for texts based on abstract descriptions of their content. The authors define a new concept called "description-based similarity," which captures the relationship between abstract descriptions and the concrete texts that instantiate these descriptions.

The main problem addressed by the paper is that existing text encoders and retrieval systems struggle to retrieve texts that are specific instantiations of abstract descriptions. Current approaches, which rely on dense encoders and similarity search over vector embeddings, are driven by corpora and often mix various types of similarity, making them sub-optimal for targeted information-seeking scenarios.

The authors propose a novel model that is specifically designed to improve retrieval based on abstract descriptions. They generate a dataset of <description, text> pairs using a large language model (LLM) and then train an encoder that learns to represent items so that abstract descriptions and the texts they describe are close in the embedding space.

Key contributions include:

1. **Definition of Description-Based Similarity:** The authors define the abstract description relation and differentiate it from other semantic relations like paraphrasing, entailment, and summarization.
2. **Dataset Generation:** They use an LLM to generate valid and misleading descriptions
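The idea of training an encoder so that descriptions and the texts they describe end up close in the embedding space, and then retrieving with nearest neighbor search, can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the triplet margin objective, the margin value, the use of cosine similarity, and the function names `triplet_margin_loss` and `nearest_texts` are hypothetical and are not confirmed by the summary above as the authors' exact training setup. The embeddings are assumed to come from some encoder that is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def triplet_margin_loss(text_emb, valid_desc_emb, misleading_desc_emb, margin=0.5):
    """Hypothetical training objective (an assumption, not the paper's stated loss).

    Pushes each text to be more cosine-similar to its valid description
    than to a misleading description, by at least `margin`.
    All inputs have shape (batch, dim).
    """
    t = l2_normalize(text_emb)
    p = l2_normalize(valid_desc_emb)
    n = l2_normalize(misleading_desc_emb)
    pos_sim = np.sum(t * p, axis=-1)  # cosine(text, valid description)
    neg_sim = np.sum(t * n, axis=-1)  # cosine(text, misleading description)
    return float(np.mean(np.maximum(0.0, margin - pos_sim + neg_sim)))


def nearest_texts(query_desc_emb, corpus_emb, k=3):
    """Return indices and cosine scores of the k corpus texts closest to a description.

    query_desc_emb has shape (dim,), corpus_emb has shape (num_texts, dim).
    """
    q = l2_normalize(query_desc_emb)
    c = l2_normalize(corpus_emb)
    sims = c @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]


if __name__ == "__main__":
    # Toy 2-d embeddings standing in for encoder outputs.
    corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    query = np.array([1.0, 0.1])
    indices, scores = nearest_texts(query, corpus, k=2)
    print(indices, scores)
```

In this sketch the loss is zero once every text is sufficiently closer to its valid description than to the misleading one, so only confusable triplets contribute to the objective. At query time a description is embedded and compared against precomputed text embeddings, which is the standard nearest neighbor setting the abstract refers to.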


Bridging the Semantic Latent Space Between Brain and Machine: Similarity is All You Need

Statutes Recommendation Based on Text Similarity

Rethinking Similarity Search: Embracing Smarter Mechanisms over Smarter Data

Evolution of Semantic Similarity -- A Survey

A Comparative Study of Sentence Embedding Models for Assessing Semantic Variation

Estimating Text Similarity based on Semantic Concept Embeddings

Graph-Based Text Similarity Measurement by Exploiting Wikipedia As Background Knowledge

Interactive optimization of embedding-based text similarity calculations

A new similarity measure for vector space models in text classification and information retrieval

Enhanced Semantic Similarity Learning Framework for Image-Text Matching

Correlation Coefficients and Semantic Textual Similarity

Similarity of Objects and the Meaning of Words

Determining Semantic Textual Similarity using Natural Deduction Proofs

A Novel Discrimination Structure for Assessing Text Semantic Similarity

A Proposal for Linguistic Similarity Datasets Based on Commonality Lists

Measurement of Text Similarity: A Survey

A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports

Learning Element Similarity Matrix for Semi-Structured Document Analysis

Interpreting BERT-based Text Similarity via Activation and Saliency Maps

A survey on the techniques, applications, and performance of short text semantic similarity