Abstract: Text-to-video retrieval answers user queries through search by concepts and embeddings. Limited by the size of the concept bank and the amount of training data, answering queries in the wild is not always effective due to the out-of-vocabulary problem. Furthermore, neither concept-based nor embedding-based search can perform reasoning to consolidate the search results for complex queries mixed with logical and spatial constraints. To address these problems, we leverage large language models (LLMs) to paraphrase the query by text-to-text (T2T), text-to-image (T2I), and image-to-text (I2T) transformations. These transformations rephrase abstract concepts into simple words to address the out-of-vocabulary problem. Furthermore, the complex relationship in a query can be decoupled into simpler sub-queries, yielding better retrieval performance when fusing the search results of these sub-queries. To address the LLM hallucination problem, this paper also proposes a novel consistency-based verification strategy to filter the paraphrased queries that are factually incorrect. Extensive experiments are conducted for ad-hoc video search and known-item search on the TRECVid datasets. We provide empirical insights into how traditionally difficult-to-answer queries can be resolved by query paraphrasing.

What problem does this paper attempt to address?

The paper addresses three key issues in text-to-video retrieval:

1. **Out-of-Vocabulary (OOV) Problem**: The concept bank is limited in size and the training data is limited, so query terms that fall outside the concept bank are handled poorly.
2. **Complex Query Problem**: Queries that combine logical constraints (such as "AND", "OR", "NOT") and spatial constraints (such as "in front" or "located at the bottom corner of the scene") are difficult to handle.
3. **Representation Robustness Problem**: Text-to-video representation learning is usually trained in a one-to-one matching manner, so retrieval results are very sensitive to how the query is worded. Even minor changes in wording can lead to significantly different results.

To address these issues, the paper proposes a query rewriting framework based on a Large Language Model (LLM), which reformulates user queries through the following transformations:

- **Text-to-Text (T2T)**: ChatGPT4 rewrites the user query in various ways to create query perturbations. This improves the robustness of text-to-video matching and may solve the OOV problem by replacing out-of-vocabulary words.
- **Text-to-Image (T2I)**: A Stable Diffusion model generates various images related to the query, converting the text query into visual queries. Videos can then be retrieved directly by computing the visual similarity between the generated images and the videos, bypassing the OOV problem and the complex logical and spatial constraints.
- **Image-to-Text (I2T)**: An image captioning model describes the images generated by T2I, and the resulting captions serve as queries that further decompose complex queries. Even if the T2I-generated images contain distortion or artifacts, effective query expressions can still be obtained this way.

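
Each transformation yields its own ranked list of videos, and the abstract notes that retrieval improves when these lists are fused. The summary does not say which fusion rule the paper uses, so the sketch below substitutes reciprocal rank fusion (RRF), a common late-fusion technique. The function name, the constant `k=60`, and the video IDs are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of video IDs into a single ranking.

    Each video receives sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default for RRF and is
    an assumption here, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, video_id in enumerate(ranking, start=1):
            scores[video_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda vid: (-scores[vid], vid))


# Hypothetical rankings for the original query and two rewritten queries.
original_ranking = ["v3", "v1", "v7"]
t2t_ranking = ["v1", "v3", "v9"]
i2t_ranking = ["v1", "v7", "v3"]

fused = reciprocal_rank_fusion([original_ranking, t2t_ranking, i2t_ranking])
print(fused)
```

RRF uses only rank positions, which avoids having to calibrate similarity scores that come from different query modalities (text embeddings for T2T and I2T, image embeddings for T2I). A weighted score average would be an equally plausible choice if the scores were comparable.
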
Additionally, to keep the rewritten queries consistent with the original query, the paper proposes a consistency-based verification mechanism. It filters out factually incorrect rewrites by checking each rewritten query against question-answer (QA) pairs generated from the original query. Finally, the paper reports extensive experiments on ad-hoc video search and known-item search that verify the effectiveness of the framework and demonstrate its performance advantages on the TRECVid datasets.
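
The verification step can be pictured as a filter over candidate rewrites. The following is a minimal sketch under stated assumptions: QA pairs are taken as already generated from the original query, and an `answer` callable (in practice an LLM call) answers each question from a rewritten query alone. The names `filter_consistent`, `toy_answer`, and `min_agreement`, as well as the keyword-matching stand-in, are hypothetical and only illustrate the control flow, not the paper's actual procedure.

```python
from typing import Callable, List, Tuple

# (question, answer expected according to the original query)
QAPair = Tuple[str, str]


def filter_consistent(
    rewrites: List[str],
    qa_pairs: List[QAPair],
    answer: Callable[[str, str], str],
    min_agreement: float = 1.0,
) -> List[str]:
    """Keep rewrites whose answers agree with the original query's QA pairs.

    `answer(text, question)` stands in for an LLM call (an assumption).
    `min_agreement` is the fraction of QA pairs that must match.
    """
    if not qa_pairs:
        # Nothing to check against, so no rewrite can be rejected.
        return list(rewrites)
    kept = []
    for rewrite in rewrites:
        matches = sum(
            answer(rewrite, question).strip().lower() == expected.strip().lower()
            for question, expected in qa_pairs
        )
        if matches / len(qa_pairs) >= min_agreement:
            kept.append(rewrite)
    return kept


def toy_answer(text: str, question: str) -> str:
    """Toy stand-in for an LLM: 'yes' if the question's last word is in the text."""
    keyword = question.rstrip("?").split()[-1].lower()
    return "yes" if keyword in text.lower().split() else "no"


# Hypothetical QA pairs derived from the original query "a dog on a beach".
qa_pairs = [
    ("Does the query mention a dog?", "yes"),
    ("Does the query mention a beach?", "yes"),
]
rewrites = ["a dog running on a beach", "a cat sitting on a beach"]

kept = filter_consistent(rewrites, qa_pairs, toy_answer)
print(kept)
```

Here the second rewrite is dropped because it replaces the dog with a cat, which contradicts the first QA pair. Lowering `min_agreement` would tolerate partial disagreement at the cost of letting more hallucinated rewrites through.
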

LLM-based query paraphrasing for video search
