LLM-based query paraphrasing for video search

Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan, Sheng-Hua Zhong
2024-07-17
Abstract: Text-to-video retrieval answers user queries through search by concepts and embeddings. Limited by the size of the concept bank and the amount of training data, answering queries in the wild is not always effective due to the out-of-vocabulary problem. Furthermore, neither concept-based nor embedding-based search can perform reasoning to consolidate the search results for complex queries mixed with logical and spatial constraints. To address these problems, we leverage large language models (LLMs) to paraphrase the query by text-to-text (T2T), text-to-image (T2I), and image-to-text (I2T) transformations. These transformations rephrase abstract concepts into simple words to address the out-of-vocabulary problem. Furthermore, the complex relationship in a query can be decoupled into simpler sub-queries, yielding better retrieval performance when fusing the search results of these sub-queries. To address the LLM hallucination problem, this paper also proposes a novel consistency-based verification strategy to filter the paraphrased queries that are factually incorrect. Extensive experiments are conducted for ad-hoc video search and known-item search on the TRECVid datasets. We provide empirical insights into how traditionally difficult-to-answer queries can be resolved by query paraphrasing.
Multimedia
What problem does this paper attempt to address?
The paper addresses three key issues in text-to-video retrieval:

1. **Out-of-Vocabulary (OOV) Problem**: The concept bank is limited in size and the training data is limited in amount, so query terms that fall outside the concept bank are handled poorly.
2. **Complex Query Problem**: Queries that combine logical constraints (such as "AND", "OR", "NOT") with spatial constraints (such as "in front" or "located at the bottom corner of the scene") are difficult to answer.
3. **Representation Robustness Problem**: Text-to-video representation learning is usually trained with one-to-one matching, so retrieval results are very sensitive to how the query is worded. Even minor changes in wording can produce significantly different results.

To address these issues, the paper proposes a query paraphrasing framework based on a Large Language Model (LLM), which reformulates user queries through the following transformations:

- **Text-to-Text (T2T)**: ChatGPT4 rewrites the given user query in various ways to create query perturbations. This improves the robustness of text-to-video matching and may solve the OOV problem by replacing out-of-vocabulary words.
- **Text-to-Image (T2I)**: A Stable Diffusion model generates various images related to the query, converting the text query into visual queries. Retrieval can then be performed directly by computing the visual similarity between the generated images and the videos, bypassing both the OOV problem and the complex logical and spatial constraints.
- **Image-to-Text (I2T)**: An image captioning model describes the images generated by T2I, and the resulting caption queries further decompose complex queries. Even if the T2I-generated images contain distortions or artifacts, effective query expressions can still be obtained in this way.

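Each transformation produces additional queries whose search results must be combined with those of the original query. This summary does not state the paper's fusion rule, so the following is only a minimal sketch under an assumed, generic late fusion: each query's scores are min-max normalized and then averaged per video. The function names (`min_max_normalize`, `fuse_results`), the video identifiers, and the scores are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one query's scores to [0, 1] so different result lists are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All videos tie for this query; treat them as equally relevant.
        return {vid: 1.0 for vid in scores}
    return {vid: (s - lo) / (hi - lo) for vid, s in scores.items()}


def fuse_results(per_query_scores: List[Dict[str, float]]) -> List[str]:
    """Average normalized scores across queries and return video ids, best first.

    A video missing from one query's result list contributes 0 for that query.
    Assumed fusion rule for illustration, not necessarily the paper's.
    """
    if not per_query_scores:
        return []
    totals: Dict[str, float] = defaultdict(float)
    for scores in per_query_scores:
        for vid, s in min_max_normalize(scores).items():
            totals[vid] += s
    n = len(per_query_scores)
    # Sort by descending fused score; break ties by video id for determinism.
    return sorted(totals, key=lambda vid: (-totals[vid] / n, vid))


# Hypothetical similarity scores for the original query and two paraphrases.
original = {"v1": 0.9, "v2": 0.4, "v3": 0.1}
t2t_paraphrase = {"v2": 0.8, "v3": 0.7, "v1": 0.2}
i2t_caption = {"v2": 0.6, "v4": 0.5, "v1": 0.1}

ranking = fuse_results([original, t2t_paraphrase, i2t_caption])
print(ranking)
```

In this toy example the video `v2` moves to the top because it scores well under every paraphrase, whereas `v1` is ranked first only by the original wording. This is the kind of wording sensitivity that query perturbation is meant to smooth out.
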
Additionally, to keep the paraphrased queries consistent with the original query, the paper proposes a consistency-based verification mechanism. It generates question-answer (QA) pairs from the original query and filters out any paraphrased query that is factually inconsistent with those pairs, which mitigates LLM hallucination. Finally, the paper reports extensive experiments on ad-hoc video search and known-item search on the TRECVid datasets to verify the effectiveness of the proposed framework.
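
The consistency-based verification described above can be illustrated with a minimal sketch. It assumes that QA pairs have already been derived from the original query and that some answering function can answer a question given only a paraphrased query. In practice that function would be an LLM call; here `toy_answer_fn` is a hypothetical keyword-matching stand-in so that the example runs on its own. The names `is_consistent`, `filter_paraphrases`, and `min_agreement` are illustrative and do not come from the paper.

```python
from typing import Callable, List, Tuple

# (question, answer expected according to the original query)
QAPair = Tuple[str, str]
AnswerFn = Callable[[str, str], str]  # (paraphrase, question) -> answer


def is_consistent(paraphrase: str, qa_pairs: List[QAPair],
                  answer_fn: AnswerFn, min_agreement: float = 1.0) -> bool:
    """Return True if enough answers derived from the paraphrase match the expected ones."""
    if not qa_pairs:
        return True  # nothing to verify against
    matches = sum(
        answer_fn(paraphrase, question).strip().lower() == expected.strip().lower()
        for question, expected in qa_pairs
    )
    return matches / len(qa_pairs) >= min_agreement


def filter_paraphrases(paraphrases: List[str], qa_pairs: List[QAPair],
                       answer_fn: AnswerFn, min_agreement: float = 1.0) -> List[str]:
    """Keep only the paraphrases that pass the consistency check."""
    return [p for p in paraphrases
            if is_consistent(p, qa_pairs, answer_fn, min_agreement)]


def toy_answer_fn(text: str, question: str) -> str:
    """Hypothetical stand-in for an LLM: answers 'yes' if the question's last word occurs in the text."""
    keyword = question.rstrip("?").split()[-1].lower()
    return "yes" if keyword in text.lower().split() else "no"


# Hypothetical QA pairs for the original query "a dog on a beach".
qa_pairs = [
    ("Does the scene contain a dog?", "yes"),
    ("Does the scene contain a beach?", "yes"),
]
paraphrases = [
    "a dog running on a beach",
    "a cat sitting on a beach",   # hallucinated subject
    "a dog playing in a park",    # hallucinated location
]

kept = filter_paraphrases(paraphrases, qa_pairs, toy_answer_fn)
print(kept)
```

With the default `min_agreement=1.0`, a paraphrase must agree with every QA pair, so only the first candidate survives. Lowering the threshold trades precision for a larger pool of paraphrases.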