Towards Retrieval Augmented Generation over Large Video Libraries

Yannis Tevissen, Khalil Guetari, Frédéric Petitpont
2024-06-21
Abstract: Video content creators need efficient tools to repurpose content, a task that often requires complex manual or automated searches. Crafting a new video from large video libraries remains a challenge. In this paper we introduce the task of Video Library Question Answering (VLQA) through an interoperable architecture that applies Retrieval Augmented Generation (RAG) to video libraries. We propose a system that uses large language models (LLMs) to generate search queries, retrieving relevant video moments indexed by speech and visual metadata. An answer generation module then integrates user queries with this metadata to produce responses with specific video timestamps. This approach shows promise in multimedia content retrieval and AI-assisted video content creation.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the inefficiency of reusing content from large video libraries. Video content creators struggle to find suitable material for new videos, and doing so often requires complex manual or automated searches. The paper introduces a task called Video Library Question Answering (VLQA) and applies retrieval-augmented generation (RAG) to video libraries through an interoperable architecture.

The proposed architecture has two parts: a retrieval module and an answer generation module. The retrieval module uses large language models (LLMs) to generate search queries, which retrieve relevant video moments indexed by speech and visual metadata. The answer generation module then combines the user query with this metadata to produce a response that includes specific video timestamps. The approach aims to make multimedia content retrieval more efficient and to support AI-assisted video content creation. Related work mainly covers video-text retrieval and RAG techniques; applying RAG to multimedia databases, and to video libraries in particular, is more challenging.

Experimental results show that the approach finds relevant video clips effectively and suits projects that need footage of real events, such as news reports and documentaries. The paper discusses its advantages: it can retrieve specific moments even without audio, it is fast, and it is highly interoperable. It also notes limitations, such as the reliance on carefully selected metadata indexes and the inability to perform certain specific analyses. Future directions include creating benchmarks to evaluate VLQA and adding a multimodal re-ranking module to the architecture.
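
To make the two-module flow described above more concrete, here is a minimal Python sketch. It is not the authors' implementation. Every name in it (Moment, generate_search_queries, retrieve, generate_answer, answer_question) is hypothetical, the LLM calls are replaced by simple keyword logic, and the index is an in-memory list rather than a real search engine.

```python
from dataclasses import dataclass
import re


@dataclass
class Moment:
    """One indexed video moment (hypothetical schema, not from the paper)."""
    video_id: str
    start_s: float
    end_s: float
    speech: str   # transcript text for this moment
    visual: str   # visual description or tags for this moment


# Assumption: a tiny stopword list stands in for LLM query rewriting.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "with",
             "find", "show", "me", "clips", "where", "is", "are"}


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def generate_search_queries(question):
    """Stand-in for the LLM query generator: keep only content words."""
    keywords = [t for t in tokenize(question) if t not in STOPWORDS]
    return [" ".join(keywords)] if keywords else []


def retrieve(queries, index, top_k=2):
    """Rank moments by keyword overlap with their speech and visual metadata."""
    scored = []
    for moment in index:
        doc = set(tokenize(moment.speech + " " + moment.visual))
        score = max((len(doc & set(tokenize(q))) for q in queries), default=0)
        if score > 0:
            scored.append((score, moment))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [moment for _, moment in scored[:top_k]]


def format_timestamp(seconds):
    """Render a time offset in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def generate_answer(question, moments):
    """Stand-in for the answer generation module: list moments with timestamps."""
    if not moments:
        return "No relevant moments found."
    lines = [f"Moments relevant to: {question}"]
    for m in moments:
        span = f"{format_timestamp(m.start_s)}-{format_timestamp(m.end_s)}"
        lines.append(f"- {m.video_id} [{span}]: {m.visual}")
    return "\n".join(lines)


def answer_question(question, index):
    """Full pipeline: question -> search queries -> moments -> timestamped answer."""
    queries = generate_search_queries(question)
    return generate_answer(question, retrieve(queries, index))
```

Calling answer_question with a question and a list of Moment objects returns a short text answer that cites each matching video with its time span. In a real system, both stand-in functions would prompt an LLM, and retrieve would query a metadata search index.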