EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval
Thomas Hummel, Shyamgopal Karthik, Mariana-Iuliana Georgescu, Zeynep Akata
2024-07-24
Abstract: In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR. Our code and benchmark are freely available at https://github.com/ExplainableML/EgoCVR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses **the lack of high-quality, fine-grained video understanding in the Composed Video Retrieval (CVR) task**. In CVR, the model receives a reference video and a text query describing how that video should be modified, and must retrieve from a video database the video that best matches the modified content.
### Background of the Paper and Problem Description
1. **Limitations of Existing Methods**:
- Existing CVR frameworks perform poorly on tasks that require fine-grained temporal understanding.
- Previous benchmark datasets (such as WebVid-CoVR) mainly focus on simple modifications such as the addition/removal of colors, shapes, or objects, while ignoring complex action changes.
2. **Research Motivations**:
- Existing CVR models do not effectively use the temporal information in videos, which leads to poor performance on tasks involving subtle action changes.
- A higher-quality and more challenging benchmark dataset is required to evaluate and improve CVR models.
### Main Contributions
1. **Proposing the EgoCVR Benchmark Dataset**:
- It contains 2,295 queries, and each query includes a reference video clip and a text instruction that describes how to modify this clip.
- The dataset is sourced from the Ego4D dataset, with an emphasis on subtle action changes that require strong temporal understanding capabilities.
2. **Evaluating Existing Models**:
- Multiple vision-language models (such as CLIP, BLIP, and LanguageBind) were evaluated on the EgoCVR benchmark; even after fine-tuning, their performance on EgoCVR remains unsatisfactory.
3. **Proposing the Training-free Method TFR-CVR**:
- TFR-CVR combines visual filtering with a re-ranking strategy based on generated target captions, which significantly improves performance on the CVR task.
- TFR-CVR achieved the best results on the EgoCVR benchmark, particularly in the global retrieval setting, where it reaches an R@1 of 14.1%.
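The R@1 figure above is a Recall@K score. As a minimal sketch, assuming the standard definition in which a query counts as a hit when any ground-truth target appears among its top K retrieved videos (the summary does not spell out the exact protocol), the metric could be computed as follows. The function name `recall_at_k` and the toy rankings are hypothetical and are not taken from the EgoCVR code.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose top-k retrieved videos contain a correct target.

    ranked_ids: one ranked list of video ids per query (best match first).
    target_ids: one set of ground-truth video ids per query.
    """
    hits = 0
    for ranking, targets in zip(ranked_ids, target_ids):
        # A query is a hit if any of its top-k results is a ground-truth target.
        if any(video_id in targets for video_id in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Hypothetical toy data: four queries over a three-video gallery.
rankings = [
    ["v3", "v1", "v2"],
    ["v2", "v3", "v1"],
    ["v1", "v2", "v3"],
    ["v2", "v1", "v3"],
]
targets = [{"v3"}, {"v1"}, {"v1"}, {"v1", "v3"}]

r1 = recall_at_k(rankings, targets, k=1)
r2 = recall_at_k(rankings, targets, k=2)
print(f"R@1 = {r1:.2f}, R@2 = {r2:.2f}")
```

On this toy data two of the four queries are answered correctly at rank 1 and three within the top 2, so recall rises as K grows.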
### Formula Representation
The key formulas in the paper are as follows:
- **Scoring Function**:
\[
\Phi: V \times T \times D \to \mathbb{R}
\]
where \( V \) is the video space, \( T \) is the text instruction space, and \( D \) is the video database.
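- **Multimodal Video-Text Embedding**:
\[
q_{v,t} = \Psi_q\{q_v, q_t\} \in \mathbb{R}^d
\]
where \( q_v \) is the reference video, \( q_t \) is the text instruction, and \( \Psi_q \) maps the pair to a single \( d \)-dimensional query embedding. The unimodal encoders are the video encoder \( \Psi_V: V \to \mathbb{R}^d \) and the text encoder \( \Psi_T: T \to \mathbb{R}^d \).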
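- **Cosine Similarity Calculation**:
\[
V_t^q = \arg\max_{v \in D} \frac{\Psi_V(v)^\top \Psi_T(c_t^q)}{\|\Psi_V(v)\| \cdot \|\Psi_T(c_t^q)\|}
\]
where \( c_t^q \) is the target caption generated for the query and \( V_t^q \) is the retrieved video.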
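- **Visual Filtering Step**:
\[
D' = \text{top}_{n_c} \left( \frac{\Psi_V(q_v)^\top \Psi_V(v)}{\|\Psi_V(q_v)\| \cdot \|\Psi_V(v)\|} \right)
\]
where \( D' \) is the candidate set made up of the \( n_c \) videos \( v \in D \) most similar to the query video \( q_v \).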
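The two retrieval formulas above can be chained into a filter-then-re-rank pipeline. The following is a minimal pure-Python sketch, assuming that \( c_t^q \) is the generated target caption and that embeddings from \( \Psi_V \) and \( \Psi_T \) are already available as plain lists of floats. The helper names (`visual_filter`, `rerank_by_caption`), the video ids, and the three-dimensional vectors are hypothetical illustrations, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def visual_filter(query_video_emb, gallery, n_c):
    """Visual filtering: keep the n_c gallery videos closest to the query video (D')."""
    ranked = sorted(
        gallery,
        key=lambda vid: cosine(query_video_emb, gallery[vid]),
        reverse=True,
    )
    return ranked[:n_c]


def rerank_by_caption(caption_emb, gallery, candidates):
    """Re-rank the filtered candidates by similarity to the target-caption embedding."""
    return sorted(
        candidates,
        key=lambda vid: cosine(caption_emb, gallery[vid]),
        reverse=True,
    )


# Hypothetical precomputed video embeddings Psi_V(v) for a four-video gallery D.
gallery = {
    "open_drawer": [1.0, 0.1, 0.0],
    "close_drawer": [0.9, 0.0, 0.5],
    "wash_cup": [0.0, 1.0, 0.1],
    "cut_onion": [0.1, 0.2, 1.0],
}
query_video_emb = [1.0, 0.0, 0.1]     # stands in for Psi_V(q_v)
target_caption_emb = [0.6, 0.0, 0.8]  # stands in for Psi_T(c_t^q)

candidates = visual_filter(query_video_emb, gallery, n_c=2)
ranking = rerank_by_caption(target_caption_emb, gallery, candidates)
print(candidates, ranking[0])
```

In this toy gallery the filter keeps the two drawer clips because they look most like the query video, and the caption embedding then moves `close_drawer` to the top. Restricting the caption-based search to \( D' \) rather than all of \( D \) is what keeps visually unrelated videos out of the final ranking.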
Through these contributions, the paper aims to advance the CVR field and to provide a higher-quality benchmark for evaluating and improving CVR models.