Abstract: Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.

CoVA: Context-aware Visual Attention for Webpage Information Extraction

WIERT: Web Information Extraction via Render Tree

Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Cross-domain Attention Network with Wasserstein Regularizers for E-commerce Search

A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution

Simplified DOM Trees for Transferable Attribute Extraction from the Web

CoVR-2: Automatic Data Construction for Composed Video Retrieval

Modeling Entities as Semantic Points for Visual Information Extraction in the Wild

WebVision Database: Visual Learning and Understanding from Web Data

CoEvo-Net: Coevolution Network for Video Highlight Detection

Task-driven Visual Saliency and Attention-based Visual Question Answering

Latent Visual Context Learning for Web Image Applications

Weakly Supervised Co-Training of Query Rewriting and Semantic Matching for E-Commerce

CARE: Co-Attention Network for Joint Entity and Relation Extraction

Enhanced E-Commerce Attribute Extraction: Innovating with Decorative Relation Correction and LLAMA 2.0-Based Annotation

Leveraging Webpage Classification for Data Object Recognition

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Template-Independent Web Object Extraction

Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models

WebFormer: the Web-page Transformer for Structure Information Extraction