Abstract: A plethora of text-guided image editing methods have recently been developed by leveraging the impressive capabilities of large-scale diffusion-based generative models such as Imagen and Stable Diffusion. However, no standardized evaluation protocol exists for comparing methods across different types of fine-grained edits. To address this gap, we introduce EditVal, a standardized benchmark for quantitatively evaluating text-guided image editing methods. EditVal consists of a curated dataset of images, a set of editable attributes for each image drawn from 13 possible edit types, and an automated evaluation pipeline that uses pre-trained vision-language models to assess the fidelity of generated images for each edit type. We use EditVal to benchmark 8 cutting-edge diffusion-based editing methods, including SINE, Imagic and Instruct-Pix2Pix. We complement this with a large-scale human study in which we show that EditVal's automated evaluation pipeline is strongly correlated with human preferences for the edit types we considered. From both the human study and the automated evaluation, we find that: (i) Instruct-Pix2Pix, Null-Text and SINE are the top-performing methods averaged across different edit types; however, only Instruct-Pix2Pix and Null-Text are able to preserve original image properties; (ii) most of the editing methods fail at edits involving spatial operations (e.g., changing the position of an object); (iii) there is no "winner" method that individually ranks best across a range of different edit types. We hope that our benchmark can pave the way to developing more reliable text-guided image editing tools in the future. We will publicly release EditVal, along with all associated code and human-study templates, to support these research directions at https://deep-ml-research.github.io/editval/.

CVPR 2023 Text Guided Video Editing Competition

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models

EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Exploring Text-Guided Single Image Editing for Remote Sensing Images

ControlVideo: Training-free Controllable Text-to-Video Generation

Edit Temporal-Consistent Videos with Image Diffusion Model

Pix2Video: Video Editing using Image Diffusion

EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods

EffiVED: Efficient Video Editing via Text-instruction Diffusion Models

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Video-P2P: Video Editing with Cross-attention Control

Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

ICDAR 2023 Video Text Reading Competition for Dense and Small Text

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

Consistent Video-to-Video Transfer Using Synthetic Dataset

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models