Abstract: Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.

What problem does this paper attempt to address?

This paper addresses hallucination in text-to-video (T2V) generation models. When these models generate videos from text prompts, they often produce visual elements that are inconsistent with the input description, which undermines the authenticity and reliability of the output. The problem matters most in applications such as content creation, education, and simulation systems, where generated content must strictly follow the input text.

To address this, the authors introduce ViBe, a large-scale text-to-video benchmark designed to evaluate hallucination in T2V models. ViBe identifies five main types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, the authors built the first large-scale dataset of hallucinated videos, containing 3,782 human-annotated videos, each assigned to one of these five categories.

ViBe also provides a standardized framework for quantifying hallucination, defined here as the deviation or misrepresentation between the generated visual content and the input text. By supplying both a dataset and evaluation criteria, ViBe supports the evaluation of current T2V models and offers a basis for future work on mitigation, with the goal of T2V models whose output matches the semantics of the input prompt more closely.
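The abstract reports accuracy and F1 for a classifier over the five hallucination categories. As a rough illustration of how such scores can be computed, the sketch below evaluates a handful of made-up predictions. The toy labels are hypothetical, and macro averaging of F1 is an assumption, since the abstract does not say which averaging scheme the authors used.

```python
# The five hallucination categories named in the abstract.
CATEGORIES = [
    "Vanishing Subject",
    "Numeric Variability",
    "Temporal Dysmorphia",
    "Omission Error",
    "Physical Incongruity",
]


def accuracy(y_true, y_pred):
    """Fraction of videos whose predicted category matches the annotation."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=CATEGORIES):
    """Unweighted mean of per-category F1 (assumed averaging scheme)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A category with no true or predicted instances scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical annotations and predictions for six videos (not real ViBe data).
y_true = [
    "Vanishing Subject", "Numeric Variability", "Temporal Dysmorphia",
    "Omission Error", "Physical Incongruity", "Vanishing Subject",
]
y_pred = [
    "Vanishing Subject", "Numeric Variability", "Omission Error",
    "Omission Error", "Vanishing Subject", "Vanishing Subject",
]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

For context, uniform random guessing over five balanced classes would give an expected accuracy of 0.2, which is one way to read the reported 0.345.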

ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
