Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models

Li Sun, Liuan Wang, Jun Sun, Takayuki Okatani
2024-01-18
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced the comprehension of multimedia content, bringing together diverse modalities such as text, images, and videos. However, a critical challenge faced by these models, especially when processing video inputs, is the occurrence of hallucinations - erroneous perceptions or interpretations, particularly at the event level. This study introduces an innovative method to address event-level hallucinations in MLLMs, focusing on specific temporal understanding in video content. Our approach leverages a novel framework that extracts and utilizes event-specific information from both the event query and the provided video to refine MLLMs' response. We propose a unique mechanism that decomposes on-demand event queries into iconic actions. Subsequently, we employ models like CLIP and BLIP2 to predict specific timestamps for event occurrences. Our evaluation, conducted using the Charades-STA dataset, demonstrates a significant reduction in temporal hallucinations and an improvement in the quality of event-related responses. This research not only provides a new perspective in addressing a critical limitation of MLLMs but also contributes a quantitatively measurable method for evaluating MLLMs in the context of temporal-related questions.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses event-level hallucination in multimodal large language models (MLLMs) that process video inputs. Specifically, it targets the errors these models make when localizing events in time and when predicting the order in which events occur.

### 1. Research Background

In recent years, multimodal large language models (MLLMs) have made significant progress in understanding multimedia content and can integrate information from multiple modalities such as text, images, and video. However, when processing video inputs, especially at the event level, these models are prone to hallucination: they misjudge when events happen and in what order.

### 2. Main Problems

The paper points out that existing research mainly focuses on correcting object-level hallucination and overlooks event-level hallucination. Event-level hallucination occurs when the model produces incorrect timestamps or event sequences in response to a query about a specific event in a video. This kind of hallucination seriously degrades the model's answers to questions about event timing.

### 3. Solutions

To mitigate event-level hallucination, the paper proposes a method with the following steps:

1. **Event Decomposition**: Decompose the user's event query into several representative actions (iconic actions), which are key behaviors that are easy to identify in the video.
2. **Frame Matching**: Use external tools such as CLIP and BLIP2 to find the video frames most likely to contain these actions, based on the decomposed action descriptions, and predict the timestamps of those frames.
3. **Generate Claim**: Based on the extracted time information, generate a claim for correcting the MLLM's answer. The claim carries the predicted time information, which helps the model answer questions about event timing and order more precisely.
4. **Response Correction**: Combine the user's query, the MLLM's original answer, and the generated claim, and use a tool such as GPT-3.5-turbo to generate a new, corrected answer.

### 4. Experimental Results

The paper evaluates the method on the Charades-STA dataset. The results show that it significantly reduces hallucination in event time prediction and improves the accuracy of event sequence prediction. Compared with random prediction and the baseline model Video-LLaMA, this method achieves accuracies of 57.66% and 85.29% on the R@1 and R@5 metrics respectively.

### 5. Conclusions

This research provides an effective method for mitigating event-level hallucination in multimodal large language models that process video content, and it offers new perspectives and evaluation criteria for future research. By introducing event decomposition and frame matching, the model can answer video event queries with higher accuracy and reliability.

---

In summary, the main contribution of this paper is a novel framework that targets event-level hallucination in multimodal large language models processing video inputs, together with experiments that demonstrate its effectiveness.
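
The frame-matching and claim-generation steps listed under Solutions can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a vision-language model such as CLIP or BLIP2 has already produced one similarity score per sampled frame for each iconic action, and it substitutes hard-coded placeholder scores for that model output. The function names (`best_frame_timestamps`, `event_span`, `make_claim`), the rule of taking the earliest and latest matched frames as the event span, and the claim wording are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def best_frame_timestamps(
    scores: Dict[str, List[float]], fps: float
) -> Dict[str, float]:
    """Map each iconic action to the timestamp (in seconds) of its best frame.

    `scores[action][i]` is an assumed similarity between the action text and
    sampled frame i; `fps` is the number of sampled frames per second.
    """
    timestamps = {}
    for action, frame_scores in scores.items():
        # Index of the frame with the highest similarity score.
        best_index = max(range(len(frame_scores)), key=frame_scores.__getitem__)
        timestamps[action] = best_index / fps
    return timestamps


def event_span(timestamps: Dict[str, float]) -> Tuple[float, float]:
    """Assumed rule: the event spans the earliest to the latest matched frame."""
    return min(timestamps.values()), max(timestamps.values())


def make_claim(event: str, span: Tuple[float, float]) -> str:
    """Format the predicted span as a textual claim (hypothetical wording)."""
    start, end = span
    return f"The event '{event}' occurs from about {start:.1f}s to {end:.1f}s."


# Placeholder scores standing in for real model output, 2 sampled frames/second.
example_scores = {
    "opening a door": [0.1, 0.7, 0.3, 0.2],
    "walking through a doorway": [0.1, 0.2, 0.3, 0.8],
}
example_timestamps = best_frame_timestamps(example_scores, fps=2.0)
example_claim = make_claim(
    "a person opens a door and walks through it",
    event_span(example_timestamps),
)
print(example_claim)
```

In the pipeline described above, a claim like this would be passed, together with the user's query and the MLLM's original answer, to a language model such as GPT-3.5-turbo for response correction. That final step is omitted here because it depends on an external service.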