Abstract: Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, to accurately localize roles with fixed verb-centric template input, and to achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.

What problem does this paper attempt to address?

This paper addresses several key challenges in zero-shot grounded situation recognition (ZS-GSR). ZS-GSR requires the model not only to recognize the salient activity (verb) in the image, but also to detect all semantic roles participating in the activity and to identify the nouns that fill these roles. Category-based prompting methods have the following limitations on this task:

1. **Ambiguous Action Concepts**: Pre-trained vision-language models (VLMs) such as CLIP have difficulty capturing the nuances of actions when using category-based prompts for verb classification, leading to misclassification.
2. **Restricted Role Localization**: The fixed verb-centric template makes the localization of semantic roles less flexible and less accurate. Misalignment and localization errors are especially likely for unfamiliar or complex verb categories.
3. **Lack of Context in Noun Prediction**: Category-based prompts focus too much on the most prominent category when classifying nouns and ignore the context of the specific scene, so the predicted nouns do not match the actual semantic roles.

To solve these problems, the authors propose to enhance the model's understanding of different categories through the Language EXplainer (LEX). LEX improves verb recognition, semantic role localization, and noun recognition through three explainers (verb explainer, grounding explainer, and noun explainer), respectively. These explainers use large language models (LLMs) to generate richer descriptions, thereby significantly improving the model's performance in zero-shot situation recognition.

### Specific Problems and Solutions

1. **Ambiguous Action Concepts**
   - **Solution**: Introduce the Verb Explainer to generate verb-centric descriptions from multiple angles and sharpen the distinction between verb categories. For example, for the verb "studying", the explainer can generate descriptions such as "books or notes displayed", which help the model better understand the meaning of the verb.
2. **Restricted Role Localization**
   - **Solution**: Design the Grounding Explainer to rephrase the verb-centric template, making the localization of semantic roles clearer and more accurate. For example, for the verb "climb", the explainer can generate descriptions such as "At a PLACE, the AGENT overcomes a challenge OBSTACLE using a TOOL", which better guide the model in localizing the relevant roles.
3. **Lack of Context in Noun Prediction**
   - **Solution**: Introduce the Noun Explainer to generate scene-specific noun descriptions conditioned on the verb and semantic role, so that noun classification conforms to the scene context. For example, for the scene "woman applying lotion", the explainer can generate descriptions such as "lotion applied in dots or spread thinly across the face", so that the more prominent hand is not mistakenly predicted as the noun.

Through these improvements, the LEX framework significantly improves the performance and interpretability of zero-shot situation recognition.
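The idea of classifying with generated descriptions rather than bare class names can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are hand-made toy vectors standing in for VLM image and text features, the verb "reading" and the helper names `cosine`, `score_classes`, and `predict` are hypothetical, and averaging the per-description similarities is an assumed aggregation rule that the summary above does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_classes(image_emb, description_embs):
    """Score each class by the mean similarity between the image embedding
    and the embeddings of that class's descriptions (assumed aggregation)."""
    return {
        cls: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for cls, descs in description_embs.items()
    }


def predict(image_emb, description_embs):
    """Return the class whose descriptions best match the image."""
    scores = score_classes(image_emb, description_embs)
    return max(scores, key=scores.get)


# Toy stand-ins for text embeddings of LLM-generated descriptions.
# In practice these would come from a VLM text encoder; here they are
# hand-made 3-d vectors chosen only for illustration.
TOY_DESCRIPTIONS = {
    "studying": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],  # e.g. "books or notes displayed"
    "reading": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],   # hypothetical second verb
}

# Toy stand-in for the image embedding.
image = [0.85, 0.15, 0.05]

predicted_verb = predict(image, TOY_DESCRIPTIONS)
print(predicted_verb)
```

The same scoring pattern would apply to noun recognition, with the description set conditioned on the predicted verb and the semantic role instead of on the verb alone.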

Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination

Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

Good Questions Help Zero-Shot Image Reasoning

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition

Grounding Descriptions in Images informs Zero-Shot Visual Recognition

VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Zero-shot Cross-lingual Conversational Semantic Role Labeling

Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following

VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Grounded 3D-LLM with Referent Tokens