Abstract: Humans observe various actions being performed by other humans (physically or in videos/images) and can draw a wide range of inferences about them beyond what they can visually perceive. Such inferences include determining the aspects of the world that make action execution possible (e.g. liquid objects can undergo pouring), predicting how the world will change as a result of the action (e.g. potatoes being golden and crispy after frying), high-level goals associated with the action (e.g. beat the eggs to make an omelet) and reasoning about actions that possibly precede or follow the current action (e.g. crack eggs before whisking or draining pasta after boiling). Similar reasoning ability is highly desirable in autonomous systems that would assist us in performing everyday tasks. To that end, we propose a multi-modal task to learn the aforementioned concepts about actions being performed in images. We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images, collected from an annotated cooking-video dataset. We propose ActionCOMET, a zero-shot framework to discern knowledge present in language models specific to the provided visual input. We present baseline results of ActionCOMET over the collected dataset and compare them with the performance of the best existing VQA approaches.

What problem does this paper attempt to address?

The paper asks how autonomous systems can acquire human-like commonsense reasoning about the actions shown in images. It proposes a novel multimodal task: learning action-related commonsense concepts (preconditions, effects, high-level goals, and preceding and following actions) from images, so that autonomous systems can better understand and perform everyday tasks.

### Specific description of the problem

1. **Human commonsense reasoning abilities**:
   - By observing the actions of others, in person or in videos and images, humans infer information beyond what is visually perceptible. They can judge whether an object's properties allow an action to occur (a liquid can be poured), predict the result of an action (potatoes become golden and crispy after frying), understand the high-level goal behind an action (eggs are beaten to make an omelet), and reason about actions that may occur before or after the current one (cracking eggs before whisking, or draining pasta after boiling).
2. **Challenges for autonomous systems**:
   - Autonomous systems need similar commonsense reasoning abilities to assist humans effectively with daily tasks in complex environments. However, existing models, including large language models, struggle with reasoning tasks, especially those involving complex physical, spatial, temporal, visual, and causal relationships.
3. **Limitations of existing benchmarks**:
   - Most existing commonsense benchmarks focus on surface associations and ignore deeper physical, spatial, temporal, visual, and causal understanding. Typical examples of such surface associations are "a person cooks when hungry" and "a person calls the police in order to abide by the law".

### Solutions in the paper

To address these problems, the paper makes the following contributions:

- **Proposing a novel multimodal task**: Learning the preconditions, effects, high-level goals, and preceding and following actions of the actions shown in images.
- **Creating a dataset**: Constructing a dataset of 8,500 images and 59,300 image-grounded action inferences, collected from annotated cooking videos.
- **Developing the ActionCOMET framework**: Proposing a zero-shot method for identifying the knowledge in a language model that is specific to the input image.
- **Experimental verification**: Conducting ablation experiments with diverse input prompts, reporting quantitative and qualitative results, and comparing against state-of-the-art VQA systems.

### Formula representation

No detailed formulas are needed to state the problem, but the task can be written compactly:

- Let \( V \) denote the sequence of visual embeddings of an image, \( t \) the text description, \( p \) the action-object pair, and \( r \) the inference type. The goal is to generate a set of possible inferences \( H=\{s^r_1, s^r_2,\ldots, s^r_{|H|}\} \).

In this way, the paper aims to improve the ability of autonomous systems to understand and reason about everyday actions, making them more capable and practical.
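The formulation above, an input \( (V, t, p, r) \) mapped to a set of inferences \( H \), can be made concrete with a small sketch. This is only an illustration of the task interface, not the authors' implementation: the names `InferenceType`, `ActionQuery`, `build_prompt`, and `generate_inferences`, the prompt wording, and the `generator` callable standing in for a vision-language model are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class InferenceType(Enum):
    """The inference type r; the five categories named in the summary."""
    PRECONDITION = "precondition"
    EFFECT = "effect"
    GOAL = "high-level goal"
    PRECEDING_ACTION = "preceding action"
    FOLLOWING_ACTION = "following action"


@dataclass(frozen=True)
class ActionQuery:
    """One task input (V, t, p, r). Field names are hypothetical."""
    visual_embeddings: Sequence[Sequence[float]]  # V: one vector per image region/patch
    description: str                              # t: text description of the image
    action_object: Tuple[str, str]                # p: (action, object) pair
    inference_type: InferenceType                 # r: which kind of inference to produce


def build_prompt(query: ActionQuery) -> str:
    """Turn the textual parts of the query into a prompt (assumed wording)."""
    action, obj = query.action_object
    return (
        f"Image description: {query.description}\n"
        f"Action: {action} {obj}\n"
        f"List one {query.inference_type.value} of this action:"
    )


# Hypothetical model interface: takes V and a prompt, returns candidate sentences.
Generator = Callable[[Sequence[Sequence[float]], str], List[str]]


def generate_inferences(
    query: ActionQuery, generator: Generator, max_inferences: int = 5
) -> List[str]:
    """Build H = {s_1, ..., s_|H|}: cleaned, de-duplicated candidates, in order."""
    candidates = generator(query.visual_embeddings, build_prompt(query))
    inferences: List[str] = []
    for sentence in candidates:
        sentence = sentence.strip()
        if sentence and sentence not in inferences:
            inferences.append(sentence)
    return inferences[:max_inferences]


def stub_generator(visual_embeddings, prompt):
    """Placeholder for a real model; returns fixed candidates for illustration."""
    return ["crack the eggs", " crack the eggs ", "", "get a bowl"]


example_query = ActionQuery(
    visual_embeddings=[[0.1, 0.2]],
    description="a person whisks eggs in a bowl",
    action_object=("whisk", "eggs"),
    inference_type=InferenceType.PRECEDING_ACTION,
)
example_inferences = generate_inferences(example_query, stub_generator)
```

Keeping the inference type as an explicit field means the same image and action-object pair can be queried once per category, which mirrors how the dataset attaches several inferences to each image.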

ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions
