Abstract: In this work, following the intuition that adverbs describing scene-sequences are best identified by reasoning over high-level concepts of object-behaviour, we propose the design of a new framework that reasons over object-behaviours extracted from raw video-clips to recognize the clip's corresponding adverb-types. Importantly, while previous works for general scene adverb-recognition assume knowledge of the clip's underlying action-types, our method is directly applicable in the more general problem setting where the action-type of a video-clip is unknown. Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour-facts from raw video clips and propose novel symbolic and transformer-based reasoning methods that operate over these extracted facts to identify adverb-types. Experimental results demonstrate that our proposed methods perform favourably against the previous state-of-the-art. Additionally, to support efforts in symbolic video-processing, we release two new datasets of object-behaviour-facts extracted from raw video clips: the MSR-VTT-ASP and ActivityNet-ASP datasets.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the problem of adverb-type recognition in video clips. Specifically, the authors propose a new framework that extracts object-behavior facts from raw video clips and reasons over these facts to identify the adverb types that describe each clip. Unlike traditional methods based on 3D Convolutional Neural Networks (3D CNNs), this framework does not rely on I3D encodings; instead, it achieves better adverb-type recognition performance by reasoning over high-level concepts of object behavior.

### Main Contributions

1. **New Framework Design**: A novel adverb-type recognition framework is proposed, which extracts object-behavior facts from raw video clips, generates high-level behavior summaries by reasoning over these facts, and ultimately predicts adverb types.
2. **Dataset Release**: Two new datasets, MSR-VTT-ASP and ActivityNet-ASP, are released, containing object-behavior facts extracted from raw video clips.
3. **Reasoning Method**: A Transformer-based reasoning method is proposed to generate high-level object-behavior summaries from the extracted facts.
4. **Experimental Validation**: Experimental results show that this method outperforms previous state-of-the-art methods on the MSR-VTT and ActivityNet datasets, demonstrating the effectiveness of the reasoning-based approach to adverb-type recognition.

### Method Overview

1. **Extraction Phase**:
   - Use MaskRCNN to detect objects and their behaviors in video clips.
   - Calculate the optical flow properties of each detected object, including optical flow magnitude and angle.
   - Filter detection results through a non-overlapping sliding window, retaining objects with faster movements.
   - Record detected object behaviors as Answer Set Programming (ASP) facts.
2. **Reasoning Phase**:
   - **Single-Step Symbolic Baseline**: Use FastLAS to automatically learn indicative rules defining adverb types.
   - **Transformer-Based Reasoning**: Convert object-behavior facts into flat representations, then train a Transformer model with the Masked Language Modeling (MLM) objective to generate behavior summary vectors.
3. **Prediction Phase**:
   - Concatenate the behavior summary vectors with the action-type embedding vectors of the video clips, and feed the result into a Support Vector Machine (SVM) for binary classification between each adverb and its antonym.
   - During testing, aggregate the predictions for multiple object behaviors through majority voting to determine the adverb type of the video clip.

### Experimental Results

- **Single-Step Symbolic Baseline**: Performs best in distinguishing adverbs like "gently" and "firmly," but performs poorly on more complex adverb types such as "periodically" and "continuously."
- **Transformer-Based Reasoning**: DistilBERT, ALBERT, and BERT perform well across all tasks, significantly outperforming the symbolic baseline.

### Conclusion

This paper proposes a novel adverb-type recognition framework that improves recognition performance by extracting object-behavior facts from raw video clips and performing high-level reasoning over them. Experimental results support the effectiveness of this approach.
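
### Illustrative Sketch

To make the Method Overview more concrete, the following is a minimal Python sketch of two of its steps: turning an object's optical flow into magnitude and angle values recorded as ASP-style facts, and aggregating per-object predictions by majority voting. It is not the authors' implementation. The predicate names (`object`, `flow_magnitude`, `flow_angle`), the function names, the rounding to integers, and the use of a single mean flow vector per object and window are all assumptions made for illustration; the summary above does not specify the actual fact schema.

```python
import math
from collections import Counter


def flow_properties(dx, dy):
    """Return (magnitude, angle_in_degrees) for a mean optical-flow vector.

    The angle is normalised to the range [0, 360).
    """
    magnitude = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    return magnitude, angle


def to_asp_facts(object_id, label, window, dx, dy):
    """Render one object's behaviour in one window as ASP-style fact strings.

    Hypothetical schema: predicate names and integer rounding are
    assumptions, not the format used in the released datasets.
    """
    magnitude, angle = flow_properties(dx, dy)
    return [
        f"object({object_id},{label}).",
        f"flow_magnitude({object_id},{window},{round(magnitude)}).",
        f"flow_angle({object_id},{window},{round(angle)}).",
    ]


def majority_vote(predictions):
    """Return the most frequent label among per-object predictions.

    Ties are resolved in favour of the label seen first.
    """
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    # One detected "person" in window 0 with mean flow vector (3, 4).
    for fact in to_asp_facts(1, "person", 0, 3.0, 4.0):
        print(fact)
    # Clip-level decision from three per-object adverb predictions.
    print(majority_vote(["slowly", "quickly", "slowly"]))
```

In a full pipeline, the per-object labels passed to `majority_vote` would come from the binary adverb-versus-antonym classifier described in the Prediction Phase; here they are hard-coded placeholders.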

Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition

Learning to Discretely Compose Reasoning Module Networks for Video Captioning

Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models

Explainable Video Action Reasoning via Prior Knowledge and State Transitions

Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events

Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Look, Remember and Reason: Grounded reasoning in videos with language models

Towards Neuro-Symbolic Video Understanding

Reasoning-Enhanced Object-Centric Learning for Videos

Visual Explanation by High-Level Abduction: On Answer-Set Programming Driven Reasoning about Moving Objects

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

VideoABC: A Real-World Video Dataset for Abductive Visual Reasoning

Further Understanding Videos through Adverbs: A New Video Task

Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Slot Abstractors: Toward Scalable Abstract Visual Reasoning

Anticipating Object State Changes in Long Procedural Videos

Multi-modal Action Chain Abductive Reasoning

Learning Action Changes by Measuring Verb-Adverb Textual Relationships

Language Model Guided Interpretable Video Action Reasoning

Video In Sentences Out