Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition

Amrit Diggavi Seshadri, Alessandra Russo
2024-03-28
Abstract: In this work, following the intuition that adverbs describing scene-sequences are best identified by reasoning over high-level concepts of object-behaviour, we propose the design of a new framework that reasons over object-behaviours extracted from raw video clips to recognize the clip's corresponding adverb-types. Importantly, while previous works for general scene adverb-recognition assume knowledge of the clip's underlying action-types, our method is directly applicable in the more general problem setting where the action-type of a video clip is unknown. Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour-facts from raw video clips and propose novel symbolic and transformer-based reasoning methods that operate over these extracted facts to identify adverb-types. Experimental results demonstrate that our proposed methods perform favourably against the previous state-of-the-art. Additionally, to support efforts in symbolic video-processing, we release two new datasets of object-behaviour-facts extracted from raw video clips - the MSR-VTT-ASP and ActivityNet-ASP datasets.
Computer Vision and Pattern Recognition, Artificial Intelligence, Symbolic Computation
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses adverb-type recognition in video clips. The authors propose a framework that extracts object behavior facts from raw video clips and reasons over those facts to identify each clip's adverb types. Unlike earlier methods built on 3D Convolutional Neural Networks (3D CNNs), the framework does not rely on I3D encodings; it instead achieves better adverb-type recognition by reasoning over high-level concepts of object behavior.

### Main Contributions

1. **New Framework Design**: A novel adverb-type recognition framework is proposed. It extracts object behavior facts from raw video clips, generates high-level behavior summaries by reasoning over these facts, and finally predicts adverb types.
2. **Dataset Release**: Two new datasets, MSR-VTT-ASP and ActivityNet-ASP, are released, containing object behavior facts extracted from raw video clips.
3. **Reasoning Method**: A Transformer-based reasoning method is proposed to generate high-level object behavior summaries from the extracted facts.
4. **Experimental Validation**: Experimental results show that this method outperforms previous state-of-the-art methods on the MSR-VTT and ActivityNet datasets, demonstrating the effectiveness of the reasoning-based approach to adverb-type recognition.

### Method Overview

1. **Extraction Phase**:
   - Use MaskRCNN to detect objects and their behaviors in video clips.
   - Calculate the optical flow properties of each detected object, including optical flow magnitude and angle.
   - Filter detection results through a non-overlapping sliding window, retaining objects with faster movements.
   - Record detected object behaviors as Answer Set Programming (ASP) facts.
2. **Reasoning Phase**:
   - **Single-Step Symbolic Baseline**: Use FastLAS to automatically learn indicative rules defining adverb types.
   - **Transformer-Based Reasoning**: Convert object behavior facts into flat representations, then train a Transformer model with Masked Language Modeling (MLM) to generate behavior summary vectors.
3. **Prediction Phase**:
   - Concatenate behavior summary vectors with action-type embedding vectors of the video clips, and feed the result into a Support Vector Machine (SVM) for binary classification between each adverb and its antonym.
   - During testing, aggregate the predictions for multiple object behaviors through majority voting to determine the adverb type of the video clip.

### Experimental Results

- **Single-Step Symbolic Baseline**: Performs best in distinguishing adverbs like "gently" and "firmly," but performs poorly on more complex adverb types such as "periodically" and "continuously."
- **Transformer-Based Reasoning**: DistilBERT, ALBERT, and BERT perform excellently across all tasks, significantly outperforming the symbolic baseline method.

### Conclusion

This paper proposes a novel adverb-type recognition framework that improves recognition performance by extracting object behavior facts from raw video clips and performing high-level reasoning over them. Experimental results validate the effectiveness of this approach.
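
### Illustrative Sketch

The extraction and voting steps listed in the Method Overview can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that per-object flow vectors `(dx, dy)` are already available from some detector and optical-flow estimator, it keeps only the single fastest detection per window (one possible reading of "retaining objects with faster movements"), and the predicate names `detected`, `flow_mag`, and `flow_angle` are hypothetical and do not reflect the format of the released MSR-VTT-ASP or ActivityNet-ASP datasets.

```python
import math
from collections import Counter


def flow_properties(dx, dy):
    """Return (magnitude, angle in degrees within [0, 360)) of a flow vector."""
    magnitude = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    return magnitude, angle


def fastest_per_window(detections, window_size):
    """Keep the fastest detection in each non-overlapping window of frames.

    `detections` is a list of dicts with the hypothetical keys
    "frame", "obj", "dx", and "dy".
    """
    best = {}
    for det in detections:
        window = det["frame"] // window_size  # non-overlapping window index
        mag, ang = flow_properties(det["dx"], det["dy"])
        if window not in best or mag > best[window][1]:
            best[window] = (det["obj"], mag, ang)
    return [(w, *best[w]) for w in sorted(best)]


def to_asp_facts(selected):
    """Render the selected detections as ASP-style facts (hypothetical predicates)."""
    facts = []
    for window, obj, mag, ang in selected:
        facts.append(f"detected({obj},{window}).")
        facts.append(f"flow_mag({obj},{window},{round(mag)}).")
        facts.append(f"flow_angle({obj},{window},{round(ang)}).")
    return facts


def majority_vote(predictions):
    """Return the most common label; on a tie, the label seen first wins."""
    return Counter(predictions).most_common(1)[0][0]


# Toy input: four frames, two objects, window size of two frames.
detections = [
    {"frame": 0, "obj": "person", "dx": 3, "dy": 4},
    {"frame": 1, "obj": "dog", "dx": 1, "dy": 0},
    {"frame": 2, "obj": "dog", "dx": 0, "dy": -2},
    {"frame": 3, "obj": "person", "dx": 1, "dy": 1},
]
facts = to_asp_facts(fastest_per_window(detections, window_size=2))
clip_label = majority_vote(["slowly", "quickly", "slowly"])
```

Flow magnitudes and angles are rounded to integers here only because ASP solvers generally work with integer terms; the discretisation used in the paper may differ.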