Abstract: We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real world. While significant efforts have been made towards anticipating future actions, prior approaches do not take into account the aspect of plausibility in an action sequence. To address this limitation, we explore the generative capability of a large video-language model in our work and further develop the understanding of plausibility in an action sequence by introducing two objective functions: a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We utilize temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences and use them to train the model with the plausible action sequence learning loss. This loss helps the model to differentiate between plausible and implausible action sequences and also helps the model to learn implicit temporal cues crucial for the task of action anticipation. The long-horizon action repetition loss puts a higher penalty on the actions that are more prone to repetition over a longer temporal window. With this penalization, the model is able to generate diverse, plausible action sequences. We evaluate our approach on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the task of action anticipation.

What problem does this paper attempt to address?

This paper addresses how to ensure that the future action sequences anticipated from a video are plausible in the real world. Existing action anticipation methods can predict future actions, but they often overlook whether the predicted sequence respects the logical order and temporal relationships of real-life activities. For example, when cooking eggs, "breaking the eggs" must occur before "frying the eggs"; a sequence in the reverse order is implausible. To address this, the paper proposes PlausiVL, which uses a large video-language model to generate plausible and diverse future action sequences.

PlausiVL introduces two loss functions:

1. **Plausible Action Sequence Learning Loss (\(L_{\text{plau}}\))**: This loss trains the model to distinguish plausible action sequences from implausible (counterfactual) ones. Counterfactual sequences are created using temporal logical constraints and verb-noun action pair logical constraints, and the model is trained to tell them apart from plausible sequences. This also helps the model learn the implicit temporal cues between actions that matter for action anticipation.
2. **Long-Horizon Action Repetition Loss (\(L_{\text{rep}}\))**: This loss discourages the model from repeating the same action over a long temporal window. It places a higher penalty on actions that are more prone to repetition over a longer horizon, which leads to more diverse predicted sequences.

Together, these two losses help PlausiVL capture the temporal relationships in action sequences and generate future action predictions that are both plausible and diverse.
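
The counterfactual construction and the plausibility objective described above can be illustrated with a small sketch. The summary does not give the exact constraint rules or the formula for \(L_{\text{plau}}\), so everything here is an assumption made for illustration: the function names, the `must_precede` and `valid_nouns` tables, and the hinge-style margin loss standing in for the paper's loss are all hypothetical.

```python
import random

Action = tuple[str, str]  # (verb, noun), e.g. ("crack", "egg")


def temporal_counterfactual(seq, must_precede):
    """Break one temporal constraint by swapping a correctly ordered pair.

    `must_precede` is a hypothetical list of (earlier, later) action pairs.
    Returns a new sequence, or None if no constraint applies.
    """
    out = list(seq)
    for earlier, later in must_precede:
        if earlier in out and later in out:
            i, j = out.index(earlier), out.index(later)
            if i < j:
                out[i], out[j] = out[j], out[i]
                return out
    return None


def verb_noun_counterfactual(seq, valid_nouns, all_nouns, rng):
    """Break one verb-noun constraint by pairing a verb with a noun it cannot take.

    `valid_nouns` is a hypothetical table mapping each verb to its plausible nouns.
    Returns a new sequence, or None if no verb can be given an implausible noun.
    """
    out = list(seq)
    for idx, (verb, _noun) in enumerate(out):
        if verb not in valid_nouns:
            continue
        implausible = sorted(all_nouns - valid_nouns[verb])
        if implausible:
            out[idx] = (verb, rng.choice(implausible))
            return out
    return None


def plausibility_margin_loss(score_plausible, score_counterfactual, margin=0.5):
    """Hinge-style stand-in (an assumption, not the paper's formula):
    zero once the plausible sequence outscores the counterfactual by `margin`."""
    return max(0.0, margin - (score_plausible - score_counterfactual))


if __name__ == "__main__":
    seq = [("crack", "egg"), ("whisk", "egg"), ("fry", "egg")]
    constraints = [(("crack", "egg"), ("fry", "egg"))]
    print(temporal_counterfactual(seq, constraints))

    table = {"crack": {"egg"}, "whisk": {"egg"}, "fry": {"egg"}}
    print(verb_noun_counterfactual(seq, table, {"egg", "pan"}, random.Random(0)))

    print(plausibility_margin_loss(0.9, 0.2), plausibility_margin_loss(0.2, 0.9))
```

In this sketch the scores passed to `plausibility_margin_loss` would come from whatever the model uses to rate how well a sequence matches the video; that scoring step is outside what the summary specifies.
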
The paper evaluates PlausiVL on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and reports improvements over existing baselines on the action anticipation task.
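
The idea behind the long-horizon action repetition loss, a penalty that grows for actions repeated over a longer window, can be sketched as a per-step weighting of a standard negative log-likelihood. This is only one possible reading: the weighting rule `1 + gamma * (earlier occurrences)`, the parameter `gamma`, and both function names are assumptions, not the paper's definition of \(L_{\text{rep}}\).

```python
import math
from collections import Counter


def repetition_weights(actions, gamma=0.5):
    """Weight each step by how often the same action already appeared.

    First occurrence gets weight 1.0; every earlier repeat adds `gamma`.
    (Hypothetical rule chosen for illustration.)
    """
    seen = Counter()
    weights = []
    for action in actions:
        weights.append(1.0 + gamma * seen[action])
        seen[action] += 1
    return weights


def weighted_nll(step_probs, weights):
    """Mean negative log-likelihood with per-step weights.

    `step_probs[t]` is the probability the model gives to the target action at step t.
    """
    if not step_probs:
        return 0.0
    total = sum(-w * math.log(p) for p, w in zip(step_probs, weights))
    return total / len(step_probs)


if __name__ == "__main__":
    repetitive = ["stir", "stir", "stir", "pour"]
    diverse = ["crack", "whisk", "fry", "serve"]
    probs = [0.5, 0.5, 0.5, 0.5]
    print(weighted_nll(probs, repetition_weights(repetitive)))
    print(weighted_nll(probs, repetition_weights(diverse)))
```

With equal per-step probabilities, the repetitive sequence receives the larger loss, which is the behaviour the summary attributes to the repetition penalty.
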

Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
