Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, Kwonjoon Lee
2024-05-31
Abstract: We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real world. While significant efforts have been made towards anticipating future actions, prior approaches do not take into account the aspect of plausibility in an action sequence. To address this limitation, we explore the generative capability of a large video-language model in our work and further develop the understanding of plausibility in an action sequence by introducing two objective functions, a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We utilize temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences and use them to train the model with plausible action sequence learning loss. This loss helps the model to differentiate between plausible and not plausible action sequences and also helps the model to learn implicit temporal cues crucial for the task of action anticipation. The long-horizon action repetition loss puts a higher penalty on the actions that are more prone to repetition over a longer temporal window. With this penalization, the model is able to generate diverse, plausible action sequences. We evaluate our approach on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the task of action anticipation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to ensure that the future action sequences anticipated from a video are plausible in the real world. Existing action anticipation methods can predict future actions, but they often overlook whether the predicted sequence respects the logical order and temporal relationships of real life. For example, when cooking eggs, "breaking the eggs" must occur before "frying the eggs"; a sequence in the reverse order is implausible. To address this, the paper proposes PlausiVL, a method that uses a large video-language model to generate plausible and diverse future action sequences. PlausiVL introduces two loss functions:

1. **Plausible Action Sequence Learning Loss (\(L_{\text{plau}}\))**: This loss teaches the model to distinguish plausible action sequences from implausible (counterfactual) ones. Counterfactual sequences are created using temporal logical constraints and verb-noun action pair logical constraints, and the model is trained to tell them apart from plausible sequences. This also helps the model learn the implicit temporal cues between actions that action anticipation depends on.
2. **Long-Horizon Action Repetition Loss (\(L_{\text{rep}}\))**: This loss discourages the model from repeating the same action over a long time range. It places a higher penalty on actions that are more prone to repetition over a longer temporal window, which leads to more diverse predicted sequences.

Together, these two losses let PlausiVL capture the temporal relationships in action sequences and generate future action predictions that are both plausible and diverse.
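The two ideas above can be illustrated with a small, self-contained sketch. It is not the paper's implementation: the helper names, the string representation of actions as "verb noun", the contrastive form chosen for \(L_{\text{plau}}\), and the linearly growing step weight standing in for \(L_{\text{rep}}\) are all assumptions made for illustration.

```python
import math


def temporal_counterfactual(sequence, precondition):
    """Swap a prerequisite action with the action that depends on it.

    `precondition` is a hypothetical (earlier, later) rule such as
    ("break egg", "fry egg"). Returns None when the rule does not apply.
    """
    earlier, later = precondition
    if earlier not in sequence or later not in sequence:
        return None
    i, j = sequence.index(earlier), sequence.index(later)
    if i > j:
        return None  # order is already violated; nothing to flip
    flipped = list(sequence)
    flipped[i], flipped[j] = flipped[j], flipped[i]
    return flipped


def verb_noun_counterfactual(sequence, position, bad_verb):
    """Replace the verb at `position` with one assumed implausible for its noun."""
    _verb, noun = sequence[position].split(" ", 1)
    corrupted = list(sequence)
    corrupted[position] = f"{bad_verb} {noun}"
    return corrupted


def plausibility_contrastive_loss(sim_plausible, sim_counterfactual, temperature=0.1):
    """Assumed contrastive form: -log softmax of the plausible similarity.

    The inputs stand for similarity scores between a video representation
    and the plausible / counterfactual sequence representations.
    """
    pos = sim_plausible / temperature
    neg = sim_counterfactual / temperature
    m = max(pos, neg)  # subtract the max for numerical stability
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos


def repetition_weighted_nll(step_log_probs, gamma=0.5):
    """Assumed stand-in for the repetition penalty.

    Averages per-step negative log-likelihood with weight 1 + gamma * t,
    so mistakes later in the horizon cost more than early ones.
    """
    weights = [1.0 + gamma * t for t in range(len(step_log_probs))]
    total = sum(w * -lp for w, lp in zip(weights, step_log_probs))
    return total / sum(weights)


plausible = ["take egg", "break egg", "fry egg"]
print(temporal_counterfactual(plausible, ("break egg", "fry egg")))
print(verb_noun_counterfactual(plausible, 2, "fold"))
```

In this sketch the contrastive loss is small when the plausible sequence scores higher than its counterfactual, and the weighted likelihood penalizes errors at distant future steps more heavily, which is one simple way to express "a higher penalty over a longer temporal window".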
The paper reports experiments on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and the results show that PlausiVL improves over existing baseline methods on the action anticipation task.