Abstract: Action quality assessment (AQA) aims to assess how well an action is performed. Previous works model actions using only visual information and ignore audio. We argue that although AQA depends heavily on visual information, audio is useful complementary information for improving score regression accuracy, especially for sports with background music, such as figure skating and rhythmic gymnastics. To leverage multimodal information for AQA, i.e., RGB, optical flow and audio information, we propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information. Our model consists of three modality-specific branches that independently explore modality-specific information and a mixed-modality branch that progressively aggregates the information from the modality-specific branches. To bridge the modality-specific branches and the mixed-modality branch, three novel modules are proposed. First, a Modality-specific Feature Decoder module is designed to selectively transfer modality-specific information to the mixed-modality branch. Second, when exploring the interaction between modality-specific information, we argue that using an invariant multimodal fusion policy may lead to suboptimal results because it does not take into consideration the potential diversity in different parts of an action. Therefore, an Adaptive Fusion Module is proposed to learn adaptive multimodal fusion policies in different parts of an action. This module consists of several FusionNets for exploring different multimodal fusion strategies and a PolicyNet for deciding which FusionNets are enabled. Third, a module called Cross-modal Feature Decoder is designed to transfer cross-modal features generated by the Adaptive Fusion Module to the mixed-modality branch.
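
The Adaptive Fusion Module described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how FusionNets or the PolicyNet are built, so the fusion functions (`fuse_mean`, `fuse_max`), the linear thresholded `policy_net`, and the averaging of enabled outputs are all hypothetical stand-ins chosen only to show the idea of a per-segment policy that decides which fusion strategies are enabled.

```python
from typing import Callable, List, Sequence

Vector = List[float]
FusionNet = Callable[[Sequence[Vector]], Vector]


def fuse_mean(feats: Sequence[Vector]) -> Vector:
    """Hypothetical FusionNet: element-wise mean over the modalities."""
    return [sum(vals) / len(vals) for vals in zip(*feats)]


def fuse_max(feats: Sequence[Vector]) -> Vector:
    """Hypothetical FusionNet: element-wise max over the modalities."""
    return [max(vals) for vals in zip(*feats)]


def policy_net(feats: Sequence[Vector], weights: Sequence[Vector]) -> List[bool]:
    """Hypothetical PolicyNet: one linear score per FusionNet, enabled if > 0.

    If no FusionNet is enabled, the highest-scoring one is switched on so
    that every segment still produces a fused feature (an assumption).
    """
    flat = [x for f in feats for x in f]
    scores = [sum(w * x for w, x in zip(ws, flat)) for ws in weights]
    gates = [s > 0 for s in scores]
    if not any(gates):
        gates[scores.index(max(scores))] = True
    return gates


def adaptive_fusion(
    feats: Sequence[Vector],
    fusion_nets: Sequence[FusionNet],
    weights: Sequence[Vector],
) -> Vector:
    """Fuse one segment: average the outputs of the enabled FusionNets."""
    gates = policy_net(feats, weights)
    outputs = [net(feats) for net, on in zip(fusion_nets, gates) if on]
    return [sum(vals) / len(outputs) for vals in zip(*outputs)]


# Two action segments, each with toy RGB, optical-flow and audio features.
segments = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]],
    [[-1.0, -2.0], [-3.0, -4.0], [-5.0, 0.0]],
]
nets = [fuse_mean, fuse_max]
weights = [[1.0] * 6, [-1.0] * 6]  # illustrative values, not learned

fused = [adaptive_fusion(seg, nets, weights) for seg in segments]
print(fused)
```

With these toy weights the policy enables a different FusionNet for each segment, which is the behaviour an invariant fusion policy cannot express. In a trainable model the hard threshold would need a differentiable relaxation, which this sketch does not attempt.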
