Abstract: Audiovisual data is everywhere in this digital age, which places higher demands on the deep learning models built on it. Handling multi-modal information well is the key to a better audiovisual model. We observe that audiovisual data naturally has temporal attributes, such as the time information of each frame in a video. More concretely, such data is inherently multi-modal, consisting of audio and visual cues that proceed in strict chronological order. This indicates that temporal information is important in multi-modal acoustic event modeling, both within a modality (intra-modal) and across modalities (inter-modal). However, existing methods process the features of each modality independently and simply fuse them together, which neglects the mining of temporal relations and thus leads to sub-optimal performance. With this motivation, we propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac, which models such temporal information via graph learning techniques. In particular, we construct a temporal graph for each acoustic event, dividing its audio data and video data into multiple segments. Each segment is treated as a node, and the temporal relationships between nodes are treated as timestamps on their edges. In this way, we can smoothly capture both intra-modal and inter-modal dynamic information. Several experiments demonstrate that TMac outperforms other state-of-the-art (SOTA) models. Our code is available at https://github.com/MGitHubL/TMac.
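To make the graph construction described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the audio and video streams are cut into the same number of equal-length segments, that consecutive segments within a modality are linked by intra-modal edges, and that the audio and video segments covering the same time span are linked by inter-modal edges. The names `TemporalEdge` and `build_temporal_graph` are hypothetical, as is the choice of a segment's start time as the edge timestamp.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A node is one segment of one modality: (modality, segment index).
Node = Tuple[str, int]


@dataclass(frozen=True)
class TemporalEdge:
    src: Node
    dst: Node
    timestamp: float  # assumption: start time (seconds) of the destination segment
    kind: str         # "intra" (same modality) or "inter" (audio <-> video)


def build_temporal_graph(
    duration: float, num_segments: int
) -> Tuple[List[Node], List[TemporalEdge]]:
    """Build a toy temporal graph for one event of the given duration."""
    if duration <= 0 or num_segments < 1:
        raise ValueError("duration must be positive and num_segments >= 1")

    seg_len = duration / num_segments
    modalities = ("audio", "video")
    nodes: List[Node] = [(m, i) for m in modalities for i in range(num_segments)]

    edges: List[TemporalEdge] = []
    for i in range(num_segments):
        t = i * seg_len
        # Inter-modal edge: audio and video segments covering the same time span.
        edges.append(TemporalEdge(("audio", i), ("video", i), t, "inter"))
        # Intra-modal edges: previous segment -> current segment, per modality.
        if i > 0:
            for m in modalities:
                edges.append(TemporalEdge((m, i - 1), (m, i), t, "intra"))
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = build_temporal_graph(duration=10.0, num_segments=5)
    for e in edges[:4]:
        print(e)
```

Because the edges are generated in chronological order, a downstream temporal graph model could consume them as a time-ordered interaction stream; how TMac itself orders and encodes these edges is defined in the paper and its repository, not by this sketch.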

TIM: A Time Interval Machine for Audio-Visual Action Recognition

MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

When Spatial meets Temporal in Action Recognition

Leveraging Temporal Contextualization for Video Action Recognition

TEINet: Towards an Efficient Architecture for Video Recognition

Video Time: Properties, Encoders and Evaluation

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

TubeR: Tubelet Transformer for Video Action Detection

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Temporal Tessellation: A Unified Approach for Video Analysis

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification

Temporal Modeling Approach for Video Action Recognition Based on Vision-language Models

AIM: Adapting Image Models for Efficient Video Action Recognition

End-to-end Multi-modal Video Temporal Grounding

OSVidCap: A Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario

TAM: Temporal Adaptive Module for Video Recognition

Temporal Action Detection by Joint Identification-Verification

Towards Long-Form Video Understanding

AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts