Audiovisual SlowFast Networks for Video Recognition

Fanyi Xiao,Yong Jae Lee,Kristen Grauman,Jitendra Malik,Christoph Feichtenhofer
DOI: https://doi.org/10.48550/arXiv.2001.08740
2020-03-09
Abstract: We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features. Code will be made available at: https://github.com/facebookresearch/SlowFast.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to fuse audio and visual signals effectively for video understanding. Most existing video analysis models use only the visual signal and ignore audio, even though audio carries important complementary information in many video understanding tasks. In action recognition, for example, audio helps with sound-dominated actions (such as "playing the saxophone"), with actions that are hard to tell apart visually (such as "whistling"), and with distinguishing closely related actions (such as "closing the door" versus "slamming the door").

To meet this challenge, the paper proposes a new architecture named **Audiovisual SlowFast Networks (AVSlowFast)**. The architecture deeply fuses the audio and visual pathways at multiple levels to form a unified audiovisual representation. Specifically, the AVSlowFast network consists of the following parts:

1. **Slow and Fast visual pathways**: The Slow pathway captures static but semantically rich information, and the Fast pathway captures fast-moving information.
2. **Faster Audio pathway**: This pathway operates at a higher sampling rate to capture audio information and is fused with the visual pathways at multiple levels.

Because the audio and visual pathways have different learning dynamics during training, the paper introduces a technique named **DropPathway**: the Audio pathway is randomly dropped during training. This acts as a regularizer that adjusts the learning speed and makes the learning dynamics of the Audio pathway more compatible with those of the visual pathways. In addition, inspired by neuroscience research, the paper proposes a hierarchical audiovisual synchronization technique to learn joint cross-modal features.

With these designs, AVSlowFast achieves state-of-the-art results on six video action classification and detection datasets and demonstrates that it generalizes to self-supervised learning of audiovisual features. In short, the paper aims to build an architecture that integrates audio and visual information effectively and thereby improves performance on video understanding tasks.
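
The DropPathway idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the element-wise sum used as a stand-in for the fusion operation, and the `drop_prob` parameter are all assumptions made for illustration. The summary only states that the Audio pathway is randomly dropped during training and that audio is fused with the visual pathways at multiple levels.

```python
import random


def fuse(visual, audio):
    """Element-wise sum, used here as a hypothetical stand-in for the
    audio-to-visual fusion applied at one level of the network."""
    return [v + a for v, a in zip(visual, audio)]


def forward_with_drop_pathway(visual_feats, audio_feats, drop_prob,
                              training, rng=random):
    """Fuse audio into visual features at every level.

    visual_feats / audio_feats: one feature vector (list of floats) per level.
    drop_prob: assumed probability of dropping the whole Audio pathway
               for the current training iteration.
    Returns (fused features per level, whether audio was dropped).
    """
    # The drop decision is made once per call, so audio is either fused at
    # all levels or at none. Audio is never dropped outside training.
    drop_audio = training and rng.random() < drop_prob
    fused = []
    for v, a in zip(visual_feats, audio_feats):
        fused.append(list(v) if drop_audio else fuse(v, a))
    return fused, drop_audio


# Two fusion levels with toy two-dimensional features.
visual = [[1.0, 2.0], [3.0, 4.0]]
audio = [[0.5, 0.5], [1.0, 1.0]]
fused, dropped = forward_with_drop_pathway(
    visual, audio, drop_prob=0.5, training=True, rng=random.Random(0)
)
```

When `dropped` is true, the visual features pass through unchanged for that iteration, so no audio signal contributes to the fused representation in that step. At inference time (`training=False`) audio is always fused.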