Abstract: As audio-visual systems are deployed for safety-critical tasks such as surveillance and malicious content filtering, their robustness remains under-studied. Existing published work on robustness either does not scale to large-scale datasets or does not deal with multiple modalities. This work studies several key questions about multi-modal learning through the lens of robustness: 1) Are multi-modal models necessarily more robust than uni-modal models? 2) How can the robustness of multi-modal learning be measured efficiently? 3) How should different modalities be fused to achieve a more robust multi-modal model? To understand the robustness of multi-modal models in a large-scale setting, we propose a density-based metric and a convexity metric to efficiently measure the distribution of each modality in high-dimensional latent space. Our work provides theoretical intuition together with empirical evidence showing how multi-modal fusion affects adversarial robustness through these metrics. We further devise a mix-up strategy based on our metrics to improve the robustness of the trained model. Our experiments on AudioSet and Kinetics-Sounds verify our hypothesis that multi-modal models are not necessarily more robust than their uni-modal counterparts in the face of adversarial examples. We also observe that our mix-up training method can achieve as much protection as traditional adversarial training, offering a computationally cheap alternative. Implementation: https://github.com/lijuncheng16/AudioSetDoneRight

Robustness Analysis of Video-Language Models Against Visual and Language Perturbations

Large-scale Robustness Analysis of Video Action Recognition Models

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets

On Robustness to Missing Video for Audiovisual Speech Recognition

Benchmarking Robustness under Distribution Shift of Multimodal Image-Text Models

Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift

Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models

On Adversarial Robustness of Large-scale Audio Visual Learning

MVTamperBench: Evaluating Robustness of Vision-Language Models

Robustness of LLMs to Perturbations in Text

Robustness Analysis on Foundational Segmentation Models

Towards Evaluating the Robustness of Visual State Space Models

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

Using Videos to Evaluate Image Model Robustness

Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

On Robustness in Multimodal Learning

Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

Can 3D Vision-Language Models Truly Understand Natural Language?

Do Vision-Language Foundational models show Robust Visual Perception?

Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning

Robustness Testing of Language Understanding in Task-Oriented Dialog