Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition

Cagri Gungor, Adriana Kovashka
2024-09-15
Abstract: First-person activity recognition is rapidly growing due to the widespread use of wearable cameras but faces challenges from domain shifts across different environments, such as varying objects or background scenes. We propose a multimodal framework that improves domain generalization by integrating motion, audio, and appearance features. Key contributions include analyzing the resilience of audio and motion features to domain shifts, using audio narrations for enhanced audio-text alignment, and applying consistency ratings between audio and visual narrations to optimize the impact of audio in recognition during training. Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper addresses is **domain generalization in first-person activity recognition**. Because environments differ (for example, in their objects or background scenes), the performance of first-person activity recognition models declines significantly when they are applied to new data, so the models perform poorly in unseen scenarios and locations.

### Main contributions of the paper:

1. **Multimodal framework**:
   - Proposes a multimodal framework that integrates motion, audio, and appearance features to improve domain generalization.
   - Shows experimentally that motion and audio features are more robust to domain shifts than appearance features, which underlines their role in domain generalization.
2. **Use of audio narration**:
   - Introduces audio narrations to improve audio-text alignment, which makes the action representation more robust.
   - Computes a consistency score between audio and visual narrations and uses it to adjust the influence of audio on the prediction.
3. **Application of consistency score**:
   - Uses a large language model (LLM) to compute the consistency score between audio and visual narrations and weights the audio embeddings by this score during training, reducing the influence of noisy and irrelevant audio cues.

### Key technical points:

- **Multimodal fusion**: Encodes appearance, motion, and audio features separately and strengthens the feature representation through independent visual-text and audio-text alignment.
- **Consistency weighting**: Adjusts the weight of the audio embeddings with the consistency scores, so that audio that is semantically consistent with the visual content has a greater impact on the final prediction.
- **Contrastive learning**: Uses contrastive loss functions for cross-modal alignment, so that features from different modalities agree at the conceptual level.

### Experimental results:

- Achieves state-of-the-art performance on the ARGO1M dataset and generalizes effectively to unseen scenarios and locations.
- Shows experimentally that motion and audio features lose relatively little performance under domain shift (25.8% and 32.7% respectively), whereas appearance features lose 54.8%.

In conclusion, this paper proposes a novel multimodal framework that significantly improves domain generalization in first-person activity recognition by integrating motion, audio, and appearance features.
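
The consistency weighting and contrastive alignment summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the additive fusion rule, the InfoNCE form of the contrastive loss, and the temperature value of 0.07 are all assumptions made for illustration. The summary states only that audio embeddings are weighted by an LLM-derived consistency score during training and that contrastive losses align modalities with text.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_with_consistency(visual_emb, audio_emb, consistency):
    """Add the audio embedding to the visual one, scaled by a score in [0, 1].

    Assumption: additive fusion. A score near 0 (audio unrelated to the
    visual narration) suppresses the audio contribution; a score near 1
    lets it through unchanged.
    """
    if not 0.0 <= consistency <= 1.0:
        raise ValueError("consistency must lie in [0, 1]")
    return [v + consistency * a for v, a in zip(visual_emb, audio_emb)]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where queries[i] and keys[i] form the positive pair.

    Assumption: a standard one-directional InfoNCE loss over cosine
    similarities; the temperature is a hypothetical default.
    """
    queries = [l2_normalize(q) for q in queries]
    keys = [l2_normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denom = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_denom - logits[i]  # negative log-softmax of the positive
    return total / len(queries)
```

For example, `fuse_with_consistency([1.0, 0.0], [0.0, 1.0], 0.5)` returns `[1.0, 0.5]`, so a half-consistent audio clip contributes half of its embedding. Calling `info_nce(audio_embs, text_embs)` on a batch yields a small loss when each audio embedding is closest to its own narration embedding and a large loss when the pairs are mismatched.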