Abstract: First-person activity recognition is rapidly growing due to the widespread use of wearable cameras but faces challenges from domain shifts across different environments, such as varying objects or background scenes. We propose a multimodal framework that improves domain generalization by integrating motion, audio, and appearance features. Key contributions include analyzing the resilience of audio and motion features to domain shifts, using audio narrations for enhanced audio-text alignment, and applying consistency ratings between audio and visual narrations to optimize the impact of audio in recognition during training. Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is **domain generalization in first-person activity recognition**. Because environments differ (for example, in their objects or background scenes), the performance of first-person activity recognition models declines significantly when they are applied to new data, so the models perform poorly in unseen scenes and locations.

### Main contributions of the paper:

1. **Multimodal framework**:
   - Proposes a multimodal framework that integrates motion, audio, and appearance features to improve domain generalization.
   - Shows experimentally that motion and audio features are more robust to domain shifts than appearance features, which underlines their role in domain generalization.
2. **Use of audio narration**:
   - Introduces audio narrations to enhance audio-text alignment, thereby improving the robustness of the action representation.
   - Computes a consistency score between audio and visual narrations and uses it to adjust the influence of audio on the prediction.
3. **Application of consistency score**:
   - Uses a large language model (LLM) to compute the consistency score between audio and visual narrations, and weights the audio embeddings with these scores during training, reducing the influence of noisy and irrelevant audio cues.

### Key technical points:

- **Multimodal fusion**: Encodes appearance, motion, and audio features separately and strengthens the feature representation through independent visual-text and audio-text alignment.
- **Consistency weighting**: Adjusts the weight of the audio embeddings using the consistency scores, so that audio that is semantically consistent with the visual content has a greater impact on the final prediction.
- **Contrastive learning**: Uses contrastive loss functions for inter-modal alignment, so that features from different modalities agree at the conceptual level.

### Experimental results:

- Achieves state-of-the-art performance on the ARGO1M dataset and generalizes effectively to unseen scenes and locations.
- Shows that motion and audio features lose relatively little performance under domain shift (25.8% and 32.7% respectively), while appearance features lose 54.8%.

In conclusion, this paper proposes a novel multimodal framework that significantly improves domain generalization in first-person activity recognition by integrating motion, audio, and appearance features.
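
The consistency weighting and contrastive alignment listed under the key technical points can be illustrated with a small sketch. This summary does not give the paper's exact formulas, so everything below is an assumption made for illustration: the consistency score is taken to lie in [0, 1], the modalities are fused by simple addition of unit-length embeddings, and the alignment loss is a standard InfoNCE loss. The function names `fuse` and `info_nce` are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(appearance, motion, audio, consistency):
    """Add unit-length modality embeddings, scaling audio by its consistency score.

    Assumptions (not from the source): `consistency` is in [0, 1] and the
    fusion is additive. A score of 0 removes the audio cue entirely.
    """
    if not 0.0 <= consistency <= 1.0:
        raise ValueError("consistency must be in [0, 1]")
    a, m, s = (l2_normalize(v) for v in (appearance, motion, audio))
    return [ai + mi + consistency * si for ai, mi, si in zip(a, m, s)]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss in which queries[i] is expected to match keys[i].

    Similarities are cosine similarities divided by `temperature`; every
    other key in the batch acts as a negative for queries[i].
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(x * y for x, y in zip(qi, kj)) / temperature for kj in k]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy example: the same clip fused with inconsistent and consistent audio.
appearance, motion, audio = [1.0, 0.0], [0.0, 1.0], [3.0, 4.0]
fused_without_audio = fuse(appearance, motion, audio, consistency=0.0)
fused_with_audio = fuse(appearance, motion, audio, consistency=1.0)

# Toy example: embeddings paired with the right text give a lower loss.
text = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], text)
loss_misaligned = info_nce([[0.0, 1.0], [1.0, 0.0]], text)
```

In this sketch the weight is applied at fusion time rather than before a cosine similarity, because scaling an embedding that is later normalized would cancel the weight. Whether the paper applies the score at this point is not stated in the summary.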

Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition
