Abstract: Human Activity Recognition (HAR) systems aim to understand human behaviour and assign a label to each action, attracting significant attention in computer vision due to their wide range of applications. HAR can leverage various data modalities, such as RGB images and video, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, and radar signals. Each modality provides unique and complementary information suited to different application scenarios. Consequently, numerous studies have investigated diverse approaches for HAR using these modalities. This paper presents a comprehensive survey of the latest advancements in HAR from 2014 to 2024, focusing on machine learning (ML) and deep learning (DL) approaches categorized by input data modalities. We review both single-modality and multi-modality techniques, highlighting fusion-based and co-learning frameworks. Additionally, we cover advancements in hand-crafted action features, methods for recognizing human-object interactions, and activity detection. Our survey includes a detailed dataset description for each modality and a summary of the latest HAR systems, offering comparative results on benchmark datasets. Finally, we provide insightful observations and propose effective future research directions in HAR.
### Problems the Paper Attempts to Solve
This paper aims to address several key issues in the field of Human Activity Recognition (HAR):
1. **Comprehensive Methodological Survey of Multimodal Data**:
- The paper conducts a comprehensive methodological survey of various data modalities used in the HAR field from 2014 to 2024, such as RGB images and videos, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, and radar signals.
- The survey covers both unimodal and multimodal techniques, with a focus on fusion-based and co-learning frameworks.
2. **Challenges in Feature Representation and Selection**:
- Traditional feature representation methods (such as handcrafted features) struggle with actions in videos because of background noise, camera movement, and occlusion, among other factors.
- The paper explores how to leverage deep learning to automatically learn features to improve the robustness and accuracy of HAR systems.
3. **Requirements of Different Application Scenarios**:
- Different data modalities suit different application scenarios, such as surveillance, healthcare, and human-computer interaction.
- The paper analyzes the advantages and disadvantages of these modalities in different scenarios and provides detailed descriptions of benchmark datasets and comparisons of the latest system performances.
4. **Proposing Future Research Directions**:
- The paper summarizes existing research results and proposes future research directions, including new feature extraction methods, multimodal fusion techniques, and real-time processing capabilities.
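To make the fusion idea from item 1 concrete, the following is a minimal sketch of decision-level (late) fusion, in which per-modality classifier scores are turned into probabilities and averaged. It is an illustration only and is not taken from the survey: the function names (`softmax`, `late_fusion`, `predict`), the modality names, the labels, and the example scores are all hypothetical.

```python
from math import exp


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(modality_scores, weights=None):
    """Weighted average of per-modality class probabilities.

    modality_scores: dict mapping a modality name to a list of raw class scores.
    weights: optional dict mapping a modality name to a non-negative weight;
             defaults to equal weighting (an assumption of this sketch).
    """
    names = list(modality_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    num_classes = len(modality_scores[names[0]])

    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(modality_scores[name])
        if len(probs) != num_classes:
            raise ValueError("all modalities must score the same classes")
        for i, p in enumerate(probs):
            fused[i] += (weights[name] / total_weight) * p
    return fused


def predict(labels, fused):
    """Return the label with the highest fused probability."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


# Hypothetical scores from two independently trained classifiers.
labels = ["walk", "sit", "wave"]
scores = {
    "rgb": [2.0, 1.0, 0.1],
    "skeleton": [0.5, 2.5, 0.3],
}
fused = late_fusion(scores)
label = predict(labels, fused)
```

Late fusion keeps each modality's model independent, so a missing or noisy sensor can be down-weighted without retraining; feature-level fusion and co-learning, which the survey also covers, combine information earlier in the pipeline.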
### Main Contributions
1. **Comprehensive Multimodal Survey**:
- Conducted an exhaustive survey of HAR methods for RGB, skeleton, sensor, and fusion modalities, focusing on data acquisition, environment, and the evolution of human activity representation.
2. **Detailed Dataset Descriptions**:
- Provided a detailed overview of public benchmark datasets for RGB, skeleton, sensor, and fusion data, highlighting the latest performance accuracies.
3. **Unique Processing Workflow**:
- Covered feature representation methods, commonly used datasets, challenges, and future directions, emphasizing the importance of extracting distinguishable action features from video data under environmental and hardware constraints.
4. **Identification of Research Gaps and Future Directions**:
- Identified significant gaps in current research and proposed future research directions, supported by the latest performance data for each modality.
5. **System Performance Evaluation**:
- Provided benchmark comparisons by analyzing the recognition accuracy of existing HAR systems, offering a reference for future development.
6. **Guidance for Practitioners**:
- Offered practical guidance for developing robust and accurate HAR systems by describing current technologies, highlighting open challenges, and suggesting future research directions.
### Research Questions
The paper primarily answers the following questions:
1. **What are the main difficulties in HAR?**
2. **What challenges does HAR face?**
3. **What are the main algorithms involved in HAR?**
### Organization
The paper is organized as follows:
1. **Introduction**: Introduces the importance and application areas of HAR.
2. **Action Recognition Methods Based on RGB Data Modality**: Describes in detail the action recognition methods based on RGB data.
3. **Action Recognition Methods Based on Skeleton Data Modality**: Describes in detail the action recognition methods based on skeleton data.
4. **Human Activity Recognition Based on Sensor Modality**: Describes in detail the action recognition methods based on sensor data.
5. **Human Activity Recognition with Multimodal Fusion**: Describes in detail the multimodal fusion techniques.
6. **Current Challenges**: Discusses the challenges faced in the HAR field.
7. **Future Research Trends and Directions**: Proposes future research directions.
8. **Conclusion**: Summarizes the main findings and contributions of the paper.
Taken together, these sections give researchers and practitioners in the HAR field a comprehensive reference and practical guidance.