Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model

Till Grutschus, Ola Karrar, Emir Esenov, Ekta Vats
2024-01-30
Abstract: This work explores the performance of a large video understanding foundation model on the downstream task of human fall detection on untrimmed video and leverages a pretrained vision transformer for multi-class action detection, with classes: "Fall", "Lying" and "Other/Activities of daily living (ADL)". A method for temporal action localization that relies on a simple cutup of untrimmed videos is demonstrated. The methodology includes a preprocessing pipeline that converts datasets with timestamp action annotations into labeled datasets of short action clips. Simple and effective clip-sampling strategies are introduced. The effectiveness of the proposed method has been empirically evaluated on the publicly available High-Quality Fall Simulation Dataset (HQFSD). The experimental results validate the performance of the proposed pipeline. The results are promising for real-time application, and the falls are detected on video level with a state-of-the-art 0.96 F1 score on the HQFSD dataset under the given experimental settings. The source code will be made available on GitHub.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the detection of human falls in untrimmed videos. The authors evaluate how well a large video understanding foundation model performs on this downstream task, using a pre-trained vision transformer for multi-class action detection with the classes "Fall", "Lying", and "Other/Activities of daily living (ADL)". They propose a simple temporal action localization method that cuts untrimmed videos into short clips, and they introduce clip-sampling strategies to go with it. The method is evaluated on the publicly available High-Quality Fall Simulation Dataset (HQFSD), where it reaches a state-of-the-art F1 score of 0.96 on video-level fall detection.

### Main Contributions

1. **Improved Fall Detection Performance in Untrimmed Videos**: The paper improves fall detection performance on untrimmed videos from the HQFSD dataset, a challenging benchmark whose results can generalize to practical applications.
2. **Temporal Action Localization Method**: It proposes a temporal action localization method based on a simple cutup of untrimmed videos.
3. **Pre-processing Pipeline**: It introduces a pre-processing pipeline that converts a dataset with timestamp action annotations into a labeled dataset of short action clips.
4. **Exploration of Foundation Model Performance**: It explores the performance of a large video understanding foundation model on the downstream task of human fall detection and demonstrates its superiority over previous specialized architectures.

### Method Overview

1. **Clip Sampling Strategies**:
   - **Cutup Sampling**: Divide the video into equal-length clips using a sliding window.
   - **Gaussian Sampling**: Draw sampling points from a Gaussian distribution and use them as the center points of the clips, so that the clips cover most of the original video.
2. **Labeling Strategies**:
   - Adopt a priority labeling strategy in which the fall action has the highest priority, followed by lying, and finally other activities of daily living.
3. **Action Recognition Model**:
   - Use the pre-trained VideoMAEv2 model to extract features.
   - Classify with a single fully connected layer, trained with the cross-entropy loss.

### Experimental Results

- **Clip-level Classification**: The model reliably classifies clips under both sampling methods, with an average F1 score exceeding 0.9.
- **Video-level Classification**: With the Gaussian sampling strategy, the model outperforms existing methods in both precision and recall, achieving an F1 score of 0.96.

### Conclusion

The paper demonstrates human fall detection on untrimmed video data using a large video understanding foundation model. The experimental results show that the method performs strongly on the HQFSD dataset and has potential for real-time applications. The researchers plan to release the source code and pre-trained model on GitHub for the research community to use.
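
The two sampling strategies and the priority labeling rule from the Method Overview can be illustrated with a minimal sketch. It is not the authors' implementation: the `Annotation` type, the function names, the stride default, and the Gaussian parameters (mean at the video midpoint, standard deviation given as a fraction of the duration via the hypothetical `std_frac`) are all assumptions made for illustration, since the summary does not specify them.

```python
import random
from dataclasses import dataclass

# Priority order taken from the summary: Fall > Lying > Other (ADL).
PRIORITY = ["Fall", "Lying", "Other"]


@dataclass
class Annotation:
    """Hypothetical timestamp annotation: `label` is active on [start, end) seconds."""
    start: float
    end: float
    label: str


def cutup_clips(duration, clip_len, stride=None):
    """Cutup sampling: equal-length sliding-window clips as (start, end) pairs.

    Assumption: the stride defaults to the clip length (non-overlapping windows).
    """
    stride = clip_len if stride is None else stride
    if duration < clip_len:
        return []
    n_clips = int((duration - clip_len) // stride) + 1
    return [(i * stride, i * stride + clip_len) for i in range(n_clips)]


def gaussian_clips(duration, clip_len, n_clips, std_frac=0.25, seed=None):
    """Gaussian sampling: clip centers drawn from a normal distribution.

    Assumptions: the mean is the video midpoint, the standard deviation is
    std_frac * duration, and centers are clamped so each clip stays inside
    the video. Requires duration >= clip_len.
    """
    rng = random.Random(seed)
    half = clip_len / 2
    clips = []
    for _ in range(n_clips):
        center = rng.gauss(duration / 2, std_frac * duration)
        center = min(max(center, half), duration - half)
        clips.append((center - half, center + half))
    return clips


def label_clip(clip, annotations):
    """Priority labeling: return the highest-priority label overlapping the clip."""
    start, end = clip
    present = {a.label for a in annotations if a.start < end and a.end > start}
    for label in PRIORITY:
        if label in present:
            return label
    return "Other"  # Assumption: unannotated time counts as Other/ADL.


# Example: a 30-second video with a fall between seconds 12 and 14.
annotations = [
    Annotation(0, 12, "Other"),
    Annotation(12, 14, "Fall"),
    Annotation(14, 30, "Lying"),
]
clips = cutup_clips(duration=30, clip_len=10)
labels = [label_clip(c, annotations) for c in clips]
```

In this example the three cutup clips are labeled `Other`, `Fall`, and `Lying`: the middle clip overlaps all three annotations, and the priority rule resolves it to `Fall`. Passing a smaller `stride` to `cutup_clips` would produce overlapping windows instead.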