Abstract: In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment and response to potential risks, ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions for each individual's specific behavior. However, mere descriptions at the video-level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail on a person-level basis, achieving state-of-the-art results. To facilitate further research in this field, we intend to release our dataset and code.
What problem does this paper attempt to address?
This paper addresses a gap in video surveillance: describing the behavior of each individual in a video, particularly in complex scenes with multiple people. Existing video-level captioning datasets do not provide fine-grained descriptions of each individual's behavior, so a description cannot be tied to a specific person, which hinders accurate assessment of and response to potential risks. To address this, the paper constructs a human-centric video surveillance captioning dataset (UCF-Crime captioning dataset, UCCD), which provides detailed descriptions of the dynamic behaviors of 7,820 individuals across 1,012 videos. The paper also proposes a new video captioning method that describes behavior at the person level and achieves state-of-the-art results.
### Main contributions of the paper:
1. **Constructed a human-centric video surveillance captioning dataset**: The dataset includes 1,012 videos and describes the behaviors of 7,820 individuals. For each individual, the location, clothing, and interactions with other scene elements are labeled, which supports a richer understanding of human interactions in complex scenarios.
2. **Proposed the video surveillance captioning task for the first time**: This task aims to understand human behaviors in videos and generate descriptions of human behaviors, opening up a new research direction in the field of video surveillance.
3. **Proposed a new video captioning method**: The method is based on the Deformable Transformer. It extracts frame features and human features, then generates a behavior description for each individual in the video through a localization head and a caption head, achieving state-of-the-art results.
### Characteristics of the dataset:
- **Data source**: Unlike most captioning datasets, which are collected from YouTube, the UCF-Crime captioning dataset is derived from real-world video surveillance.
- **Video integrity**: The dataset provides a wide range of video surveillance scenarios, details the behaviors of individuals in the video, and has high scene and temporal continuity.
- **Behavior complexity**: The dataset contains not only descriptions of normal scenarios but also descriptions of various abnormal scenarios. These abnormal scenarios involve more complex human behaviors, increasing the richness and challenge of the dataset.
- **Labeling difficulty**: Only the bounding box of each individual's first appearance is provided, so the labeler must track the individual through the rest of the video and describe their interactions with others. This increases the difficulty of labeling but also improves the quality of the dataset.
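
To make the person-level labeling concrete, here is a minimal Python sketch of how one annotation record might be represented and grouped by video. The summary above does not give the dataset's file format, so the class name `PersonAnnotation`, its field names, the `(x, y, w, h)` box convention, and the sample captions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PersonAnnotation:
    """One labeled individual (hypothetical schema, not the dataset's actual format)."""
    video_id: str
    person_id: int
    # Bounding box of the person's first appearance; (x, y, w, h) is an assumed convention.
    first_bbox: Tuple[float, float, float, float]
    # Free-text description covering location, clothing, and interactions.
    caption: str


def group_by_video(annotations: List[PersonAnnotation]) -> Dict[str, List[PersonAnnotation]]:
    """Collect the per-person annotations that belong to each video."""
    grouped: Dict[str, List[PersonAnnotation]] = defaultdict(list)
    for ann in annotations:
        grouped[ann.video_id].append(ann)
    return dict(grouped)


# Made-up records for illustration only.
SAMPLE = [
    PersonAnnotation("video_001", 1, (34.0, 50.0, 40.0, 110.0),
                     "A man in a black jacket enters from the left and talks to the cashier."),
    PersonAnnotation("video_001", 2, (200.0, 60.0, 38.0, 105.0),
                     "A cashier in a red shirt stands behind the counter."),
    PersonAnnotation("video_002", 1, (10.0, 20.0, 30.0, 90.0),
                     "A woman in a white coat walks across the parking lot."),
]

if __name__ == "__main__":
    for video_id, people in group_by_video(SAMPLE).items():
        print(video_id, [p.person_id for p in people])
```

Keying each caption to a `(video_id, person_id)` pair is what allows a description to be linked to one individual rather than to the video as a whole.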
### Method overview:
The method proposed in the paper is mainly divided into two parts:
1. **Feature encoding**: Frames are sampled from the video, and frame-level features are extracted with a pre-trained visual model and then processed by the deformable encoder. In parallel, YOLOv7 combined with StrongSORT detects and tracks individuals so that features can be extracted for each person.
2. **Decoding**: The features of individuals in the video are combined with the frame features and fed into the decoder. The decoder outputs query features, which are passed to the localization head and the caption head to generate a caption for each individual. The loss function has two terms: a localization loss, which supervises the predicted times at which each individual appears and disappears, and a caption loss, which compares the generated captions with the ground-truth captions.
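
The two-term loss described above can be sketched in plain Python. This is only an illustration under stated assumptions: the summary does not specify the exact form of either term, so the L1 distance on (appearance, disappearance) times, the token-level negative log-likelihood for captions, and the weights `w_loc` and `w_cap` are placeholder choices, not the paper's formulation.

```python
import math
from typing import Sequence, Tuple


def localization_loss(pred_span: Tuple[float, float], true_span: Tuple[float, float]) -> float:
    """L1 distance between predicted and ground-truth (appear, disappear) times.

    Assumed form; times are taken to be normalized to [0, 1].
    """
    return abs(pred_span[0] - true_span[0]) + abs(pred_span[1] - true_span[1])


def caption_loss(token_probs: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the reference caption tokens.

    token_probs[i] is the predicted distribution over the vocabulary at step i.
    """
    if not target_ids or len(token_probs) != len(target_ids):
        raise ValueError("token_probs and target_ids must be non-empty and equal in length")
    nll = [-math.log(probs[t]) for probs, t in zip(token_probs, target_ids)]
    return sum(nll) / len(nll)


def total_loss(
    pred_span: Tuple[float, float],
    true_span: Tuple[float, float],
    token_probs: Sequence[Sequence[float]],
    target_ids: Sequence[int],
    w_loc: float = 1.0,  # hypothetical weight on the localization term
    w_cap: float = 1.0,  # hypothetical weight on the caption term
) -> float:
    """Weighted sum of the localization and caption terms for one individual."""
    return (w_loc * localization_loss(pred_span, true_span)
            + w_cap * caption_loss(token_probs, target_ids))


if __name__ == "__main__":
    # Toy two-token caption over a two-word vocabulary.
    probs = [[0.5, 0.5], [0.25, 0.75]]
    print(total_loss((0.1, 0.5), (0.2, 0.7), probs, [0, 1]))
```

In a real training setup each term would be computed per decoder query and averaged over a batch; the sketch handles a single individual to keep the structure of the loss visible.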
Through these innovations, the paper provides a new basis and direction for behavior analysis and anomaly detection in the field of video surveillance.