Abstract: With the development of modern life, social activities of all kinds have become more frequent. These activities are often attended by large numbers of people, which places high demands on effective management and on ensuring the safety of those involved. As an effective auxiliary measure, crowd understanding and analysis has attracted the attention of more and more researchers. The basic idea is to extract key information from video sequences or images and to use digital image processing technology to study and analyse the behaviour characteristics and patterns of people in the region of interest. It currently has a wide range of applications in fields such as the economy and public security. This Special Issue aims to introduce the latest work in crowd understanding and analysis and to present new theories and approaches for solving existing problems. The Special Issue received a number of submissions from researchers in the field, and 27 papers were accepted after careful peer review and revision. The accepted papers can be broadly divided into five sections according to the types of tasks: Section 1: Crowd behaviour detection and recognition, Section 2: Crowd counting, Section 3: Object detection and recognition, Section 4: Object tracking, and Section 5: Other tasks. “Human behaviour recognition with mid-level representations for crowd understanding and analysis” by Sun et al. uses mid-level semantic concepts to represent human actions from videos and argues that these semantic attributes enable the construction of more descriptive methods for human action recognition. The idea is verified on three challenging datasets, and the experimental results demonstrate that the method achieves better results than the baseline methods on human action recognition. “Class structure-aware adversarial loss for cross-domain human action recognition” by Chen et al. 
proposes a class structure-aware adversarial loss, which aims to address the issue that existing adversarial approaches ignore the underlying coherence of class structure across domains. This paper incorporates category information into the adversarial learning branch to capture the fine-grained alignment of each class, effectively avoiding the false mix-up of samples from different categories in the embedding space. Experiments show significant improvement compared to the baseline. “Dual-view 3D human pose estimation without camera parameters for action recognition” by Liu et al. proposes a dual-view single-person 3D pose estimation method without camera parameters. This method first uses a 2D pose estimation network to estimate the 2D joint point coordinates from two images with different views, and then inputs them into a 3D regression network to generate the final 3D joint point coordinates. Experiments show that this method is effective. “Deep social force network for anomaly event detection” by Yang et al. develops a deep social force network by exploiting both social force extraction and deep motion coding. This network can discover the interaction force of particles to learn the deep social force features. The experiments on the UCF-Crime and ShanghaiTech datasets demonstrate that this method can predict the temporal localization of anomaly events and outperforms the state-of-the-art methods. “Anomaly detection in video sequences: a benchmark and computational model” by Wan et al. contributes a new Large-scale Anomaly Detection (LAD) dataset as a benchmark for anomaly detection in video sequences. This paper formulates anomaly detection as a fully supervised learning problem and proposes a multi-task deep neural network to solve it. Experimental results show that the proposed method outperforms state-of-the-art anomaly detection competitors on the proposed dataset and other public datasets. 
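As a rough illustration of the interaction force that social-force-based methods such as the one by Yang et al. build on, the sketch below rearranges a Helbing-style equation of motion, m dv/dt = (v_desired - v)/tau + F_int, to estimate F_int for a single particle. This is not the deep social force network itself; the function names, the unit mass, and the values of `tau` and `dt` are assumptions chosen only for the example.

```python
from typing import Tuple

Vec = Tuple[float, float]


def interaction_force(v_prev: Vec, v_curr: Vec, v_desired: Vec,
                      dt: float = 1.0, tau: float = 0.5,
                      mass: float = 1.0) -> Vec:
    """Estimate F_int from m * dv/dt = (v_desired - v) / tau + F_int.

    v_prev, v_curr: particle velocity at two consecutive time steps.
    v_desired: velocity the particle would adopt without interaction.
    dt, tau, mass: illustrative constants (assumptions, not from the paper).
    """
    # Finite-difference acceleration of the particle.
    accel = ((v_curr[0] - v_prev[0]) / dt, (v_curr[1] - v_prev[1]) / dt)
    # Personal-desire term: relaxation towards the desired velocity.
    personal = ((v_desired[0] - v_curr[0]) / tau,
                (v_desired[1] - v_curr[1]) / tau)
    # Whatever the personal term cannot explain is attributed to interaction.
    return (mass * accel[0] - personal[0], mass * accel[1] - personal[1])


def magnitude(force: Vec) -> float:
    """Euclidean norm of a 2D force vector."""
    return (force[0] ** 2 + force[1] ** 2) ** 0.5


if __name__ == "__main__":
    # A particle that stops abruptly although it wants to keep moving.
    f = interaction_force(v_prev=(1.0, 0.0), v_curr=(0.0, 0.0),
                          v_desired=(1.0, 0.0))
    print(f, magnitude(f))
```

A particle that already moves at its desired velocity yields zero interaction force, whereas a particle forced to stop yields a large one; large force magnitudes are the kind of cue that social-force-based anomaly detectors typically look for.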
“Behaviour detection in crowded classroom scenes via enhancing features robust to scale and perspective variations” by Liu et al. proposes two modules to tackle the large variations of humans in scale and pose perspective, namely a Scale Attention Aggregation module and an RoI Spatial Transformation module. In addition, this paper constructs a new classroom human behaviour detection dataset with 1500 images based on a third-person perspective. The proposed method is verified on this dataset, and the experimental results demonstrate its effectiveness with better mAP values. “Crowd activity recognition in live video streaming via 3D-ResNet and region graph convolution network” by Kang et al. presents a crowd activity recognition method to identify and supervise crowd activity in live videos. The method utilizes 3D-ResNet and ReGCN to extract deep spatiotemporal features and the correlation or external knowledge between crowd content, respectively. In addition, the authors construct a real-world video dataset called BJUT-CAD that includes eight kinds of crowd activity videos collected from live video websites. Experiments on the BJUT-CAD and CAE datasets verify the effectiveness of this work. “Latent label mining for group activity recognition in basketball videos” by Wu et al. proposes a latent label mining strategy for group activity recognition in basketball videos. This paper aims at mining the latent labels of motion patterns from the frames and further combining two levels of supervision signal to obtain an effective spatio-temporal representation. Experimental results demonstrate that the proposed algorithm achieves state-of-the-art performance. “A deep learning method for video-based action recognition” by Zhang et al. aims to address a key issue: how to convert spatial and temporal information into an effective representation to infer actions. 
This paper employs boundary compensation on the basis of a deep neural network to generate action proposals. Based on the resultant action proposals, a two-stream network with a spatio-temporal structure is adopted for the action recognition task. The experimental results show that the proposed method is competitive with the state-of-the-art methods. “MSR-FAN: Multi-scale residual feature-aware network for crowd counting” by Zhao et al. proposes a framework that combines multi-scale features using multiple receptive field sizes and learns the feature-aware information on each image. This method effectively alleviates perspective distortion and the varying scales in congested scene images, which helps the algorithm count crowds correctly. Experimental results on benchmark datasets indicate that the proposed approach outperforms the existing competitors. “MFP-Net: Multi-scale feature pyramid network for crowd counting” by Lei et al. introduces a feature pyramid fusion module and a feature attention-aware module. The two modules extract different levels of fine-grained information and local and global context information, respectively. This enhances the correlation of different features and effectively improves robustness to background noise. Experiments show that the proposed method not only provides better crowd counting results than comparative models, but also requires fewer parameters. “Multi-level features extraction network with gating mechanism for crowd counting” by Zeng et al. designs a novel crowd counting model that integrates information from multiple levels, such as appearance, scale, and context. To avoid interference from confusing information, a simple and effective multi-channel gated unit is proposed to adaptively select features at different levels of the network. Extensive experiments and evaluations show that the approach is superior. “Learn from object counting: crowd counting with meta-learning” by Zan et al. 
develops an efficient algorithm to extract meta-information by utilizing object counting data in few-shot scenes. This method successfully explores the shared information between the crowd counting task and the object counting task, thus improving the performance and convergence rate of the crowd counting task. Comprehensive experiments on two datasets verify the effectiveness of the proposed approach. “Crowd estimation using key-point matching with support vector regression” by Ekanayake et al. proposes a novel key-point-based moving object detection method for noisy backgrounds. This work mainly uses the key-point matching of continuous frames and optical flow density to identify moving objects in video sequences. The moving blobs are then generated via morphological operations. The method is compared experimentally with recent regression-based and CNN methods and is shown to be superior. “A novel face recognition method based on fusion of LBP and HOG” by Chen et al. proposes an improved fusion local feature extraction algorithm called CS-NWALBP+HOG. This study not only reduces the noise sensitivity of the LBP operator, but also reduces the original computational complexity and strengthens the description ability for image gradient direction information. Several experiments demonstrate that the designed algorithm shows more robust performance under complex illumination conditions. “Multi-view intrinsic low-rank representation for robust face recognition and clustering” by Shen et al. considers the problem that most existing methods ignore the specific local structure of different views. To address this issue, the paper proposes a multi-view low-rank representation method which exploits both intrinsic relationships and specific local structures of different views simultaneously. Experiments on several datasets demonstrate the effectiveness of this method in classification and clustering. 
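For readers unfamiliar with the LBP operator that Chen et al. improve upon, the following is a minimal sketch of the basic 3x3 local binary pattern and its histogram descriptor. It is the textbook operator only, not the CS-NWALBP variant or its fusion with HOG; the function names and the clockwise bit ordering are conventions assumed here for illustration.

```python
from typing import List

Image = List[List[int]]

# Neighbour offsets of a 3x3 patch, clockwise from the top-left corner.
# The starting point and bit order are a convention assumed for this sketch.
_OFFSETS = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def lbp_code(patch: Image) -> int:
    """Return the 8-bit LBP code of the centre pixel of a 3x3 patch."""
    center = patch[1][1]
    code = 0
    for bit, (r, c) in enumerate(_OFFSETS):
        # A neighbour at least as bright as the centre sets its bit.
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image: Image) -> List[int]:
    """Histogram of LBP codes over all interior pixels of a grey image."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist


if __name__ == "__main__":
    image = [
        [10, 10, 10, 10],
        [10, 50, 50, 10],
        [10, 50, 50, 10],
        [10, 10, 10, 10],
    ]
    hist = lbp_histogram(image)
    print([(code, n) for code, n in enumerate(hist) if n])
```

Because each code depends only on whether a neighbour is brighter than the centre, the histogram is unchanged by monotonic brightness shifts. The single-pixel comparisons are also what make the basic operator sensitive to noise.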
“Multi-dimensional weighted cross-attention network in crowded scenes” by Xie et al. proposes an end-to-end anchor-free network, namely the Multidimensional Weighted Cross-Attention Network, which can perform real-time human detection in crowded scenes. The designed model does not require manual intervention and reduces the sizeable computational resource cost caused by anchor box mapping during the training process. Experiments reveal that the improved strategy achieves state-of-the-art results among anchor-free methods. “Part-level attention networks for cross-domain person re-identification” by Zhao et al. uses the diversified spatial semantic feature in pixel-level learning in the target domain to improve the generality and adaptability of the model. Combined with partial branches, this method shows that adding an attention cascade module to the backbone network is effective. Experiments indicate that the proposed approach has better recognition ability and robustness in the cross-domain setting. “MFNet-LE: Multilevel fusion network with Laplacian embedding for face presentation attacks detection” by Niu et al. proposes a face presentation attack detection method that incorporates a multilevel fusion structure and a Laplacian loss into shallow CNNs. This allows the proposed model to learn more discriminative features under the joint supervision of softmax and Laplacian loss, which improves the detection ability. Experiments demonstrate the effectiveness of the proposed method. “Real-time automatic helmet detection of motorcyclists in urban traffic using improved YOLOv5 detector” by Jia et al. presents an automatic helmet detection method comprising two steps. This work first utilizes an improved YOLOv5 detector to detect motorcycles. The detected motorcycles are then used as input, and the improved YOLOv5 detector is applied again to detect whether the rider is wearing a helmet. 
The proposed method is evaluated on a constructed dataset, and the results show that it is superior to other detection methods. “Multi-label learning based target detecting from multi-frame data” by Mei et al. treats target detection from time series data as a multi-label problem when designing the model. In this method, a background subtraction tracker, based on Gaussian mixture model background subtraction and the integral image, is presented to track slightly moving objects in videos. Experimental results show that the proposed method attains better performance. “Contrastive learning of graph encoder for accelerating pedestrian trajectory prediction training” by Yao et al. proposes a graph contrastive accelerating encoder, which accelerates the pedestrian trajectory prediction training process of spatiotemporal graph transformer networks. With this method, the pedestrian trajectory prediction error reaches its lowest value at clearly earlier training steps, and the final performance reaches the state-of-the-art level. “Multiple object tracking based on multi-task learning with strip attention” by Song et al. argues that it is difficult to strike a balance between accuracy and efficiency when embedding the re-identification (re-ID) model into the target tracking task. To enhance the overall tracking performance, a one-shot multiple object tracking method is proposed based on multi-task learning, which contains two homogeneous branches for object detection and re-ID. Fine-grained feature extraction for pedestrian recognition improves the overall tracking framework in both speed and robustness. The experiments show that the proposed method attains superior performance on more evaluation metrics. “Cross-modal semantic correlation learning by Bi-CNN network” by Wang et al. presents a novel cross-modal retrieval framework, which integrates feature learning and latent space embedding. It aims to generate specific representations consistent with cross-modal tasks. 
It helps to reduce the differences in the distribution of categories in different modalities. Experiments on three real-world datasets show that the proposed work is superior to the popular methods. “Adaptive colour restoration and detail retention for image enhancement” by He et al. aims to overcome the problem of colour distortion caused by low illumination and fog. To address this issue, the paper develops a multi-channel fusion-based adaptive image colour restoration method. To generate human-consistent observations, a detail retention-based method is applied to enhance the details. Experiments demonstrate that the results are effective and outperform the compared methods in both visual and objective evaluations. “Image encryption algorithm for crowd data based on a new hyperchaotic system and Bernstein polynomial” by Jiang et al. designs a new two-dimensional chaotic system with hyperchaotic behaviour based on the Chebyshev system and the infinite collapse system. To protect crowd image data, an image cryptosystem combining the SVD and the Bernstein polynomial is proposed. Security analyses indicate that this method has higher encryption efficiency and that the visual quality of the steganography image can reach 39 dB. “CA-PMG: Channel attention and progressive multi-granularity training network for fine-grained visual classification” by Zhao et al. designs a framework for visual classification under subtle intra-class object variations. This model can be trained efficiently in an end-to-end manner without bounding box or part annotations. Extensive experiments on three challenging fine-grained datasets demonstrate that the approach obtains state-of-the-art performance. The papers selected in this Special Issue highlight the extensive study of crowd understanding and analysis. It is hoped that these papers can play a role in promoting theoretical research. 
Meanwhile, many challenges in this field need further study, such as the robustness of algorithms across scenarios and the interaction between groups and individuals in different scenarios. In addition, key problems such as the running time of the algorithms also need to be considered. Work on these problems will be helpful when the research is applied in the real world.
Lead Guest Editor: Professor Qi Wang, Northwestern Polytechnical University, Xi'an, China
Guest Editors: Assistant Professor Bo Liu, Auburn University, USA; Dr. Jianzhe Lin, University of British Columbia, Canada
