Multi-temporal dependency handling in video smoke recognition: A holistic approach spanning spatial, short-term, and long-term perspectives
Feng Yang, Qifan Xue, Yichao Cao, Xuanpeng Li, Weigong Zhang, Guangyu Li
DOI: https://doi.org/10.1016/j.eswa.2023.123081
IF: 8.5
2024-01-06
Expert Systems with Applications
Abstract: Accurately recognizing smoke in video remains a challenging task because of the special characteristics of smoke, such as its non-rigid morphology, its semi-transparent appearance, and serious interference from non-smoke objects such as white steam and clouds. In this study, we present the Spatio-Temporal Local-Enhanced Network (STENet), which is designed to handle the spatio-temporal dependencies of smoke video in multi-temporal views, synthesizing spatial, short-term, and long-term smoke features. To account for the semi-translucent nature of smoke, we introduce a short-term temporal branch within STENet that models the transient temporal dependencies between adjacent frames, thereby distinguishing smoke from visually similar interferences such as clouds and water steam. Furthermore, the spatial and short-term temporal features, which capture the instantaneous spatio-temporal characteristics of a single smoke plume, are aligned at a fine-grained level and guide each other through our Collaborative Alignment (CoAM). Moreover, to address the drifting characteristics and dynamic variations of non-rigid objects across different temporal scales, we propose a new Prioritized Local-enhanced Transformer (PLT) that selectively focuses on key frame subsets to capture long-term temporal dependencies. Experiments on benchmarks demonstrate that the proposed STENet achieves state-of-the-art results on the large-scale video dataset for recognizing industrial smoke emissions (RISE) and on forest fire surveillance video (FSmoke). In particular, STENet achieves an F-score of 0.893 on the RISE dataset with 6.9M parameters and 3.86G FLOPs, making it 4× to 18× smaller and requiring 5× to 51× fewer FLOPs than previous baseline models.
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science