Abstract: Pooling methods are essential in modern neural networks for enlarging receptive fields and lowering computational costs. However, commonly used hand-crafted pooling approaches, e.g., max pooling and average pooling, may not preserve discriminative features well. While many researchers have designed pooling variants in the spatial domain to address these limitations, with much progress, the temporal aspect is rarely explored, and directly applying hand-crafted methods or these specialized spatial variants there may not be optimal. In this paper, we derive temporal lift pooling (TLP) from the Lifting Scheme in signal processing to intelligently downsample features of different temporal hierarchies. The Lifting Scheme factorizes input signals into sub-bands of different frequencies, which can be viewed as different temporal movement patterns. Our TLP is a three-stage procedure that performs signal decomposition, component weighting and information fusion to generate a refined downsized feature map. We select a typical temporal task with long sequences, i.e., continuous sign language recognition (CSLR), as our testbed to verify the effectiveness of TLP. Experiments on two large-scale datasets show that TLP outperforms hand-crafted methods and specialized spatial variants by a large margin (1.5%) with similar computational overhead. As a robust feature extractor, TLP generalizes well across multiple backbones and datasets, and achieves new state-of-the-art results on two large-scale CSLR datasets. Visualizations further demonstrate the mechanism of TLP in correcting gloss borders. Code is released.
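The three stages named in the abstract (signal decomposition, component weighting, information fusion) can be illustrated with a minimal sketch of one lifting step on a 1-D temporal signal. This is not the released TLP implementation: the abstract does not specify the predictor, updater, or weighting, so the fixed Haar-style predictor and updater and the scalar weights `w_approx` and `w_detail` below are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple


def lifting_step(signal: List[float]) -> Tuple[List[float], List[float]]:
    """Decompose a signal into a low-frequency and a high-frequency sub-band.

    Assumption: fixed Haar-style predictor and updater, used here only
    to make the split/predict/update structure of lifting concrete.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    even = signal[0::2]  # split: samples at even time steps
    odd = signal[1::2]   # split: samples at odd time steps
    # Predict: estimate each odd sample from its even neighbour and keep
    # the residual as the detail (high-frequency) component.
    detail = [o - e for o, e in zip(odd, even)]
    # Update: correct the even samples with the detail so that the
    # approximation (low-frequency) component holds the pairwise mean.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail


def inverse_lifting_step(approx: List[float], detail: List[float]) -> List[float]:
    """Undo lifting_step exactly, showing that the decomposition is lossless."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    signal: List[float] = []
    for e, o in zip(even, odd):
        signal.extend([e, o])
    return signal


def lift_pool(signal: List[float], w_approx: float = 1.0,
              w_detail: float = 0.0) -> List[float]:
    """Halve the temporal length by weighting and fusing the two sub-bands.

    Assumption: scalar weights and additive fusion stand in for whatever
    weighting and fusion the actual method uses.
    """
    approx, detail = lifting_step(signal)
    return [w_approx * a + w_detail * d for a, d in zip(approx, detail)]


if __name__ == "__main__":
    frames = [1.0, 3.0, 2.0, 6.0]
    print(lifting_step(frames))
    print(lift_pool(frames))
    print(lift_pool(frames, w_detail=0.5))
```

With the default weights this sketch reduces to stride-2 average pooling, since the approximation component holds the mean of each pair. Changing the weights shifts the output toward the high-frequency component, which is the degree of freedom a fixed hand-crafted pooling operator lacks.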

Temporal Lift Pooling for Continuous Sign Language Recognition
