Abstract: Pooling methods are essential to modern neural networks for enlarging receptive fields and reducing computational cost. However, commonly used hand-crafted pooling approaches, e.g., max pooling and average pooling, may not preserve discriminative features well. While researchers have designed many pooling variants in the spatial domain to address these limitations, the temporal domain has rarely been explored, and directly applying hand-crafted methods or these specialized spatial variants there may not be optimal. In this paper, we derive temporal lift pooling (TLP) from the Lifting Scheme in signal processing to intelligently downsample features of different temporal hierarchies. The Lifting Scheme factorizes input signals into sub-bands of different frequencies, which can be viewed as different temporal movement patterns. Our TLP is a three-stage procedure that performs signal decomposition, component weighting, and information fusion to generate a refined, downsized feature map. We select a typical temporal task with long sequences, i.e., continuous sign language recognition (CSLR), as our testbed to verify the effectiveness of TLP. Experiments on two large-scale datasets show that TLP outperforms hand-crafted methods and specialized spatial variants by a large margin (1.5%) with similar computational overhead. As a robust feature extractor, TLP generalizes well across multiple backbones on various datasets and achieves new state-of-the-art results on two large-scale CSLR datasets. Visualizations further demonstrate the mechanism of TLP in correcting gloss borders. Code is released.
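To make the three stages concrete, the following is a minimal sketch of the classical Lifting Scheme applied to a 1-D temporal sequence, followed by a weighted fusion of the resulting sub-bands. It is not the authors' implementation: the fixed Haar-style predictor and updater, the scalar weights `w_approx` and `w_detail`, and all function names are assumptions chosen for illustration, standing in for whatever learned modules TLP actually uses.

```python
def lifting_split(x):
    """Split a sequence into its even- and odd-indexed samples."""
    if len(x) % 2 != 0:
        raise ValueError("sequence length must be even")
    return x[0::2], x[1::2]


def haar_lift(x):
    """Decompose x into a low-frequency and a high-frequency sub-band.

    Predictor and updater are fixed Haar-style operators (an assumption;
    a learned variant would replace them with trainable layers).
    """
    even, odd = lifting_split(x)
    # Predict: estimate each odd sample from its even neighbour;
    # the residual is the high-frequency (detail) component.
    detail = [o - e for e, o in zip(even, odd)]
    # Update: correct the even samples with the detail so that the
    # low-frequency (approximation) component keeps the local mean.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail


def haar_unlift(approx, detail):
    """Invert haar_lift, showing that the decomposition loses no information."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    x = []
    for e, o in zip(even, odd):
        x.extend([e, o])
    return x


def lift_pool(x, w_approx=1.0, w_detail=0.0):
    """Halve the temporal length by weighting and fusing the two sub-bands.

    w_approx and w_detail are hypothetical scalar weights standing in for
    the component-weighting stage.
    """
    approx, detail = haar_lift(x)
    return [w_approx * a + w_detail * d for a, d in zip(approx, detail)]


if __name__ == "__main__":
    seq = [1.0, 3.0, 2.0, 6.0]
    print(haar_lift(seq))
    print(lift_pool(seq))
    print(lift_pool(seq, w_detail=0.5))
```

With the default weights this sketch reduces to average pooling with stride 2, because the Haar approximation band is the pairwise mean. Changing `w_detail` mixes high-frequency content back into the downsampled output, which is the kind of freedom a fixed hand-crafted pooling operator lacks.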

Temporal feature extraction based on CNN-BLSTM and temporal pooling for language identification

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

CNN-Based End-To-End Language Identification

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Temporal Lift Pooling for Continuous Sign Language Recognition

Deep temporal representation learning for language identification

Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech

Language Identification Based on Convolutional Neural Network

Phonetic Temporal Neural Model for Language Identification

End-to-End Language Identification Using High-Order Utterance Representation with Bilinear Pooling

A Self-Supervised Model for Language Identification Integrating Phonological Knowledge

Language identification based on multi-scale and multi-dimensional convolution

An Improved LSTM for Language Identification

Insights into End-to-End Learning Scheme for Language Identification

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification

Acoustic scene classification using multi-layer temporal pooling based on convolutional neural network

Deep CNNs along the Time Axis with Intermap Pooling for Robustness to Spectral Variations

Dynamic TF-TDNN: Dynamic Time Delay Neural Network Based on Temporal-Frequency Attention for Dialect Recognition

Language Identification with Deep Bottleneck Features

End-to-end Oriental Language Speech Recognition with Integrated Language Identification

Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling