Abstract: Online action detection (OAD) aims to identify ongoing actions from streaming video in real time, without access to future frames. Since these actions manifest at varying scales of granularity, ranging from coarse to fine, projecting an entire set of action frames to a single latent encoding may result in a lack of local information, necessitating the acquisition of action features across multiple scales. In this paper, we propose a multi-scale action learning transformer (MALT), which includes a novel recurrent decoder (used for feature fusion) that has fewer parameters and can be trained more efficiently. A hierarchical encoder with multiple encoding branches is further proposed to capture multi-scale action features. The output from the preceding branch is then incrementally input to the subsequent branch as part of a cross-attention calculation. In this way, output features transition from coarse to fine as the branches deepen. We also introduce an explicit frame scoring mechanism employing sparse attention, which filters irrelevant frames more efficiently, without requiring an additional network. The proposed method achieved state-of-the-art performance on two benchmark datasets (THUMOS'14 and TVSeries), outperforming all existing models used for comparison, with an mAP improvement of 0.2% on THUMOS'14 and an mcAP improvement of 0.1% on TVSeries.
What problem does this paper attempt to address?
This paper addresses two key challenges in online action detection (OAD):
1. **Multi-scale action feature capture**: In real-world scenarios, actions appear at different levels of granularity, from coarse overall movements to fine-grained components. For example, in a nine-ball game, a shot consists of several parts: the preparatory action, aiming, and hitting. Projecting all action frames onto a single latent code may lose local information, because the encoder learns one set of weights over all action frames rather than separate weights for each component. A method is therefore needed to capture action features at multiple scales.
2. **Irrelevant frame filtering**: Not all historical frames are useful for predicting the current frame; many are irrelevant or redundant. Such frames lead to incorrect projections and high time complexity, so a mechanism is required to filter them out efficiently.
To address these two challenges, the paper proposes a Multi-scale Action Learning Transformer (MALT). Its main contributions are as follows:
- **Multi-scale action feature capture**: MALT adopts a hierarchical encoder containing multiple encoding branches of different depths to capture multi-scale action features. The output of each branch is fed into the subsequent branch, so the output features gradually transition from coarse to fine.
- **Irrelevant frame filtering**: An explicit frame scoring mechanism uses sparse attention to evaluate the importance of historical frames and select the most informative ones. This not only improves accuracy but also reduces running time.
- **Efficient feature fusion**: A recurrent decoder fuses features from different scales while reducing the number of parameters, making model training more efficient.
MALT achieves state-of-the-art performance on two benchmark datasets, THUMOS'14 and TVSeries, outperforming all compared models with improvements of 0.2% mAP and 0.1% mcAP, respectively.
### Formula summary
- **Sparse attention mechanism**:
\[
Q = X_1W_q, \quad K = X_2W_k, \quad V = X_2W_v
\]
\[
A=\frac{QK^T}{\sqrt{D}}
\]
\[
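\Psi(A, k)_{ij}=
\begin{cases}
A_{ij} & \text{if } A_{ij}\geq t_i \\
-\infty & \text{if } A_{ij}< t_i
\end{cases}
\]
Here \(t_i\) is the threshold for row \(i\), taken to be the \(k\)-th largest score in that row, so each query keeps only its top-\(k\) scores and the masked entries receive zero weight after the softmax.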
\[
\text{SparseAttn}=\text{Softmax}(\Psi(A, k))V
\]
- **Hierarchical encoder**:
\[
f_n^1 = \text{SparseAttn}(\lambda', M_L)
\]
\[
f_p^n=\text{CrossAttn}(f_{p - 1}^{n - 1}, f_{p - 1}^n)
\]
- **Recurrent decoder**:
\[
\text{Out}_n=\text{CrossAttn}(Q', f_n)
\]
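The sparse attention formulas above can be illustrated with a minimal sketch. It uses only the Python standard library so that each step stays explicit. The function names (`matmul`, `sparse_attention`) are hypothetical, the threshold `t_i` is assumed to be the k-th largest score in each row, and this is not the authors' implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sparse_attention(x1, x2, w_q, w_k, w_v, k):
    """Top-k sparse attention: Softmax(Psi(Q K^T / sqrt(D), k)) V.

    x1 supplies the queries, x2 the keys and values (e.g. historical frames).
    Returns the attended output and the attention weights.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    q = matmul(x1, w_q)
    key = matmul(x2, w_k)
    v = matmul(x2, w_v)
    d = len(q[0])
    key_t = [list(col) for col in zip(*key)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, key_t)]

    weights = []
    for row in scores:
        # Assumption: t_i is the k-th largest score in row i.
        t_i = sorted(row, reverse=True)[min(k, len(row)) - 1]
        # Psi: keep scores >= t_i, mask the rest with -inf.
        masked = [s if s >= t_i else -math.inf for s in row]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v), weights
```

With three history frames and `k=2`, for instance, the lowest-scoring frame receives a weight of exactly zero and the remaining weights still sum to one. Because the mask keeps every score equal to the threshold, ties can leave more than `k` frames active.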
These formulas show how MALT handles multi-scale feature capture and irrelevant-frame filtering: sparse attention scores and filters the historical frames, the hierarchical encoder refines features from coarse to fine, and the recurrent decoder fuses the resulting scales.
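The hierarchical encoder and recurrent decoder recurrences can be sketched in the same spirit. This is a rough illustration under several assumptions that the summary does not confirm: the superscript is read as the branch index and the subscript as the layer index, branch n is given n layers, plain attention without learned projections stands in for both `SparseAttn` and `CrossAttn`, and the decoder carries its output forward as the next query (the formula writes Q' without an index, so this is only one reading of "recurrent"). All function names are hypothetical.

```python
import math


def cross_attn(queries, memory):
    """Scaled dot-product attention without learned projections (simplification)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(d) for m in memory]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * m[j] for e, m in zip(exps, memory))
                    for j in range(d)])
    return out


def hierarchical_encode(latents, memory, num_branches):
    """Return the last-layer feature of each branch, ordered coarse to fine."""
    # First layer, shared by all branches here; stands in for SparseAttn(lambda', M_L).
    first = cross_attn(latents, memory)
    prev_branch = [first]  # branch 1 has a single layer in this sketch
    finals = [first]
    for n in range(2, num_branches + 1):
        layers = [first]
        for p in range(1, n):
            # f_p^n = CrossAttn(f_{p-1}^{n-1}, f_{p-1}^n):
            # queries come from the preceding branch, keys/values from this branch.
            layers.append(cross_attn(prev_branch[p - 1], layers[p - 1]))
        finals.append(layers[-1])
        prev_branch = layers
    return finals


def recurrent_decode(query, features):
    """Apply the same attention step to every scale: Out_n = CrossAttn(Q', f_n)."""
    outputs = []
    for f in features:
        # Assumption: the previous output becomes the next query.
        query = cross_attn(query, f)
        outputs.append(query)
    return outputs
```

Reusing one attention step across all scales is what keeps the decoder's parameter count independent of the number of branches in this sketch; a real model would add learned projections, normalization, and feed-forward layers.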