Abstract: Structured prediction models aim at solving a type of problem where the output is a complex structure, rather than a single variable. Performing knowledge distillation for such models is not trivial due to their exponentially large output space. In this work, we propose an approach that is much simpler in its formulation and far more efficient for training than existing approaches. Specifically, we transfer the knowledge from a teacher model to its student model by locally matching their predictions on all sub-structures, instead of the whole output space. In this manner, we avoid adopting some time-consuming techniques like dynamic programming (DP) for decoding output structures, which permits parallel computation and makes the training process even faster in practice. Besides, it encourages the student model to better mimic the internal behavior of the teacher model. Experiments on two structured prediction tasks demonstrate that our approach outperforms previous methods and halves the time cost for one training epoch.
What problem does this paper attempt to address?
This paper attempts to solve the problem of knowledge distillation in structured prediction models. Specifically, it aims to address the following two key issues:
1. **Huge output space**: The output of a structured prediction model is a complex structure (e.g., sequence labels), rather than a single variable. Therefore, its output space grows exponentially with the sequence length. Direct application of traditional knowledge distillation methods is computationally infeasible because of the need to handle the huge output space.
2. **Inefficient existing methods**: Existing structured knowledge distillation methods rely on time-consuming techniques, such as dynamic programming (DP) or K-best decoding, which make training slow and difficult to parallelize.
To solve these problems, the authors propose an efficient knowledge distillation method that locally matches the predictions of the teacher model and the student model on all sub-structures, rather than on the entire output space. This avoids a global search over output structures, which significantly improves training efficiency and lets the student model better imitate the internal behavior of the teacher model.
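To make the size of the output space concrete, the sketch below computes the log-normalizer of a small linear-chain CRF in two ways: by enumerating all K^L label sequences, and with the forward algorithm, the kind of dynamic programming that existing methods depend on. The array names `trans` and `emit`, the indexing `emit[l, y_l]`, and the choice to score the first position by its emission alone are assumptions made for illustration, not details taken from the paper.

```python
import itertools

import numpy as np


def sequence_score(trans, emit, labels):
    """Sum of local scores along one label sequence.

    Assumption: the first position has no predecessor, so only its
    emission score is counted there.
    """
    score = emit[0, labels[0]]
    for l in range(1, len(labels)):
        score += trans[labels[l - 1], labels[l]] + emit[l, labels[l]]
    return score


def log_partition_brute_force(trans, emit):
    """log Z(x) by enumerating all K**L label sequences (exponential cost)."""
    L, K = emit.shape
    scores = np.array([sequence_score(trans, emit, y)
                       for y in itertools.product(range(K), repeat=L)])
    return float(np.logaddexp.reduce(scores))


def log_partition_forward(trans, emit):
    """log Z(x) with the forward algorithm, O(L * K^2) but sequential in l."""
    alpha = emit[0].copy()
    for l in range(1, emit.shape[0]):
        # alpha[j] = logsumexp_i(alpha[i] + trans[i, j]) + emit[l, j]
        alpha = np.logaddexp.reduce(alpha[:, None] + trans, axis=0) + emit[l]
    return float(np.logaddexp.reduce(alpha))
```

Both functions return the same value, but the enumeration touches K^L sequences, while the forward pass needs a loop over positions in which each step depends on the previous one. That dependency is what makes DP-based distillation hard to parallelize.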
### Specific solutions
- **Locally match sub-structure predictions**: By locally matching sub-structure predictions instead of the entire output structure, time-consuming techniques such as dynamic programming can be avoided.
- **Parallel matrix calculation**: Due to the simplicity of the method, efficient parallel matrix calculations can be carried out on the GPU, further reducing the training time.
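The two points above can be sketched with array broadcasting. The code is a minimal illustration that assumes a linear-chain CRF whose sub-structures are adjacent label pairs, scored as a transition score plus an emission score, and that averages the squared score difference over all pairs as in the sub-structure prediction difference formula in the summary further down. The function names and array layout are hypothetical, and the teacher and student are assumed to share one label set.

```python
import numpy as np


def pair_scores(trans, emit):
    """Scores for every adjacent label pair at every position.

    trans: (K, K) transition scores, trans[i, j] for y_{l-1}=i, y_l=j.
    emit:  (L, K) emission scores, emit[l, j] for y_l=j (assumed layout).
    Returns an array of shape (L-1, K, K) built in one broadcast, with
    no loop over positions and no dynamic programming.
    """
    return trans[None, :, :] + emit[1:, None, :]


def substructure_kd_loss(student_trans, student_emit,
                         teacher_trans, teacher_emit):
    """Mean squared difference between student and teacher pair scores."""
    s_student = pair_scores(student_trans, student_emit)
    s_teacher = pair_scores(teacher_trans, teacher_emit)
    # The mean over all (L-1) * K * K entries plays the role of 1 / |U(x)|.
    return float(np.mean((s_student - s_teacher) ** 2))
```

Every entry of the score tensor is computed independently, so the same expression maps directly onto batched GPU tensor operations in a deep learning framework.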
### Experimental results
The experiments show that the method cuts training time by more than half compared with other methods and also outperforms existing structured knowledge distillation methods.
### Summary of mathematical formulas
- **CRF score function**:
\[
s(y_l, y_{l - 1}, x_l)=t_{y_{l - 1}, y_l}+e_l
\]
where \(t_{y_{l - 1}, y_l}\) is the transition score and \(e_l\) is the emission score.
- **Conditional probability**:
\[
p(y|x)=\frac{1}{Z(x)}\prod_{l = 1}^L\exp\{s(y_l, y_{l - 1}, x_l)\}
\]
where \(Z(x)\) is the normalization factor.
- **Knowledge distillation loss**:
\[
L_{KD}=-\sum_{u\in U(x)}p'(u|x)\log p(u|x)
\]
where \(u\) represents any possible adjacent label pair \(\{y_l, y_{l - 1}\}\), and \(p'\) is the conditional probability calculated by the teacher model.
- **Sub-structure prediction difference**:
\[
L_{KD}=\frac{1}{|U(x)|}\sum_{u\in U(x)}\|s(u, x)-s'(u, x)\|^2
\]
where \(s'(u, x)\) is the sub-structure score calculated by the teacher model.
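For contrast with the score-based loss, the probability-based knowledge distillation loss in the list above needs the pair probabilities \(p(u|x)\). One common way to obtain them for a linear-chain CRF is the forward-backward algorithm, sketched below in log space. Treating \(p(u|x)\) as the marginal probability of an adjacent label pair, and computing it this way, are assumptions made for illustration; the summary does not state how these probabilities are computed. The names `trans` and `emit` are hypothetical.

```python
import numpy as np


def pair_log_marginals(trans, emit):
    """log p(y_{l-1}=i, y_l=j | x) for l = 1..L-1, shape (L-1, K, K)."""
    L, K = emit.shape
    alpha = np.zeros((L, K))
    beta = np.zeros((L, K))  # beta[L-1] = 0 in log space
    alpha[0] = emit[0]
    for l in range(1, L):  # forward pass, sequential in l
        alpha[l] = np.logaddexp.reduce(
            alpha[l - 1][:, None] + trans, axis=0) + emit[l]
    for l in range(L - 2, -1, -1):  # backward pass, sequential in l
        beta[l] = np.logaddexp.reduce(
            trans + (emit[l + 1] + beta[l + 1])[None, :], axis=1)
    log_z = np.logaddexp.reduce(alpha[-1])
    return (alpha[:-1, :, None] + trans[None, :, :]
            + (emit[1:] + beta[1:])[:, None, :] - log_z)


def marginal_kd_loss(student_trans, student_emit,
                     teacher_trans, teacher_emit):
    """Cross-entropy between teacher and student pair marginals."""
    log_p_student = pair_log_marginals(student_trans, student_emit)
    p_teacher = np.exp(pair_log_marginals(teacher_trans, teacher_emit))
    return float(-(p_teacher * log_p_student).sum())
```

Both passes loop over positions, which is the sequential cost that comparing raw sub-structure scores avoids.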
Together, these changes greatly improve training efficiency while maintaining high accuracy.