CAPEEN: Image Captioning with Early Exits and Knowledge Distillation

Divya Jyoti Bajpai, Manjesh Kumar Hanawal
2024-10-06
Abstract: Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks. However, their improved performance comes at the cost of increased computational burden and inference latency. Early Exit (EE) strategies can be used to enhance their efficiency, but adapting them to image captioning is challenging because the task requires varying levels of semantic information for accurate predictions. To overcome this, we introduce CAPEEN to improve the performance of EE strategies using knowledge distillation. Inference in CAPEEN is completed at intermediary layers if prediction confidence exceeds a predefined value learned from the training data. To account for real-world deployments, where target distributions could drift from that of training samples, we introduce a variant, A-CAPEEN, that adapts the thresholds on the fly using a Multi-Armed Bandits framework. Experiments on the MS COCO and Flickr30k datasets show that CAPEEN achieves a speedup of 1.77x while maintaining competitive performance compared to the final layer, and A-CAPEEN additionally offers robustness against distortions. The source code is available at https://github.com/Div290/CapEEN
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the following problem: in image captioning, deep neural networks (DNNs) have significantly improved performance, but at the cost of a higher computational burden and inference latency. Early Exit (EE) strategies have been proposed to improve efficiency, but applying them to image captioning is challenging because the task requires semantic information at different levels for accurate prediction. Specifically, the paper identifies the following problems and proposes a solution to each:

1. **Improving the performance of the EE strategy**
   - **Problem**: Directly attaching traditional EE strategies to pre-trained backbone networks may not suit image captioning, because features at different levels are crucial for generating accurate captions.
   - **Solution**: Introduce the CAPEEN framework, which uses knowledge distillation to transfer the knowledge of deep-level features to shallow-level classifiers, thereby improving the quality of early exits.
2. **Adapting to changes in the data distribution during inference**
   - **Problem**: During inference, the distribution of target samples may differ from that of the training samples, which degrades model performance.
   - **Solution**: Propose the A-CAPEEN algorithm, which adjusts the exit threshold online using the Multi-Armed Bandits (MAB) framework, enabling it to adapt to different levels of noise and distortion.
3. **Dynamically selecting the optimal exit threshold**
   - **Problem**: The exit threshold must be selected dynamically during inference to balance computational efficiency against accuracy.
   - **Solution**: Learn the optimal exit threshold through the MAB framework, so that the model stays efficient and robust under different data distributions.
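The confidence-based exit rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `early_exit`, the toy distributions, and the choice of the largest token probability as the confidence score are all assumptions made for the example.

```python
from typing import Iterable, Sequence, Tuple


def early_exit(exit_outputs: Iterable[Sequence[float]], alpha: float) -> Tuple[int, int]:
    """Pick the first exit whose confidence reaches the threshold alpha.

    exit_outputs yields one token distribution per exit, shallowest first,
    with the final layer last. Passing a generator means deeper layers are
    only evaluated when the shallower exits were not confident enough.
    Confidence is taken to be the largest probability (an assumption).
    Returns (exit_index, token_index), both 0-based.
    """
    exit_index, token = -1, -1
    for exit_index, probs in enumerate(exit_outputs):
        token = max(range(len(probs)), key=probs.__getitem__)
        if probs[token] >= alpha:
            break  # confident enough: stop here and skip deeper layers
    if exit_index < 0:
        raise ValueError("exit_outputs must yield at least one distribution")
    return exit_index, token


def toy_exits():
    # Hypothetical distributions over a 3-token vocabulary for three exits.
    yield [0.40, 0.35, 0.25]
    yield [0.10, 0.85, 0.05]
    yield [0.05, 0.90, 0.05]


print(early_exit(toy_exits(), alpha=0.80))  # (1, 1): the second exit is confident
print(early_exit(toy_exits(), alpha=0.95))  # (2, 1): falls through to the final layer
```

A higher `alpha` pushes more samples to deeper layers, trading speed for accuracy. That trade-off is what the learned (CAPEEN) or adapted (A-CAPEEN) threshold is meant to balance.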
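### Formula summary

- **Loss function** (for CAPEEN exit training):
  \[ L_i(I; \theta, \theta_e)=\frac{1}{T} \sum_{t = 1}^T\left(-\log P_i(y_t^*|y_1^{t - 1}, I; \theta, \theta_e)+\text{KL}(p_t^i, p_t^n)\right) \]
  where $\theta$ is the set of all parameters, $I$ is the input image, $T$ is the caption length, $y_1^T$ is the ground-truth caption with $y_t^*$ its token at step $t$, $P_i$ is the probability score of the $i$-th student classifier, and $\text{KL}$ is the Kullback-Leibler divergence. Here $p_t^i$ and $p_t^n$ are presumably the output distributions of the $i$-th exit and of the final layer at step $t$, so the KL term penalizes disagreement between the student exit and the final layer.
- **Reward function** (for A-CAPEEN online learning):
  \[ r(\alpha)=\begin{cases}(C_i - C_1)-\mu o_i & \text{if } C_j < \alpha \text{ for } j\in[i - 1]\text{ and } C_i\geq\alpha\\(C_N - C_1)-\mu o_N & \text{if } C_j < \alpha \text{ for all } j\in[N - 1]\end{cases} \]
  where $\alpha$ is the candidate exit threshold, $C_i$ is the prediction confidence at the $i$-th exit, $N$ indexes the final layer, $\mu$ is the scaling factor, and $o_i$ is the processing cost from the first exit layer to the $i$-th exit layer. The first case applies when the sample exits at layer $i$; the second applies when no earlier exit reaches the threshold and the sample is processed up to the final layer.

Through these methods, CAPEEN and its variant A-CAPEEN aim to improve inference efficiency in image captioning while maintaining high accuracy and robustness.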
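The two formulas can be made concrete with a short sketch in plain Python. It rests on several assumptions that the summary does not confirm: the KL term is treated as a penalty added to the cross-entropy, its arguments are taken in the order written (exit distribution first, final-layer distribution second), and the function names `kl_divergence`, `exit_loss`, and `reward` are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def exit_loss(student_probs: Sequence[Sequence[float]],
              final_probs: Sequence[Sequence[float]],
              targets: Sequence[int]) -> float:
    """Average over caption steps of cross-entropy plus KL(student, final).

    student_probs[t] and final_probs[t] are the token distributions of the
    i-th exit and of the final layer at step t; targets[t] is the index of
    the ground-truth token. Adding the KL term as a penalty is an assumption.
    """
    total = 0.0
    for p_i, p_n, y in zip(student_probs, final_probs, targets):
        total += -math.log(p_i[y]) + kl_divergence(p_i, p_n)
    return total / len(targets)


def reward(confidences: Sequence[float], costs: Sequence[float],
           alpha: float, mu: float) -> float:
    """Reward r(alpha) for one sample.

    confidences[i] and costs[i] play the roles of C_i and o_i (0-based here).
    The sample exits at the first layer whose confidence reaches alpha,
    otherwise at the final layer.
    """
    for c_i, o_i in zip(confidences, costs):
        if c_i >= alpha:
            return (c_i - confidences[0]) - mu * o_i
    return (confidences[-1] - confidences[0]) - mu * costs[-1]


# Toy values, chosen only for illustration.
uniform = [[0.5, 0.5]]
print(exit_loss(uniform, uniform, targets=[0]))  # KL is 0, so this is log(2)
print(reward([0.5, 0.7, 0.9], [0.0, 1.0, 2.0], alpha=0.6, mu=0.1))
```

In the second call the sample exits at the second layer, so the reward is the confidence gain over the first exit (0.7 - 0.5) minus the scaled cost (0.1 × 1.0). A bandit algorithm in the spirit of A-CAPEEN would treat each candidate `alpha` as an arm and use this reward to update its estimate of that arm's value.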