Abstract: Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks. However, their improved performance comes at the cost of increased computational burden and inference latency. Early Exit (EE) strategies can be used to enhance their efficiency, but adapting them to image captioning is challenging because the task requires varying levels of semantic information for accurate predictions. To overcome this, we introduce CAPEEN, which improves the performance of EE strategies using knowledge distillation. Inference in CAPEEN is completed at intermediary layers if the prediction confidence exceeds a predefined value learned from the training data. To account for real-world deployments, where target distributions could drift from that of the training samples, we introduce a variant, A-CAPEEN, which adapts the thresholds on the fly using a multi-armed bandit framework. Experiments on the MS COCO and Flickr30k datasets show that CAPEEN achieves a speedup of 1.77x while maintaining competitive performance compared to the final layer, and A-CAPEEN additionally offers robustness against distortions. The source code is available at https://github.com/Div290/CapEEN

What problem does this paper attempt to address?

This paper targets the efficiency of image captioning: deep neural networks (DNNs) have significantly improved caption quality, but at the cost of a higher computational burden and inference latency. Early Exit (EE) strategies can reduce that cost, yet applying them to image captioning is challenging because the task requires semantic information at different levels for accurate prediction. Specifically, the paper identifies the following problems and proposes solutions:

1. **Improving the performance of the EE strategy**:
   - **Problem**: Directly applying traditional EE strategies to pre-trained backbone networks may not suit image captioning, because features at different levels are crucial for generating accurate captions.
   - **Solution**: Introduce the CAPEEN framework, which uses knowledge distillation to transfer the knowledge of deep-level features to shallow-level classifiers, thereby improving the performance of early exits.
2. **Adapting to changes in the data distribution during inference**:
   - **Problem**: During inference, the distribution of target samples may differ from that of the training samples, which degrades model performance.
   - **Solution**: Propose the A-CAPEEN algorithm, which adjusts the exit threshold online using the Multi-Armed Bandits (MAB) framework, so that it can adapt to different levels of noise and distortion.
3. **Dynamically selecting the optimal exit threshold**:
   - **Problem**: The exit threshold must be chosen dynamically during inference to balance computational efficiency against accuracy.
   - **Solution**: Learn the optimal exit threshold through the MAB framework, so that the model stays efficient and robust under different data distributions.

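
The exit rule and the online threshold adaptation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how confidence is computed or which bandit algorithm A-CAPEEN uses, so the per-exit confidence scores are taken as given, UCB1 is used as a stand-in bandit, and the names `early_exit` and `ThresholdBandit` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def early_exit(confidences: Sequence[float], alpha: float) -> Tuple[int, float]:
    """Return (1-based exit index, confidence) for the first exit whose
    confidence reaches the threshold alpha; otherwise use the final exit."""
    if not confidences:
        raise ValueError("need at least one exit")
    for i, c in enumerate(confidences, start=1):
        if c >= alpha:
            return i, c
    return len(confidences), confidences[-1]


class ThresholdBandit:
    """UCB1 over a fixed grid of candidate thresholds (assumed stand-in;
    the bandit algorithm used by A-CAPEEN is not stated in this summary)."""

    def __init__(self, thresholds: Sequence[float]) -> None:
        self.thresholds: List[float] = list(thresholds)
        self.counts: List[int] = [0] * len(self.thresholds)
        self.means: List[float] = [0.0] * len(self.thresholds)

    def select(self) -> int:
        """Pick the index of the threshold to try on the next sample."""
        # Try every threshold once before using the UCB score.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        """Fold an observed reward into the running mean of the chosen arm."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


if __name__ == "__main__":
    bandit = ThresholdBandit([0.5, 0.7, 0.9])
    arm = bandit.select()
    exit_index, conf = early_exit([0.3, 0.6, 0.9], bandit.thresholds[arm])
    print(exit_index, conf)
```

In an online loop, each incoming sample would call `select()`, run `early_exit` with the chosen threshold, compute a reward, and pass it to `update()`. UCB1's exploration bonus assumes rewards in a bounded range, so a reward that can go negative would need rescaling before being used this way.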
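### Formula summary

- **Loss function** (for CAPEEN exit training):

  \[
  L_i(I; \theta, \theta_e) = \frac{1}{T} \sum_{t=1}^{T} \left( -\log P_i(y_t^* \mid y_1^{t-1}, I; \theta, \theta_e) + \text{KL}(p_t^i, p_t^n) \right)
  \]

  where $\theta$ is the set of all parameters, $\theta_e$ presumably denotes the parameters of the exit classifiers, $I$ is the input image, $T$ is the caption length, $y_t^*$ is the ground-truth token at step $t$ and $y_1^{t-1}$ the tokens preceding it, $P_i$ is the probability score of the $i$-th student classifier, and $\text{KL}$ is the Kullback-Leibler divergence between the output distribution $p_t^i$ of the $i$-th exit and the output distribution $p_t^n$ of the final layer, which acts as the teacher. The KL term enters with a positive sign, so minimizing the loss pulls each exit toward the final layer's predictions.

- **Reward function** (for A-CAPEEN online learning):

  \[
  r(\alpha) =
  \begin{cases}
  (C_i - C_1) - \mu o_i & \text{if } C_j < \alpha \text{ for } j \in [i-1] \text{ and } C_i \geq \alpha \\
  (C_N - C_1) - \mu o_N & \text{if } C_j < \alpha \text{ for all } j \in [N-1]
  \end{cases}
  \]

  where $\alpha$ is the candidate exit threshold, $C_i$ is the prediction confidence at the $i$-th exit, $N$ is the index of the final exit, $[k]$ denotes $\{1, \dots, k\}$, $\mu$ is the scaling factor, and $o_i$ is the processing cost from the first exit layer to the $i$-th exit layer.

Through these methods, CAPEEN and its variant A-CAPEEN aim to improve inference efficiency in the image captioning task while maintaining high accuracy and robustness.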
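
The two formulas above can be evaluated on toy inputs as in the sketch below. It rests on several assumptions: the KL term is treated as a penalty added to the negative log-likelihood (the usual form of a distillation loss), $\text{KL}(p_t^i, p_t^n)$ is read as $\text{KL}(p_t^i \,\|\, p_t^n)$, distributions are plain Python lists, and the function names `kl`, `exit_loss` and `reward` are hypothetical. It is meant to make the definitions concrete, not to reproduce the released code.

```python
import math
from typing import Sequence

Dist = Sequence[float]  # one probability distribution over the vocabulary


def kl(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def exit_loss(student: Sequence[Dist], teacher: Sequence[Dist],
              targets: Sequence[int]) -> float:
    """Mean over caption steps of the negative log-likelihood of the
    ground-truth token plus the KL term between exit i and the final layer.
    Assumes the student assigns non-zero probability to every target."""
    total = 0.0
    for p_i, p_n, y in zip(student, teacher, targets):
        total += -math.log(p_i[y]) + kl(p_i, p_n)
    return total / len(targets)


def reward(confidences: Sequence[float], costs: Sequence[float],
           alpha: float, mu: float) -> float:
    """Reward r(alpha): confidence gain over the first exit minus the scaled
    processing cost, evaluated at the first exit whose confidence reaches
    alpha, or at the final exit if none does."""
    c_first = confidences[0]
    for c_i, o_i in zip(confidences, costs):
        if c_i >= alpha:
            return (c_i - c_first) - mu * o_i
    return (confidences[-1] - c_first) - mu * costs[-1]


if __name__ == "__main__":
    student = [[0.5, 0.5], [0.2, 0.8]]
    teacher = [[0.9, 0.1], [0.1, 0.9]]
    print(exit_loss(student, teacher, targets=[0, 1]))
    print(reward([0.4, 0.7, 0.9], costs=[0.0, 1.0, 2.0], alpha=0.6, mu=0.1))
```

With `costs[0] = 0.0`, exiting at the first layer always yields a reward of zero, so a later exit is only worth its cost when the confidence gain exceeds `mu` times the extra processing.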

CAPEEN: Image Captioning with Early Exits and Knowledge Distillation
