CLIP-enhanced multimodal machine translation: integrating visual and label features with transformer fusion

DOI: https://doi.org/10.1007/s11042-024-19480-6
IF: 2.577
2024-06-06
Multimedia Tools and Applications
Abstract: Multimodal machine translation uses visual information to improve the quality of text translation. Most recent multimodal machine translation models consider only visual features and disregard label features. In addition, the text encoder weights are often not frozen during training, which makes it hard to validate the accuracy of the visual information. To address these issues, we propose a feature extraction method based on the pre-trained Contrastive Language-Image Pre-Training (CLIP) model. Our approach fuses label features and text features using a multi-layer transformer and processes visual features with a visual encoder. We also load a pre-trained text model, freeze the text encoder weights, and fine-tune the decoder weights during training. Experiments on the Multi30K dataset demonstrate the effectiveness and soundness of the proposed approach.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
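
The pipeline outlined in the abstract (label and text features fused by a multi-layer transformer, visual features passed through a visual encoder, text encoder frozen while the decoder is fine-tuned) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation. The class name `FusionMMT`, the helper `freeze`, the feature width, the random tensors standing in for CLIP label and image features, and the choice to concatenate fused and visual features as decoder memory are all assumptions made for illustration; the abstract does not specify these details.

```python
import torch
import torch.nn as nn

D_MODEL = 64   # hypothetical feature width; real CLIP embeddings are larger
VOCAB = 100    # hypothetical vocabulary size


class FusionMMT(nn.Module):
    """Toy stand-in for the described pipeline (assumed layout, not the paper's)."""

    def __init__(self, vocab_size=VOCAB, d_model=D_MODEL, n_layers=2):
        super().__init__()
        # Stand-in for a pre-trained text encoder whose weights get frozen.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Multi-layer transformer fusing label features with text features.
        fuse_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fuse_layer, num_layers=n_layers)
        # Stand-in visual encoder: a single projection of image features.
        self.visual_encoder = nn.Linear(d_model, d_model)
        # Decoder side, which stays trainable.
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids, label_feats, visual_feats):
        text = self.text_encoder(self.text_embed(src_ids))
        # Fuse label and text features along the sequence axis.
        fused = self.fusion(torch.cat([label_feats, text], dim=1))
        visual = self.visual_encoder(visual_feats)
        # Assumption: the decoder attends over fused and visual features jointly.
        memory = torch.cat([fused, visual], dim=1)
        t = tgt_ids.size(1)
        # Causal mask so each target position sees only earlier positions.
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(self.tgt_embed(tgt_ids), memory, tgt_mask=causal)
        return self.out(hidden)


def freeze(module):
    """Exclude a module's parameters from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False


torch.manual_seed(0)
model = FusionMMT()
freeze(model.text_embed)
freeze(model.text_encoder)

# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

src_ids = torch.randint(0, VOCAB, (2, 5))    # source sentence tokens
tgt_ids = torch.randint(0, VOCAB, (2, 6))    # target sentence tokens
label_feats = torch.randn(2, 3, D_MODEL)     # placeholder for CLIP label features
visual_feats = torch.randn(2, 1, D_MODEL)    # placeholder for CLIP image features

logits = model(src_ids, tgt_ids, label_feats, visual_feats)
```

Freezing the text encoder means any change in translation quality during fine-tuning has to come from the fusion, visual, and decoder components, which is one way to read the abstract's motivation for the frozen setup.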