Less is Better: Exponential Loss for Cross-Modal Matching

Jiwei Wei, Yang Yang, Xing Xu, Jingkuan Song, Guoqing Wang, Heng Tao Shen
DOI: https://doi.org/10.1109/tcsvt.2023.3249754
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Deep metric learning has become a key component of cross-modal retrieval. By learning to pull the features of matched instances closer while pushing the features of mismatched instances farther apart, one can learn highly robust multi-modal representations. Most existing cross-modal retrieval methods train the network with the vanilla triplet loss, which cannot adaptively penalize pairs of different hardness. Although various weighting strategies have been designed for unimodal matching tasks, few have been applied to cross-modal tasks because of the specific characteristics of those tasks. The few weighting strategies that are designed for cross-modal scenarios usually involve many hyper-parameters, which require substantial computational resources to tune. In this paper, we introduce a new exponential loss that assigns appropriate weights to individual positive and negative pairs according to their similarity, so that it adaptively penalizes pairs of different hardness. Furthermore, the exponential loss has only two hyper-parameters, making it easier to find the optimal settings for various data distributions in practice. The exponential loss can be applied to well-established cross-modal models to further boost their retrieval performance. We exhaustively ablate our method on Image-Text matching, Video-Text matching, and unimodal Image matching. Experimental results show that a standard model trained with the exponential loss achieves noticeable performance gains.
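The contrast the abstract draws between the vanilla triplet loss and a similarity-dependent weighting can be illustrated with a small sketch. The abstract does not state the paper's formula, so the exponential terms below are an assumed, illustrative form only: the function names, the hyper-parameters `alpha` and `beta`, and the margin value are all hypothetical and are not taken from the paper.

```python
import math


def triplet_grad_neg(s_pos, s_neg, margin=0.2):
    """Gradient of the hinge triplet loss max(0, margin + s_neg - s_pos)
    with respect to s_neg: 1 for every violating triplet, 0 otherwise."""
    return 1.0 if margin + s_neg - s_pos > 0 else 0.0


def exp_weight_neg(s_neg, beta=4.0):
    """Assumed form (not the paper's): gradient magnitude of exp(beta * s_neg).
    A negative pair with higher similarity (harder) gets a larger weight."""
    return beta * math.exp(beta * s_neg)


def exp_weight_pos(s_pos, alpha=2.0):
    """Assumed form (not the paper's): gradient magnitude of exp(-alpha * s_pos).
    A positive pair with lower similarity (harder) gets a larger weight."""
    return alpha * math.exp(-alpha * s_pos)


if __name__ == "__main__":
    # Two violating triplets of different hardness get the same hinge gradient.
    print(triplet_grad_neg(0.5, 0.4), triplet_grad_neg(0.5, 0.9))
    # The exponential weights grow with hardness instead.
    print(exp_weight_neg(0.4), exp_weight_neg(0.9))
    print(exp_weight_pos(0.8), exp_weight_pos(0.2))
```

In this sketch the hinge gradient is constant for any violating triplet, whereas the exponential weights depend on each pair's similarity and are controlled by just two hyper-parameters, which mirrors the property the abstract describes.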
engineering, electrical & electronic