CKDH: CLIP-based Knowledge Distillation Hashing for Cross-modal Retrieval

Jiaxing Li, Wai Keung Wong, Lin Jiang, Xiaozhao Fang, Shengli Xie, Yong Xu
DOI: https://doi.org/10.1109/tcsvt.2024.3350695
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Recently, deep hashing-based cross-modal retrieval has attracted much attention from researchers because of its fast retrieval and low storage overhead. However, existing deep hashing-based cross-modal retrieval methods typically 1) inadequately capture the semantic relevance and coexistent information of cross-modal data, which may result in sub-optimal retrieval performance, 2) require a more comprehensive similarity measurement for cross-modal features to ensure high retrieval accuracy, and 3) lack the scalability needed for lightweight deployment. To handle these issues, we propose CLIP-based knowledge distillation hashing (CKDH) for cross-modal retrieval, following the research trend of combining traditional methods with modern neural architectures to design lightweight networks based on large language models. Specifically, to help capture the semantic relevance and coexistent information, CLIP is fine-tuned to extract visual features, while a graph attention network is used to enhance textual features extracted by a bag-of-words model in the teacher model. Then, to better supervise the training of the student model, a more comprehensive similarity measurement is introduced to represent the distilled knowledge by jointly preserving the log-likelihood and the intra- and inter-modality similarities. Finally, the student model extracts deep features with a lightweight network and generates hash codes under the supervision of the similarity matrix produced by the teacher model. Experimental results on three widely used datasets demonstrate that CKDH outperforms several state-of-the-art methods, consistently delivering the best results.
engineering, electrical & electronic