Enhancing Multi-Label Deep Hashing for Image and Audio With Joint Internal Global Loss Constraints and Large Vision-Language Model

Ye Liu, Yan Pan, Jian Yin
DOI: https://doi.org/10.1109/lsp.2024.3455991
2024-09-27
IEEE Signal Processing Letters
Abstract: Deep hashing algorithms transform high-dimensional features into low-dimensional hash codes, which reduces storage space and improves computational efficiency in traditional information retrieval (IR) and in retrieval-augmented generation (RAG) scenarios for large models. In recent years, pre-trained convolutional or transformer networks have commonly been chosen as the backbone in deep hashing frameworks. Such frameworks typically incorporate local loss constraints among training samples and then fine-tune the model to generate hash codes. Because constraints among training samples carry relatively limited local information, we propose a novel anchor constraint and a structural constraint as internal global loss constraints on the vision transformer network, and we augment external information by integrating a large vision-language model, thereby enhancing the performance of hash code generation. Additionally, to enhance the scalability of the novel deep hashing framework, we incorporate an adapter module that extends its application from the image domain to the audio domain. Comparative experiments and ablation analysis on various image and audio datasets confirm that the proposed method achieves state-of-the-art retrieval results.
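To make the idea of hash codes concrete, the following minimal sketch shows how a feature vector can be binarized into a compact code and how retrieval can then rank items by Hamming distance. It is not the method proposed in the paper: a fixed random projection stands in for the learned network, and the names `make_projection`, `hash_code`, `hamming`, and `retrieve` are hypothetical, chosen only for this illustration.

```python
import random


def make_projection(dim, n_bits, seed=0):
    """Random projection matrix (n_bits x dim); a stand-in for a learned hashing layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(features, projection):
    """Binarize features: one bit per projection row, set when the dot product is >= 0."""
    code = 0
    for row in projection:
        dot = sum(w * x for w, x in zip(row, features))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def retrieve(query_code, database_codes, top_k=3):
    """Indices of the top_k database codes closest to the query in Hamming distance."""
    order = sorted(range(len(database_codes)),
                   key=lambda i: hamming(query_code, database_codes[i]))
    return order[:top_k]


if __name__ == "__main__":
    rng = random.Random(1)
    dim, n_bits = 64, 16  # assumed sizes: 64 float features compressed to 16 bits
    projection = make_projection(dim, n_bits)
    database = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(10)]
    codes = [hash_code(vec, projection) for vec in database]
    query = codes[4]  # query with an item that is already in the database
    print(retrieve(query, codes))
```

Each item is stored as a 16-bit integer instead of 64 floating-point values, and comparing two items costs one XOR and a bit count, which is where the storage and efficiency gains described in the abstract come from. A deep hashing model replaces the random projection with a trained network so that semantically similar items receive nearby codes.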
engineering, electrical & electronic