Abstract: Image Retrieval aims to retrieve corresponding images based on a given query. In application scenarios, users intend to express their retrieval intent through various query styles. However, current retrieval tasks predominantly focus on text-query retrieval exploration, leading to limited retrieval query options and potential ambiguity or bias in user intention. In this paper, we propose the Style-Diversified Query-Based Image Retrieval task, which enables retrieval based on various query styles. To facilitate the novel setting, we propose the first Diverse-Style Retrieval dataset, encompassing diverse query styles including text, sketch, low-resolution, and art. We also propose a lightweight style-diversified retrieval framework. For various query style inputs, we apply the Gram Matrix to extract the query's textural features and cluster them into a style space with style-specific bases. Then we employ the style-init prompt tuning module to enable the visual encoder to comprehend the texture and style information of the query. Experiments demonstrate that our model, employing the style-init prompt tuning strategy, outperforms existing retrieval models on the style-diversified retrieval task. Moreover, style-diversified queries (sketch+text, art+text, etc.) can be simultaneously retrieved in our model. The auxiliary information from other queries enhances the retrieval performance within the respective query.
What problem does this paper attempt to address?
This paper addresses the mismatch between the many ways users express retrieval intent and the narrow range of query styles that current image retrieval models handle well. Existing image retrieval work focuses mainly on text queries and largely ignores other query styles, such as sketches, art-style images, and low-resolution images. This restricts how users can express their intent and can introduce ambiguity or bias. To address this, the authors propose a new task, "Style-Diversified Query-Based Image Retrieval" (SD-QBIR), and construct the Diverse-Style Retrieval (DSR) dataset to support retrieval with different query styles. They also propose a lightweight style-diversified retrieval framework, FreestyleRet, which handles multiple query styles and improves retrieval performance.
### Core Contributions of the Paper
1. **Proposing the SD-QBIR Task**: Introduce, for the first time, an image retrieval task with style-diversified queries, so that users can express their intent in whichever query style suits them.
2. **Constructing the DSR Dataset**: Create a dataset containing 10,000 natural images and their corresponding queries in different styles (text, sketch, low-resolution, art-style).
3. **Designing the FreestyleRet Framework**: Propose a lightweight, pluggable framework. Its style-init prompt tuning module lets a pre-trained visual encoder adapt to multiple query styles.
### Technical Methods
1. **Gram-based Style Extraction Module**: Use the Gram matrix to extract the texture features of the query.
2. **Style Space Construction Module**: Construct the style space by clustering the Gram matrices of all queries, and use the cluster centers as the style bases.
3. **Style-Init Prompt Tuning Module**: Apply prompt tuning to the frozen visual encoder. Prompt tokens initialized from the style information let the encoder interpret query inputs of different styles.
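
The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`gram_matrix`, `kmeans`, `style_init_prompt`), the plain k-means clustering, the softmax weighting over distances to the style bases, and the random feature maps standing in for encoder activations are all assumptions made for the example. In a real system the resulting vector would presumably be mapped to the encoder's token dimension before being used as a prompt token; that step is omitted here.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Map a (C, H, W) feature map to its (C, C) Gram matrix of channel correlations."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def kmeans(points: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Plain k-means (an assumed clustering choice); returns (k, D) cluster centers."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points (fancy indexing returns a copy).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def style_init_prompt(gram_vec: np.ndarray, bases: np.ndarray, temperature: float = 1.0):
    """Hypothetical style-init step: softmax-weighted mix of style bases.

    Bases closer to the query's Gram vector receive larger weights.
    """
    dists = np.linalg.norm(bases - gram_vec, axis=1)
    logits = -dists / temperature
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ bases, weights

# Random stand-ins for encoder feature maps: 32 queries, 4 channels, 8x8 spatial grid.
rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 4, 8, 8))

# Step 1: Gram-based style extraction, flattened to one vector per query.
grams = np.stack([gram_matrix(f).ravel() for f in feats])

# Step 2: style space construction, with cluster centers as style bases.
bases = kmeans(grams, k=4)

# Step 3: a style-initialized prompt vector for the first query.
prompt, weights = style_init_prompt(grams[0], bases)
```

The choice of four clusters mirrors the four query styles named in the summary (text, sketch, low-resolution, art), but the number of style bases is a free parameter in this sketch.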
### Experimental Results
- **Performance Comparison**: Experiments on two benchmark datasets, DSR and ImageNet-X, show that FreestyleRet achieves better retrieval performance across multiple query styles than existing cross-modal and multi-modal models.
- **Multi-style Queries**: FreestyleRet can handle multiple query styles simultaneously (such as sketch + text or art-style + text), and the auxiliary information from the additional query improves retrieval performance for the respective query.
- **Computational Efficiency**: FreestyleRet is efficient in both parameter count and inference speed, which makes it suitable for rapid deployment.
### Conclusion
This paper addresses the limited query-style adaptability of existing image retrieval models by proposing the SD-QBIR task and the FreestyleRet framework, giving users a more flexible and accurate image retrieval solution.