Abstract: Benefiting from the generalization capability of CLIP, recent vision-language pre-training (VLP) models have demonstrated an impressive ability to capture virtually any visual concept in everyday images. However, because open-vocabulary settings contain unseen categories, existing algorithms struggle to capture the strong semantic correlations between categories, resulting in sub-optimal performance on open-vocabulary multi-label recognition (OV-MLR). Furthermore, the number of discriminative regions varies substantially across object categories, which is misaligned with the fixed-number patch matching used in current methods and introduces noisy visual cues that hinder accurate capture of the target semantics. To tackle these challenges, we propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework that explores semantic correlations both within each category and across different categories in a category-adaptive manner. The proposed framework consists of two complementary modules: an intra-category semantic refinement (ISR) module and an inter-category semantic transfer (IST) module. Specifically, the ISR module leverages the cross-modal knowledge of the VLP model to adaptively find a set of local discriminative regions that best represent the semantics of the target category. The IST module uses the commonsense capabilities of large language models (LLMs) to construct a category-adaptive correlation graph, adaptively discovering the set of categories most correlated with a target category, and transfers semantic knowledge from the correlated seen categories to unseen ones. Extensive experiments on OV-MLR benchmarks demonstrate that the proposed C$^2$SRT framework outperforms current state-of-the-art algorithms.
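To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes patch and text embeddings are already available from some VLP model, and it swaps in two simple stand-ins: a margin-based rule for choosing a category-specific number of patches (in place of ISR), and a mean over correlated seen categories on a given 0/1 graph (in place of IST). The function names, `margin`, `alpha`, and the adjacency matrix, which the paper would obtain from an LLM, are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def adaptive_region_scores(patch_feats, text_feats, margin=0.1):
    """Score each category from a category-specific number of patches.

    patch_feats: (P, D) patch embeddings; text_feats: (C, D) category embeddings.
    Assumption: a patch is kept for a category if its similarity lies within
    `margin` of that category's best-matching patch, so the number of kept
    patches differs per category instead of being a fixed top-k.
    """
    sim = l2_normalize(patch_feats) @ l2_normalize(text_feats).T  # (P, C)
    keep = sim >= sim.max(axis=0, keepdims=True) - margin         # (P, C) bool
    counts = keep.sum(axis=0)                                     # always >= 1
    scores = (sim * keep).sum(axis=0) / counts                    # mean over kept
    return scores, counts


def transfer_from_seen(scores, adjacency, seen_mask, alpha=0.5):
    """Blend each unseen category's score with its correlated seen categories.

    adjacency: (C, C) 0/1 matrix; adjacency[i, j] = 1 means category j is
    treated as correlated with category i. Rows may have any number of ones.
    seen_mask: (C,) bool, True for categories seen during training.
    """
    weights = adjacency * seen_mask[None, :]   # keep only seen neighbours
    degree = weights.sum(axis=1)
    neighbour_mean = np.divide(
        weights @ scores, degree,
        out=np.zeros_like(scores, dtype=float), where=degree > 0,
    )
    mixed = (1.0 - alpha) * scores + alpha * neighbour_mean
    # Seen categories, and unseen ones without seen neighbours, stay unchanged.
    return np.where(~seen_mask & (degree > 0), mixed, scores)
```

With this rule, a category whose evidence is spread over several patches averages over all of them, while a category matched by a single patch uses only that patch. A fixed top-k would either drop useful patches for the first kind of category or add noisy ones for the second.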

OpenSR: Open-Modality Speech Recognition Via Maintaining Multi-Modality Alignment

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition

Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition

Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Sample

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

Boosting Continuous Sign Language Recognition via Cross Modality Augmentation

Prior-aware Cross Modality Augmentation Learning for Continuous Sign Language Recognition

Multi-modal co-learning for silent speech recognition based on ultrasound tongue images

Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization

Learning adversarial semantic embeddings for zero-shot recognition in open worlds

MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization

Multi-Modal Zero-Shot Sign Language Recognition

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

OVMR: Open-Vocabulary Recognition with Multi-Modal References