Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel state-object compositions by leveraging the shared knowledge of their primitive components. Despite considerable progress, effectively calibrating the bias between semantically similar multimodal representations, as well as generalizing pre-trained knowledge to novel compositional contexts, remains an enduring challenge. In this paper, we revisit conditional transport (CT) theory and its homology to the visual-semantic interaction in CZSL, and propose a novel Trisets Consistency Alignment framework (dubbed TsCA) that addresses these issues. Concretely, we use three distinct yet semantically homologous sets, i.e., patches, primitives, and compositions, to construct pairwise CT costs that minimize their semantic discrepancies. To further ensure consistent transfer within these sets, we impose a cycle-consistency constraint that refines learning by guaranteeing the feature consistency of the self-mapping during transport flow, regardless of modality. Moreover, we extend the CT plans to an open-world setting, which enables the model to filter out infeasible pairs, thereby speeding up inference and increasing accuracy. Extensive experiments are conducted to verify the effectiveness of the proposed method.

What problem does this paper attempt to address?

This paper targets two key challenges in **Compositional Zero-Shot Learning (CZSL)**:

1. **Bias calibration between semantically similar multimodal representations**: A CZSL model must relate images to text descriptions, but existing methods are biased when aligning semantically similar multimodal representations, which hurts recognition of novel compositions.
2. **Generalizing pre-trained knowledge to novel compositional contexts**: A model can be trained on seen compositions, but making effective use of pre-trained knowledge on entirely unseen state-object compositions remains an open challenge.

To address these challenges, the paper proposes **Trisets Consistency Alignment (TsCA)**, a framework based on conditional transport (CT) theory that minimizes the semantic discrepancies among three distinct yet semantically homologous sets: the patch set, the primitive set, and the composition set. Specifically, TsCA works as follows:

- **Trisets consistency alignment**: Uses the local features of the image (patch set), global text concepts (composition set), and local text concepts (primitive set), and aligns these three sets pairwise through conditional transport.
- **Cycle-consistency constraint**: Ensures that feature consistency is maintained during transport, regardless of the change in modality.
- **Open-world extension of the CT plans**: Enables the model to filter out infeasible compositions, which accelerates inference and improves accuracy.

With these components, TsCA better captures the intrinsic relationships among images, compositions, and primitives, thereby improving performance on the CZSL task.

### Formula summary

1. **Probability distribution of the patch set**: \[ P_1=\sum_{n=1}^{N}\theta_n\delta_{x_n},\quad\theta_n=\frac{1}{N} \]
2. **Probability distribution of the composition set**: \[ P_2=\sum_{m=1}^{M}\alpha_m\delta_{y_m},\quad\alpha=\sigma(y^{\top}x_{CLS}) \]
3. **Probability distribution of the primitive set**: \[ P_3=\sum_{k=1}^{K}\beta_k\delta_{z_k},\quad\beta=\sigma(\beta_s\oplus\beta_o) \]
4. **Conditional transport distance**: \[ CT(P_1,P_2)=\min_{\overrightarrow{T},\overleftarrow{T}}\left(\sum_{n,m}\overrightarrow{t}_{nm}\,c(x_n,y_m)+\sum_{m,n}\overleftarrow{t}_{mn}\,c(y_m,x_n)\right) \]
5. **Cycle-consistency constraint**: \[ L_{cyc}=\sum_{m=1}^{M}y_m^{c}(T_{22}-I) \]
6. **Decoupling loss**: \[ L_{de}=\|\cos(x_{CLS}^{s},z_o^{gt})\|+\|\cos(x_{CLS}^{o},z_s^{gt})\| \]
7. **Overall training loss**: \[ L=\lambda_0L_{base}+\lambda_1CT+\lambda_2L_{cyc}+\lambda_3L_{de} \]
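As a rough illustration of how a bidirectional transport cost between the patch set and the composition set might be computed, the following pure-Python sketch evaluates a forward and a backward cost for toy 2-D features. Several pieces are assumptions not stated in the summary above: the cost c is taken to be one minus cosine similarity, σ is read as a softmax, and the plans use the softmax-navigator form common in the conditional transport literature rather than an explicit minimization. The function names (`cosine_cost`, `conditional_plan`, `ct_distance`) and the toy vectors are hypothetical.

```python
import math


def cosine_cost(u, v):
    """Cost c(u, v) = 1 - cosine similarity (an assumed cost function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_plan(src, src_w, dst, dst_w):
    """Plan t[i][j] = src_w[i] * pi(dst_j | src_i).

    Assumed navigator: pi(dst_j | src_i) is proportional to
    dst_w[j] * exp(-c(src_i, dst_j)). All weights must be strictly positive.
    """
    plan = []
    for x, wx in zip(src, src_w):
        logits = [math.log(wy) - cosine_cost(x, y) for y, wy in zip(dst, dst_w)]
        plan.append([wx * p for p in softmax(logits)])
    return plan


def ct_distance(xs, theta, ys, alpha):
    """Sum of the forward and backward expected transport costs."""
    forward = conditional_plan(xs, theta, ys, alpha)   # patches -> compositions
    backward = conditional_plan(ys, alpha, xs, theta)  # compositions -> patches
    total = 0.0
    for n, x in enumerate(xs):
        for m, y in enumerate(ys):
            total += (forward[n][m] + backward[m][n]) * cosine_cost(x, y)
    return total


# Toy data (hypothetical): N = 3 patch features, M = 2 composition features.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [1.0 / len(patches)] * len(patches)          # uniform weights, theta_n = 1/N
compositions = [[1.0, 0.1], [0.2, 1.0]]
x_cls = [1.0, 0.5]                                   # stand-in for the global image feature
alpha = softmax([sum(a * b for a, b in zip(y, x_cls)) for y in compositions])

print(f"CT distance: {ct_distance(patches, theta, compositions, alpha):.4f}")
```

Each row of the forward plan sums to the corresponding source weight, so both plans carry total mass one. The same function could be reused for the other two set pairs (patches with primitives, primitives with compositions) by swapping in the matching features and weights.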

TsCA: On the Semantic Consistency Alignment via Conditional Transport for Compositional Zero-Shot Learning
