Unlocking the Potential of Multimodal Unified Discrete Representation Through Training-Free Codebook Optimization and Hierarchical Alignment

Hai Huang, Yan Xia, Shengpeng Ji, Shulei Wang, Hanting Wang, Jieming Zhu, Zhenhua Dong, Zhou Zhao
DOI: https://doi.org/10.48550/arxiv.2403.05168
2024-01-01
Abstract: Recent advances in representation learning have demonstrated the significance of multimodal alignment. The Dual Cross-modal Information Disentanglement (DCID) model, utilizing a unified codebook, shows promising results in achieving fine-grained representation and cross-modal generalization. However, it is still hindered by equal treatment of all channels and neglect of minor event information, resulting in interference from irrelevant channels and limited performance in fine-grained tasks. Thus, in this work, we propose a Training-free Optimization of Codebook (TOC) method to enhance model performance by selecting important channels in the unified space without retraining. Additionally, we introduce the Hierarchical Dual Cross-modal Information Disentanglement (H-DCID) approach to extend information separation and alignment to two levels, capturing more cross-modal details. The experimental results demonstrate significant improvements across various downstream tasks, with TOC contributing to an average improvement of 1.70% on these tasks, and H-DCID surpassing DCID by an average of 3.64%. Combining TOC and H-DCID further enhances performance, exceeding DCID by 4.43%. These findings highlight the effectiveness of our methods in facilitating robust and nuanced cross-modal learning, opening avenues for future enhancements. The source code and pre-trained models can be accessed at https://github.com/haihuangcode/TOC_H-DCID.