Multi-Modal Knowledge Representation: A Survey.

Weiqi Hu, Ye Wang, Yan Jia
DOI: https://doi.org/10.1109/DSC59305.2023.00020
2023-01-01
Abstract: A modality is a channel through which people receive information, such as vision, hearing, or touch. A task or dataset that involves multiple modalities is called multi-modal. Research on multi-modal learning has grown steadily in recent years, especially in vision and language, where many studies combine advances from computer vision (CV) and natural language processing (NLP). With the development of knowledge graphs and representation, many works have also begun to use external knowledge to improve performance. To explore how multi-modal information and knowledge are utilized, this paper provides a comprehensive survey of multi-modal knowledge representation, a topic that has not previously been discussed systematically. To facilitate the discussion of how models combine and utilize multi-modal information and knowledge, we categorize multi-modal knowledge representation methods into two frameworks: explicit multi-modal knowledge representation and implicit multi-modal knowledge representation. For each framework, we introduce typical vision-and-language models to help new researchers in this field better understand multi-modal knowledge representation models. In addition, we introduce some downstream tasks and propose some important directions for future work.