Guest Editorial Introduction to the Issue on Pre-Trained Models for Multi-Modality Understanding

Wengang Zhou, Jiajun Deng, Niculae Sebe, Qi Tian, Alan L. Yuille, Concetto Spampinato, Zakia Hammal
DOI: https://doi.org/10.1109/tmm.2024.3384680
2024-01-01
Abstract: In the ever-evolving domain of multimedia, the significance of multi-modality understanding cannot be overstated. As multimedia content becomes increasingly sophisticated and ubiquitous, the ability to effectively combine and analyze the diverse information from different types of data, such as text, audio, image, video and point clouds, will be paramount in pushing the boundaries of what technology can achieve in understanding and interacting with the world around us. Accordingly, multi-modality understanding has attracted a tremendous amount of research, establishing itself as an emerging topic. Pre-trained models, in particular, have revolutionized this field, providing a way to leverage vast amounts of data without task-specific annotation to facilitate various downstream tasks.