Multi-modal Cognitive Computing

Xuelong LI
DOI: https://doi.org/10.1360/ssi-2022-0226
2022-01-01
Scientia Sinica Informationis
Abstract: The human brain perceives its surroundings through multiple sensory organs and integrates these multi-sensory perceptions into a comprehensive understanding. Inspired by synaesthesia, multi-modal cognitive computing endows machines with multi-sensory capabilities and has become key to achieving general artificial intelligence. With the explosive growth of multi-modal data such as images, video, text, and audio, a large number of methods have been developed to address this topic. However, the theoretical basis of multi-modal cognitive computing remains unclear. From the perspective of information theory, this paper establishes an information transmission model to profile the cognitive process. Based on the theory of information capacity, this study finds that multi-modal cognitive computing helps machines extract more information. In this way, research on multi-modal cognitive computing is unified under a common theoretical basis. The paper then reviews and discusses the development of typical tasks, including multi-modal correlation, cross-modal generation, and multi-modal collaboration. Finally, focusing on the opportunities and challenges facing multi-modal cognitive computing, it discusses several potential directions in depth and considers a number of open questions.
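The abstract does not spell out the paper's information transmission model, so the following is only a generic illustration of the underlying intuition, not the author's method. A standard information-theoretic identity, I(X;Y1,Y2) = I(X;Y1) + I(X;Y2|Y1) >= I(X;Y1), says that adding an observation channel never reduces the information available about a source. The sketch assumes a toy setup in which a fair binary source X is observed through two independent binary symmetric channels (Y1, Y2) standing in for two "modalities"; all function names and the flip probability are hypothetical choices for illustration.

```python
import itertools
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, idx):
    """Marginalize a joint {tuple: prob} onto the coordinate positions in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)


def mutual_information(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B); A and B are tuples of coordinate positions."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))


def two_noisy_views(flip=0.1):
    """Hypothetical toy model: joint distribution of (X, Y1, Y2), where X is a
    fair bit and Y1, Y2 are independent noisy copies of X, each flipped with
    probability `flip` (binary symmetric channels)."""
    joint = {}
    for x, y1, y2 in itertools.product((0, 1), repeat=3):
        p1 = 1 - flip if y1 == x else flip
        p2 = 1 - flip if y2 == x else flip
        joint[(x, y1, y2)] = 0.5 * p1 * p2
    return joint


joint = two_noisy_views(0.1)
i_one = mutual_information(joint, (0,), (1,))    # information from one view
i_two = mutual_information(joint, (0,), (1, 2))  # information from both views
print(f"I(X;Y1)    = {i_one:.3f} bits")
print(f"I(X;Y1,Y2) = {i_two:.3f} bits")
```

In this toy case the second noisy view raises the information about X from roughly 0.53 to roughly 0.74 bits, which matches the inequality above; how closely this mirrors the paper's own capacity argument cannot be judged from the abstract alone.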