Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions

Anil Rahate, Rahee Walambe, Sheela Ramanna, Ketan Kotecha
DOI: https://doi.org/10.1016/j.inffus.2021.12.003
2022-01-04
Abstract: Multimodal deep learning systems, which employ multiple modalities such as text, image, audio, and video, are showing better performance than single-modality (i.e., unimodal) systems. Multimodal machine learning involves multiple aspects: representation, translation, alignment, fusion, and co-learning. In the current state of multimodal machine learning, the assumptions are that all modalities are present, aligned, and noiseless during training and testing time. However, in real-world tasks, it is typically observed that one or more modalities are missing, noisy, lacking annotated data, have unreliable labels, or are scarce in training, testing, or both. This challenge is addressed by a learning paradigm called multimodal co-learning. The modeling of a (resource-poor) modality is aided by exploiting knowledge from another (resource-rich) modality using transfer of knowledge between modalities, including their representations and predictive models. Because co-learning is an emerging area, there are no dedicated reviews explicitly focusing on all challenges addressed by co-learning. To that end, in this work, we provide a comprehensive survey on the emerging area of multimodal co-learning that has not been explored in its entirety yet. We review implementations that overcome one or more co-learning challenges without explicitly considering them as co-learning challenges. We present a comprehensive taxonomy of multimodal co-learning based on the challenges addressed by co-learning and associated implementations. The various techniques employed, including the latest ones, are reviewed along with some of the applications and datasets. Our final goal is to discuss challenges and perspectives along with the important ideas and directions for future work that we hope will be beneficial for the entire research community focusing on this exciting domain.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problems that multimodal machine learning systems face in real-world tasks: missing modalities, noisy data, insufficient labeled data, unreliable labels, and scarce modality data in the training phase, the testing phase, or both. These problems are common in practical applications, such as processing speech and gestures under harsh acoustic and visual conditions, or recognizing images under varying lighting conditions. To address them, the paper surveys a learning paradigm called multimodal co-learning. This paradigm transfers knowledge between modalities: a resource-rich modality assists in modeling a resource-poor one, improving the performance and robustness of the system. Specifically, the paper focuses on the following aspects:

1. **Existence of Modalities**: How to construct effective multimodal models when one or more modalities are missing during training or testing.
2. **Noisy Modalities**: How to handle data and labels that contain noise.
3. **Modality Annotations**: How to learn when the modality data is partially labeled or unlabeled.
4. **Domain Adaptation**: How to transfer a model effectively when the training and testing datasets or domains differ.
5. **Interpretability and Fairness**: How to ensure that the model's predictions are both interpretable and unbiased.

Through these topics, the paper aims to provide a comprehensive review of multimodal co-learning, including the latest progress, challenges, datasets, and applications, and to suggest directions for future research.
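To make the idea of knowledge transfer from a resource-rich to a resource-poor modality more concrete, here is a minimal sketch of one commonly used transfer technique, knowledge distillation. It is an illustration chosen for this summary, not a method stated in the abstract; the function names, the temperature value, and the example logits are all hypothetical. A "teacher" model trained on the rich modality produces softened class probabilities, and a "student" model on the poor modality is penalized for diverging from them.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions.

    Assumption: both models predict over the same set of classes.
    """
    p = softmax(teacher_logits, temperature)  # resource-rich modality (teacher)
    q = softmax(student_logits, temperature)  # resource-poor modality (student)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical logits for three classes from two modality-specific models.
    teacher = [2.0, 0.5, -1.0]  # e.g. a well-trained video model
    student = [0.1, 0.2, 0.3]   # e.g. an audio model with little labeled data
    print("distillation loss:", distillation_loss(teacher, student))
```

In a training loop, a term like this would typically be added to the student's ordinary supervised loss, so that the resource-poor modality can still learn from the teacher's outputs on samples where its own labels are missing or unreliable.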