Cross-Modal Learning - The Learning Methodology Inspired by Human's Intelligence

Bo Zhang, Dayong Ding, Ling Zhang
2007-01-01
Abstract: Humans have an amazing cross-modal learning capability. To endow computers with the same ability, we use a model based on quotient space theory. In the quotient space model, representations in different modalities form a complete semi-order lattice, which makes translation from one modality to another easier. The model is therefore suitable as a mathematical model of cross-modal learning. Taking video retrieval as an example, we show how to apply the cross-modal learning strategy to this field. The first problem of cross-modal learning in video retrieval is how to represent a video (its content) so that the videos the user expects can be retrieved from a collection both precisely and completely. A video can be represented in different modalities such as image, speech, and text. Each modality can in turn be represented in several forms with different grain sizes. Research has shown that, in the image modality, the choice of grain size trades off precision against recall, and that multi-level features may improve both. However, using only one modality for video retrieval is not enough, so speech and keywords are used as well. One strategy for cross-modal learning is to integrate information from different sense modalities. The second problem is therefore how to integrate the results from different modalities, that is, the feature binding or information fusion problem. Multi-classifier techniques are discussed. We may consider each modality as a projection of the same object (the video) and integrate information from the projections. Specifically, we propose the Probabilistic Model Supported Rank Aggregation (PMSRA) method to accomplish this integration. Theoretical analysis and experimental results show that cross-modal learning can significantly improve the performance of machine learning and that the quotient space model is a powerful tool for it.
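To make the idea of integrating per-modality results concrete, the following is a minimal sketch of rank aggregation. The abstract does not specify how PMSRA works, so this sketch substitutes a plain Borda count as a stand-in; it is not the authors' method. The function name `borda_aggregate`, the modality names, and the video ids are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Fuse per-modality ranked lists with a Borda count.

    This is an illustrative stand-in for rank aggregation in general,
    NOT the PMSRA method named in the abstract.
    `rankings` maps a modality name to a list of video ids, best first.
    """
    # Every video that appears in at least one modality's list.
    candidates = {v for ranking in rankings.values() for v in ranking}
    n = len(candidates)
    scores = defaultdict(float)
    for ranking in rankings.values():
        for position, video in enumerate(ranking):
            # The top entry earns n-1 points, the next n-2, and so on.
            scores[video] += n - 1 - position
        # Videos absent from this modality's list earn nothing from it.
    # Sort by descending score; break ties by video id for determinism.
    return sorted(candidates, key=lambda v: (-scores[v], v))


# Hypothetical ranked results from three modalities of the same query.
rankings = {
    "image": ["v3", "v1", "v2"],
    "speech": ["v1", "v3", "v4"],
    "text": ["v1", "v2", "v3"],
}
fused = borda_aggregate(rankings)
print(fused)
```

Each modality here plays the role of one projection of the same video collection, and the fused list is a single ranking built from those projections. A probabilistic scheme such as the one the abstract proposes would presumably weight the modalities by some model of their reliability, which this unweighted count does not attempt.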