Abstract: Humans have a remarkable capacity for cross-modal learning. To endow computers with the same ability, we use a model based on quotient space theory. In the quotient space model, representations of different modalities form a complete semi-order lattice, which makes translation from one modality to another easier. The model is therefore a suitable mathematical model of cross-modal learning. Taking video retrieval as an example, we show how to apply the cross-modal learning strategy to this field. The first problem of cross-modal learning in video retrieval is how to represent a video (its content) so that the videos a user expects can be found in a collection precisely and completely. A video can be represented in different modalities such as image, speech, and text. Each modality can be represented in several forms with different grain sizes. Research has shown that the grain size in the image modality trades off precision against recall, and that multi-level features may improve both. However, using only one modality for video retrieval is not enough, so speech and keywords are used as well. One strategy for cross-modal learning is to integrate information from different sense modalities. The second problem is therefore how to integrate the results from different modalities, that is, the feature binding or information fusion problem. Multi-classifier techniques are discussed. We may consider each modality as a projection of the same object (the video) and integrate information from the projections. Specifically, we propose the Probabilistic Model Supported Rank Aggregation (PMSRA) method to accomplish this integration. Theoretical analysis and experimental results show that cross-modal learning can significantly improve the performance of machine learning and that the quotient space model is a powerful tool for it.
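
The abstract does not specify how PMSRA works, so the following is only a minimal sketch of the general idea of rank aggregation across modalities, using a plain Borda count as a stand-in. It is not the PMSRA method. The function name `borda_aggregate`, the modality names, and the video ids are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Fuse per-modality ranked lists with a plain Borda count.

    rankings: dict mapping a modality name to a list of video ids, best first.
    Returns all video ids sorted by total Borda score, highest first.
    NOTE: a simple stand-in for illustration, not the PMSRA method.
    """
    # Every video returned by at least one modality is a candidate.
    candidates = sorted({vid for ranked in rankings.values() for vid in ranked})
    n = len(candidates)
    scores = defaultdict(int)
    for ranked in rankings.values():
        for position, vid in enumerate(ranked):
            # Top position earns n points, the next n - 1, and so on.
            scores[vid] += n - position
        # A video that a modality did not return gets no points from it.
    # Ties are broken by video id so the output is deterministic.
    return sorted(candidates, key=lambda vid: (-scores[vid], vid))


# Hypothetical retrieval results for one query, one ranked list per modality.
rankings = {
    "image": ["v3", "v1", "v2"],
    "speech": ["v1", "v3", "v4"],
    "keyword": ["v1", "v2", "v3"],
}
fused = borda_aggregate(rankings)
print(fused)  # ['v1', 'v3', 'v2', 'v4']
```

Each modality here plays the role of one projection of the same video, and the fused list combines evidence from all of them. A probabilistic approach would presumably weight modalities by their reliability rather than treating them equally as this sketch does.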

Weak model image classification and object detection with affluent strong model information

A Model-Agnostic Framework for Universal Anomaly Detection of Multi-organ and Multi-modal Images

Learning by Actively Querying Strong Modal Features

Improving Fine-grained Image Classification with Multimodal Information

Robust object recognition via weakly supervised metric and template learning.

An Image Object Detection Model Based on Mixed Attention Mechanism Optimized YOLOv5

Weakly supervised object-aware convolutional neural networks for semantic feature matching

What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

Weakly Aligned Feature Fusion for Multimodal Object Detection

Weakly Supervised Complementary Parts Models for Fine-Grained Image Classification from the Bottom Up

Multi-modal Learning with Missing Modality via Shared-Specific Feature Modelling

Object Recognition via Adaptive Multi-level Feature Integration

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

Weakly Paired Multimodal Fusion for Object Recognition.

Multimodal Image Aesthetic Prediction with Missing Modality

Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion

Fine-Grained Scene Image Classification with Modality-Agnostic Adapter

Auxiliary Information Regularized Machine for Multiple Modality Feature Learning

Cross-Modal Learning - The Learning Methodology Inspired by Human's Intelligence

A Multimodal Feature Representation Model for Transfer-Learning-Based Identification of Images

On-the-fly Modulation for Balanced Multimodal Learning