Abstract: Multimodal learning is a rapidly growing research field that has revolutionized multitasking and generative modeling in AI. While much of the research has focused on unstructured data (e.g., language, images, audio, or video), structured data (e.g., tabular data, time series, or signals) has received less attention. However, many industry-relevant use cases involve, or can benefit from, both types of data. In this work, we propose a modular, end-to-end multimodal learning method called MAGNUM, which can natively handle both structured and unstructured data. MAGNUM is flexible enough to employ any specialized unimodal module to extract, compress, and fuse information from all available modalities.

What problem does this paper attempt to address?

This paper addresses the challenge of processing structured data (such as tabular data, time series, or signals) together with unstructured data (such as language, images, audio, or video) in multimodal learning. Although most current research focuses on unstructured data, many industry application scenarios require both types. For example, in healthcare, combining patient records (structured data) with diagnostic images and doctors' notes (unstructured data) can improve diagnostic accuracy and support personalized treatment plans; in retail, natural-language product descriptions can be combined with historical sales data for demand forecasting; in finance, combining text reports with historical price and volume data is crucial for predicting asset prices. However, applying multimodal learning to structured and unstructured data raises several challenges, including a larger number of distinct modalities, larger inputs, and greater data heterogeneity. Most existing multimodal models exploit the semantics shared between modalities through joint pre-training in a shared semantic space, but these approaches are usually restricted to unstructured data and offer limited support for structured data. Some models do attempt a joint representation of structured and unstructured data, but their scope is usually limited to specific tasks, such as retrieving database entries from natural-language queries, and adapting them to other downstream tasks requires substantial engineering work.
Therefore, the paper proposes a modular, end-to-end multimodal learning method, MAGNUM (Modality-AGNostic mUltimodal Modular architecture), which aims to natively process structured and unstructured data and can flexibly use any specialized unimodal module to extract, compress, and fuse information from all available modalities. According to the summary, this method overcomes the limitations of existing methods in handling structured data and provides a more systematic and general framework for multimodal representation learning, suitable for a range of industry application scenarios.
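
The extract, compress, and fuse stages described above can be illustrated with a toy sketch. This is not the MAGNUM implementation: every name here (`encode_tabular`, `encode_text`, `compress`, `fuse`) is hypothetical, the encoders are hand-written stand-ins for the learned unimodal modules the paper refers to, and fusion by concatenation is an assumption made only to keep the example short.

```python
from typing import Callable, Dict, List

Vector = List[float]


def encode_tabular(row: List[float]) -> Vector:
    """Hypothetical structured-data extractor: pass numeric features through."""
    return [float(x) for x in row]


def encode_text(text: str) -> Vector:
    """Hypothetical unstructured-data extractor: character and word counts."""
    return [float(len(text)), float(len(text.split()))]


def compress(features: Vector, size: int) -> Vector:
    """Toy compression: average-pool the features into `size` buckets."""
    if not features:
        return [0.0] * size
    buckets: List[Vector] = [[] for _ in range(size)]
    for i, value in enumerate(features):
        buckets[i * size // len(features)].append(value)
    # Empty buckets (fewer features than buckets) are filled with zeros.
    return [sum(b) / len(b) if b else 0.0 for b in buckets]


def fuse(sample: Dict[str, object],
         encoders: Dict[str, Callable[..., Vector]],
         size: int = 2) -> Vector:
    """Assumed fusion step: concatenate one fixed-size vector per modality.

    Modalities are visited in sorted name order; a modality missing from
    the sample contributes a zero vector so the output length stays fixed.
    """
    fused: Vector = []
    for name in sorted(encoders):
        if name in sample:
            fused.extend(compress(encoders[name](sample[name]), size))
        else:
            fused.extend([0.0] * size)
    return fused


# Any unimodal encoder can be plugged in by adding an entry here.
ENCODERS: Dict[str, Callable[..., Vector]] = {
    "tabular": encode_tabular,
    "text": encode_text,
}
```

Calling `fuse({"tabular": [1.0, 3.0, 5.0, 7.0], "text": "two words"}, ENCODERS)` returns a single fixed-length vector that a downstream predictor could consume. The modularity in this sketch comes only from the `ENCODERS` mapping: swapping an encoder does not change `compress` or `fuse`.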

A Modular End-to-End Multimodal Learning Method for Structured and Unstructured Data
