A Modular End-to-End Multimodal Learning Method for Structured and Unstructured Data

Marco D'Alessandro, Enrique Calabrés, Mikel Elkano
2024-03-08
Abstract: Multimodal learning is a rapidly growing research field that has revolutionized multitasking and generative modeling in AI. While much of the research has focused on dealing with unstructured data (e.g., language, images, audio, or video), structured data (e.g., tabular data, time series, or signals) has received less attention. However, many industry-relevant use cases involve or can benefit from both types of data. In this work, we propose a modular, end-to-end multimodal learning method called MAGNUM, which can natively handle both structured and unstructured data. MAGNUM is flexible enough to employ any specialized unimodal module to extract, compress, and fuse information from all available modalities.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the challenge of processing structured data (such as tabular data, time series, or signals) together with unstructured data (such as language, images, audio, or video) in multimodal learning. Most current research focuses on unstructured data, yet many industry-related applications require both types. In healthcare, combining patient records (structured data) with diagnostic images and doctors' notes (unstructured data) can improve diagnostic accuracy and support personalized treatment plans. In retail, natural-language product descriptions can be combined with historical sales data for demand forecasting. In finance, combining text reports with historical price and volume data is crucial for predicting asset prices.
Applying multimodal learning to structured and unstructured data raises several challenges: a larger number of distinct modalities, larger inputs, and greater data heterogeneity. Most existing multimodal models exploit the semantics shared between modalities through joint pre-training in a common semantic space, but these approaches are usually restricted to unstructured data and offer limited support for structured data. Some models do attempt a joint representation of structured and unstructured data, but their scope is usually limited to specific tasks, such as retrieving database entries from natural-language queries, and adapting them to other downstream tasks requires substantial engineering work.
The paper therefore proposes MAGNUM (Modality-AGNostic mUltimodal Modular architecture), a modular, end-to-end multimodal learning method that natively processes structured and unstructured data and can use any specialized unimodal module to extract, compress, and fuse information from all available modalities. According to the paper, this overcomes the limitations of existing methods in handling structured data and provides a more systematic and general framework for multimodal representation learning, suitable for a range of industry applications.
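The extract, compress, and fuse stages described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`tabular_extractor`, `text_extractor`, `mean_compress`, `concat_fuse`, `ModularPipeline`) is hypothetical, the toy extractors stand in for real specialized unimodal models, and mean pooling and concatenation are simple stand-ins chosen for illustration, since this summary does not specify the mechanisms MAGNUM actually uses. The sketch only shows how per-modality modules could be plugged into a shared pipeline.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def tabular_extractor(row: Sequence[float]) -> List[Vector]:
    """Toy structured-data extractor: one 2-d token per column (value, bias)."""
    return [[float(v), 1.0] for v in row]


def text_extractor(text: str) -> List[Vector]:
    """Toy unstructured-data extractor: one 2-d token per word
    (word length, vowel count). A real module would be a language model."""
    return [
        [float(len(w)), float(sum(c in "aeiou" for c in w.lower()))]
        for w in text.split()
    ]


def mean_compress(tokens: List[Vector]) -> Vector:
    """Compress a variable-length token list into one fixed-size vector
    by mean pooling (an assumption, not the paper's compression step)."""
    if not tokens:
        return []
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def concat_fuse(parts: List[Vector]) -> Vector:
    """Fuse per-modality vectors by concatenation (also an assumption)."""
    return [x for part in parts for x in part]


class ModularPipeline:
    """Applies extract -> compress per modality, then fuses the results."""

    def __init__(
        self,
        extractors: Dict[str, Callable[..., List[Vector]]],
        compress: Callable[[List[Vector]], Vector] = mean_compress,
        fuse: Callable[[List[Vector]], Vector] = concat_fuse,
    ) -> None:
        self.extractors = extractors
        self.compress = compress
        self.fuse = fuse

    def __call__(self, sample: Dict[str, object]) -> Vector:
        parts = []
        # Sorted order keeps the fused layout stable across samples.
        for name in sorted(self.extractors):
            if name in sample:  # modalities missing from a sample are skipped
                tokens = self.extractors[name](sample[name])
                parts.append(self.compress(tokens))
        return self.fuse(parts)


pipeline = ModularPipeline({"tabular": tabular_extractor, "text": text_extractor})
fused = pipeline({"tabular": [1.0, 3.0], "text": "hello world"})
print(fused)
```

Because each stage is passed in as a plain callable, swapping in a different extractor for a new modality, or a different fusion function, does not require changing the pipeline itself, which is the kind of modularity the summary attributes to MAGNUM.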