Multimodal Representation Learning With Text and Images

Aishwarya Jayagopal, Ankireddy Monica Aiswarya, Ankita Garg, Srinivasan Kolumam Nandakumar
DOI: https://doi.org/10.48550/arXiv.2205.00142
2022-04-30
Abstract: In recent years, multimodal AI has seen an upward trend as researchers are integrating data of different types, such as text, images, and speech, into modelling to get the best results. This project leverages multimodal AI and matrix factorization techniques for representation learning on text and image data simultaneously, thereby employing the widely used techniques of Natural Language Processing (NLP) and Computer Vision. The learnt representations are evaluated using downstream classification and regression tasks. The methodology adopted can be extended beyond the scope of this project as it uses Auto-Encoders for unsupervised representation learning.
Machine Learning, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to learn effective representations from multimodal data, specifically text and images. It focuses on using matrix factorization techniques and auto-encoders to perform representation learning on text and image data simultaneously, without requiring a large amount of labeled data. This matters for large-scale unlabeled data sets, because manual labeling is time-consuming and costly. The paper proposes a new end-to-end architecture, the Multi-Modal Encoder Decoder Architecture (MMEDA), which aims to overcome limitations of existing methods, such as handling only data of specific dimensions or relying on pre-trained models for feature extraction. The method's effectiveness is evaluated on downstream classification and regression tasks through experiments on the Goodreads data set.
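To make the idea of unsupervised representation learning with an auto-encoder over two modalities concrete, here is a minimal sketch in Python. It is not the MMEDA architecture from the paper: it is a toy linear auto-encoder that concatenates one text vector and one image vector per item, compresses them into a shared latent code, and is trained only on reconstruction error, so no labels are needed. The function name `train_multimodal_autoencoder`, the feature dimensions, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np


def train_multimodal_autoencoder(text_feats, image_feats, latent_dim=4,
                                 lr=0.5, epochs=300, seed=0):
    """Toy linear auto-encoder over concatenated text and image vectors.

    Hypothetical helper for illustration only; not the paper's MMEDA.
    Returns the encoder weights, decoder weights, and per-epoch MSE losses.
    """
    rng = np.random.default_rng(seed)
    # Fuse the two modalities by simple concatenation: (n, d_text + d_image).
    x = np.hstack([text_feats, image_feats])
    n, d = x.shape
    w_enc = rng.normal(scale=0.1, size=(d, latent_dim))
    w_dec = rng.normal(scale=0.1, size=(latent_dim, d))
    losses = []
    for _ in range(epochs):
        z = x @ w_enc            # shared latent representation
        x_hat = z @ w_dec        # reconstruction of both modalities
        err = x_hat - x
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the mean squared reconstruction error.
        grad_out = 2.0 * err / err.size
        grad_dec = z.T @ grad_out
        grad_enc = x.T @ (grad_out @ w_dec.T)
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc, w_dec, losses


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    text = rng.normal(size=(64, 12))    # stand-in for text inputs
    image = rng.normal(size=(64, 20))   # stand-in for image inputs
    w_enc, _, losses = train_multimodal_autoencoder(text, image)
    embeddings = np.hstack([text, image]) @ w_enc
    print(embeddings.shape, losses[0], losses[-1])
```

The resulting `embeddings` array plays the role of the learnt representation: in a setup like the one the paper describes, such vectors would be passed to a separate classifier or regressor for the downstream evaluation.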