AutoM3L: An Automated Multimodal Machine Learning Framework with Large Language Models

Daqin Luo, Chengjian Feng, Yuxuan Nong, Yiqing Shen
2024-08-02
Abstract: Automated Machine Learning (AutoML) offers a promising approach to streamline the training of machine learning models. However, existing AutoML frameworks are often limited to unimodal scenarios and require extensive manual configuration. Recent advancements in Large Language Models (LLMs) have showcased their exceptional abilities in reasoning, interaction, and code generation, presenting an opportunity to develop a more automated and user-friendly framework. To this end, we introduce AutoM3L, an innovative Automated Multimodal Machine Learning framework that leverages LLMs as controllers to automatically construct multimodal training pipelines. AutoM3L comprehends data modalities and selects appropriate models based on user requirements, providing automation and interactivity. By eliminating the need for manual feature engineering and hyperparameter optimization, our framework simplifies user engagement and enables customization through directives, addressing the limitations of previous rule-based AutoML approaches. We evaluate the performance of AutoM3L on six diverse multimodal datasets spanning classification, regression, and retrieval tasks, as well as a comprehensive set of unimodal datasets. The results demonstrate that AutoM3L achieves competitive or superior performance compared to traditional rule-based AutoML methods. Furthermore, a user study highlights the user-friendliness and usability of our framework, compared to the rule-based AutoML methods.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the limitations of existing Automated Machine Learning (AutoML) frameworks when dealing with multimodal data. Existing AutoML solutions focus mainly on unimodal data and require a large amount of manual configuration. This limits their flexibility and ease of use, especially when heterogeneous data from different sources must be processed together, such as combining tabular product information with related images and text descriptions, or integrating users' photos, texts, and transaction records in the financial field. Existing multimodal AutoML tools, such as AutoGluon, address some of these problems but still suffer from a low degree of automation, a steep learning curve, limited adaptability, and poor scalability.

To solve these problems, the paper proposes a framework named AutoM3L, which uses large language models (LLMs) as controllers to automatically construct multimodal training pipelines. The main features of AutoM3L include:

1. **Modality Inference**: An LLM automatically identifies the modality of each attribute in a structured table, simplifying the modality identification process.
2. **Automatic Feature Engineering**: An LLM filters out irrelevant or redundant attributes and imputes missing data, reducing manual intervention and improving the quality of the input data.
3. **Model Selection**: Appropriate models are selected automatically according to user requirements and data modalities, providing highly customized solutions.
4. **Pipeline Assembly**: Executable scripts are generated to perform cross-modal feature fusion, simplifying the construction of multimodal machine learning pipelines.
5. **Hyperparameter Optimization**: Suggestions generated by the LLM are combined with external API calls to tune hyperparameters efficiently, eliminating the need for manual exploration.

Through these functions, AutoM3L aims to reduce the effort required from users and to improve the automation level and user experience of multimodal machine learning tasks. Experimental results on multiple multimodal and unimodal datasets show that AutoM3L matches or outperforms traditional rule-based AutoML methods.
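To make the first step (identifying the modality of each table attribute with an LLM) more concrete, here is a minimal Python sketch. It is an illustration only, not AutoM3L's actual prompts or code: the function names `build_modality_prompt` and `infer_modalities`, the modality label set, the JSON reply format, and the `fake_llm` stand-in are all assumptions made for this example. A real setup would replace `fake_llm` with a call to an actual LLM.

```python
import json
from typing import Callable, Dict, List

# Assumed label set for this sketch; the framework's real label set may differ.
MODALITIES = ("image", "text", "numerical", "categorical")


def build_modality_prompt(columns: Dict[str, List[str]]) -> str:
    """Format column names plus a few sample values into a prompt (hypothetical format)."""
    lines = [
        "Assign each column one modality from: " + ", ".join(MODALITIES) + ".",
        "Answer with a JSON object mapping column name to modality.",
        "",
    ]
    for name, samples in columns.items():
        # Show at most three sample values per column to keep the prompt short.
        lines.append(f"- {name}: {samples[:3]}")
    return "\n".join(lines)


def infer_modalities(
    columns: Dict[str, List[str]], llm: Callable[[str], str]
) -> Dict[str, str]:
    """Ask the LLM for a modality per column and validate its JSON reply."""
    reply = llm(build_modality_prompt(columns))
    parsed = json.loads(reply)
    result = {}
    for name in columns:
        label = parsed.get(name)
        if label not in MODALITIES:
            raise ValueError(f"No valid modality returned for column {name!r}")
        result[name] = label
    return result


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned answer for the demo table."""
    return json.dumps(
        {"photo": "image", "description": "text", "price": "numerical"}
    )


if __name__ == "__main__":
    table = {
        "photo": ["img/001.jpg"],
        "description": ["A red wooden chair"],
        "price": ["19.9"],
    }
    print(infer_modalities(table, fake_llm))
```

Validating the reply against a fixed label set is a design choice of this sketch: it lets later steps, such as model selection, rely on a known vocabulary rather than free-form LLM output.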