Abstract: Automated Machine Learning (AutoML) offers a promising approach to streamline the training of machine learning models. However, existing AutoML frameworks are often limited to unimodal scenarios and require extensive manual configuration. Recent advancements in Large Language Models (LLMs) have showcased their exceptional abilities in reasoning, interaction, and code generation, presenting an opportunity to develop a more automated and user-friendly framework. To this end, we introduce AutoM3L, an innovative Automated Multimodal Machine Learning framework that leverages LLMs as controllers to automatically construct multimodal training pipelines. AutoM3L comprehends data modalities and selects appropriate models based on user requirements, providing automation and interactivity. By eliminating the need for manual feature engineering and hyperparameter optimization, our framework simplifies user engagement and enables customization through directives, addressing the limitations of previous rule-based AutoML approaches. We evaluate the performance of AutoM3L on six diverse multimodal datasets spanning classification, regression, and retrieval tasks, as well as a comprehensive set of unimodal datasets. The results demonstrate that AutoM3L achieves competitive or superior performance compared to traditional rule-based AutoML methods. Furthermore, a user study highlights the user-friendliness and usability of our framework, compared to the rule-based AutoML methods.

What problem does this paper attempt to address?

This paper addresses the limitations of existing Automated Machine Learning (AutoML) frameworks in handling multimodal data. Existing AutoML solutions focus mainly on unimodal data and require extensive manual configuration. This limits their flexibility and ease of use, especially when heterogeneous data from different sources must be processed together, for example combining tabular product information with related images and text descriptions, or integrating users' photos, texts, and transaction records in the financial field. Existing multimodal AutoML tools such as AutoGluon address some of these problems, but still suffer from a low degree of automation, a steep learning curve, limited adaptability, and poor scalability.

To solve these problems, the paper proposes AutoM3L, a framework that uses large language models (LLMs) as controllers to automatically construct multimodal training pipelines. Its main features are:

1. **Modality Inference**: An LLM automatically identifies the modality of each attribute in the structured table, simplifying modality identification.
2. **Automatic Feature Engineering**: An LLM filters out irrelevant or redundant attributes and imputes missing data, reducing manual intervention and improving the quality of the input data.
3. **Model Selection**: Appropriate models are selected automatically according to user requirements and data modalities, providing highly customized solutions.
4. **Pipeline Assembly**: Executable scripts are generated to perform cross-modal feature fusion, simplifying the construction of multimodal machine learning pipelines.
5. **Hyperparameter Optimization**: Suggestions generated by the LLM are combined with external API calls to tune hyperparameters efficiently, eliminating the need for manual exploration.

Through these functions, AutoM3L aims to reduce the effort required from users and to improve the automation level and user experience of multimodal machine learning tasks. Experimental results on multiple multimodal and unimodal datasets show that AutoM3L is comparable to, or even outperforms, traditional rule-based AutoML methods.
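To make the flow of the stages listed above more concrete, here is a minimal Python sketch of the first four (modality inference, feature filtering and imputation, model selection, and pipeline assembly). It is not AutoM3L's implementation: the LLM controller is replaced by a hand-written heuristic (`stub_controller`), and the function names, the `MODEL_ZOO` entries, the `"concat"` fusion choice, and the sample rows are all hypothetical, chosen only for illustration. Hyperparameter optimization is left out because the summary does not say which external API is used.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

def stub_controller(column: str, samples: List[object]) -> str:
    """Stand-in for an LLM call: guess a column's modality from its values."""
    values = [v for v in samples if v is not None]
    if values and all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    if values and all(str(v).lower().endswith((".jpg", ".jpeg", ".png")) for v in values):
        return "image"
    if values and all(len(str(v).split()) >= 4 for v in values):
        return "text"
    return "categorical"

def infer_modalities(rows: List[Row],
                     controller: Callable[[str, List[object]], str] = stub_controller) -> Dict[str, str]:
    """Stage 1: assign a modality to every column of the table."""
    return {col: controller(col, [r.get(col) for r in rows]) for col in rows[0]}

def drop_uninformative(rows: List[Row], modalities: Dict[str, str], label: str) -> Dict[str, str]:
    """Stage 2a: drop columns that hold a single distinct value (the label is always kept)."""
    kept = {}
    for col, modality in modalities.items():
        distinct = {r.get(col) for r in rows if r.get(col) is not None}
        if col == label or len(distinct) > 1:
            kept[col] = modality
    return kept

def impute(rows: List[Row], modalities: Dict[str, str]) -> List[Row]:
    """Stage 2b: fill missing numerical values with the column mean, others with ''."""
    filled = [{col: r.get(col) for col in modalities} for r in rows]
    for col, modality in modalities.items():
        present = [r[col] for r in filled if r[col] is not None]
        if modality == "numerical":
            default = sum(present) / len(present) if present else 0.0
        else:
            default = ""
        for r in filled:
            if r[col] is None:
                r[col] = default
    return filled

# Hypothetical model zoo: one encoder name per modality.
MODEL_ZOO = {"image": "image_encoder", "text": "text_encoder",
             "numerical": "tabular_mlp", "categorical": "tabular_mlp"}

def build_plan(rows: List[Row], label: str) -> Dict[str, object]:
    """Stages 1-4: produce a declarative pipeline plan instead of a training script."""
    modalities = drop_uninformative(rows, infer_modalities(rows), label)
    encoders = {col: MODEL_ZOO[m] for col, m in modalities.items() if col != label}  # Stage 3
    return {"modalities": modalities, "rows": impute(rows, modalities),
            "encoders": encoders, "fusion": "concat", "label": label}  # Stage 4

# Made-up sample table mixing image paths, text, and numbers.
rows = [
    {"source": "web", "photo": "a.jpg", "description": "red cotton shirt with short sleeves", "price": 19.5, "label": "shirt"},
    {"source": "web", "photo": "b.png", "description": "blue denim jeans with slim fit", "price": None, "label": "jeans"},
    {"source": "web", "photo": "c.jpg", "description": "black leather boots with rubber sole", "price": 80.5, "label": "boots"},
]
plan = build_plan(rows, label="label")
print(plan["modalities"])
print(plan["encoders"])
```

In this sketch the constant `source` column is filtered out, the missing `price` is filled with the mean of the observed prices, and each remaining input column is mapped to an encoder by modality. Swapping `stub_controller` for a function that queries an LLM would leave the rest of the plan-building code unchanged, which is the main point of keeping the controller behind a plain callable.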

AutoM3L: An Automated Multimodal Machine Learning Framework with Large Language Models

UniAutoML: A Human-Centered Framework for Unified Discriminative and Generative AutoML with Large Language Models

AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

InfMLLM: A Unified Framework for Visual-Language Tasks

A Survey on Multimodal Large Language Models for Autonomous Driving

OneLLM: One Framework to Align All Modalities with Language

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

AutoGluon-Multimodal (AutoMM): Supercharging Multimodal AutoML with Foundation Models

AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

MM-LLMs: Recent Advances in MultiModal Large Language Models

Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

AutoProteinEngine: A Large Language Model Driven Agent Framework for Multimodal AutoML in Protein Engineering

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

AutoManual: Constructing Instruction Manuals by LLM Agents via Interactive Environmental Learning

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

A Superalignment Framework in Autonomous Driving with Large Language Models