Abstract: As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires a transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), which are designed to generate and execute actions within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence. In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide to the key stages of LAM development: data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications. The code for the data collection process used in this paper is publicly available at https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/.

Multimodal Large Models Are Effective Action Anticipators

Grounding Multimodal Large Language Models in Actions

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

VideoLLM: Modeling Video Sequence with Large Language Models

PALM: Predicting Actions through Language Models

TR-LLM: Integrating Trajectory Data for Scene-Aware LLM-Based Human Action Prediction

ST-LLM: Large Language Models Are Effective Temporal Learners

Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

Understanding Long Videos with Multimodal Language Models

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

LLMs are Good Action Recognizers

Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks?

Temporal Grounding of Activities using Multimodal Large Language Models

Action Anticipation in First-Person Videos with Self-Attention Based Multi-Modal Network

Large Multimodal Agents: A Survey

Large Action Models: From Inception to Implementation

Large Motion Model for Unified Multi-Modal Motion Generation

Complex Video Action Reasoning Via Learnable Markov Logic Network

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs

LMEye: An Interactive Perception Network for Large Language Models