Abstract: While open-source multi-modal language models perform well on simple question answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions. We present TACO, a family of multi-modal large action models designed to improve performance on such complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction tuning data with only direct answers. Our model TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% on MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning. Training on high-quality CoTA traces sets a new standard for complex multi-modal reasoning, highlighting the need for structured, multi-step instruction tuning in advancing open-source multi-modal models' capabilities.

What problem does this paper attempt to address?

The paper addresses the poor performance of existing open-source multimodal language models on complex, multi-step tasks that require several abilities at once. These tasks involve fine-grained recognition, visual grounding, and reasoning, and usually require multi-step solutions. The paper proposes a new method to improve model performance on them.

### Main problems:

1. **Limitations of existing models**: Although existing multimodal language models perform well on simple question-answering tasks, they often fail on complex problems, such as tasks involving OCR, mathematical reasoning, and spatial reasoning.
2. **Lack of high-quality training data**: Current instruction-tuning datasets mainly contain single-step solutions and short, direct answers. They lack structured multi-step guidance, which makes it difficult for a model to learn complex reasoning and action paths.

### Solutions:

To solve these problems, the paper proposes TACO (Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action), a multimodal large action model for complex multimodal tasks. The main features of TACO are as follows:

1. **Generate synthetic Chains-of-Thought-and-Action (CoTA)**: Large-scale synthetic CoTA data is generated with GPT-4o and Python programs to provide high-quality training samples.
2. **Use external tools to perform intermediate steps**: During inference, TACO can call external tools (such as OCR, depth estimation, and a calculator) to carry out intermediate steps, and it integrates their outputs into its final response.
3. **High-quality data filtering and mixing**: The generated data is filtered and mixed, reducing over 1M generated traces to a final subset of 293K high-quality CoTA examples.

### Experimental results:

- TACO improves on the instruction-tuned baseline by 3.6% on average across 8 benchmarks, with gains of up to 15% on MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning.
- Training on high-quality CoTA data improves the model's reasoning and tool-calling ability, surpassing the baseline that is instruction-tuned only on direct answers.

### Conclusion:

By introducing synthetic Chains-of-Thought-and-Action (CoTA) data, the paper addresses the shortcomings of existing models on complex multimodal tasks and points to a direction for future research.
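
The inference loop described under Solutions (the model emits a thought, optionally calls a tool, and folds the tool output back into its context) can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the `Step` format, the `run_cota` function, the `scripted_policy` stand-in for the model, and the canned `fake_ocr` tool are all hypothetical names introduced here.

```python
import ast
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """Hypothetical step format: a thought plus either a tool call or a final answer."""
    thought: str
    tool: Optional[str] = None      # name of the tool to invoke, if any
    argument: Optional[str] = None  # single string argument passed to the tool
    answer: Optional[str] = None    # set only on the terminating step


_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))


def fake_ocr(image_id: str) -> str:
    """Stand-in for a real OCR tool (assumption): returns canned text."""
    return "Total: 12 + 30"


Trace = List[Tuple[Step, Optional[str]]]


def run_cota(policy: Callable[[str, Trace], Step],
             tools: Dict[str, Callable[[str], str]],
             question: str,
             max_steps: int = 5) -> Tuple[Optional[str], Trace]:
    """Alternate between model steps and tool calls until an answer is produced."""
    trace: Trace = []
    for _ in range(max_steps):
        step = policy(question, trace)
        if step.answer is not None:
            trace.append((step, None))
            return step.answer, trace
        if step.tool in tools:
            observation = tools[step.tool](step.argument)
        else:
            observation = f"error: unknown tool {step.tool!r}"
        trace.append((step, observation))  # the observation is visible to the next step
    return None, trace  # step budget exhausted without an answer


def scripted_policy(question: str, trace: Trace) -> Step:
    """Hard-coded stand-in for the trained model (assumption), for demonstration only."""
    if len(trace) == 0:
        return Step("I need the text in the image.", tool="ocr", argument="receipt.png")
    if len(trace) == 1:
        expression = trace[0][1].split(":", 1)[1].strip()
        return Step("Now compute the sum.", tool="calculator", argument=expression)
    return Step("I have the result.", answer=trace[1][1])


TOOLS = {"ocr": fake_ocr, "calculator": calculator}
answer, trace = run_cota(scripted_policy, TOOLS, "What is the total on the receipt?")
```

In a real system the policy would be the trained model generating each step from the question, image, and accumulated trace, and the tools would be actual OCR, depth-estimation, and calculator backends. The loop structure (generate, execute, observe, repeat) is the only part this sketch is meant to illustrate.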

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
