Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese
Abstract: While open-source multi-modal language models perform well on simple question answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions. We present TACO, a family of multi-modal large action models designed to improve performance on such complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation and calculator, then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction tuning data with only direct answers. Our model TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% in MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning. Training on high-quality CoTA traces sets a new standard for complex multi-modal reasoning, highlighting the need for structured, multi-step instruction tuning in advancing open-source multi-modal models' capabilities.
What problem does this paper attempt to address?
The paper addresses the poor performance of existing open-source multimodal language models on complex tasks that require several capabilities at once. These tasks combine fine-grained recognition, visual grounding, and reasoning, and they usually cannot be solved in a single step. The paper proposes a new method to improve model performance on such tasks.
### Main problems:
1. **Limitations of existing models**: Existing multimodal language models perform well on simple question-answering tasks, but they often fail on complex problems, such as tasks involving OCR, mathematical reasoning, and spatial reasoning.
2. **Lack of high-quality training data**: Current instruction-tuning datasets mainly contain single-step solutions and short, direct answers. They lack structured multi-step guidance, which makes it difficult for a model to learn complex reasoning and action paths.
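The difference between a direct-answer record and a multi-step CoTA trace can be illustrated with a minimal sketch. The field names (`question`, `steps`, `thought`, `action`, `observation`, `answer`), the tool names, and the sample values below are assumptions made for illustration; they are not the paper's actual data schema.

```python
# Hypothetical record formats; field names and values are illustrative only.

# A direct-answer instruction-tuning example: question in, short answer out.
direct_example = {
    "question": "Which object is closer to the camera, the mug or the lamp?",
    "answer": "the mug",
}

# A CoTA-style example: the same question with explicit intermediate
# thoughts, tool calls (actions), and tool outputs (observations).
cota_example = {
    "question": "Which object is closer to the camera, the mug or the lamp?",
    "steps": [
        {
            "thought": "I should locate both objects first.",
            "action": "localize_objects",          # hypothetical tool name
            "observation": "mug at (120, 340); lamp at (410, 95)",
        },
        {
            "thought": "Now I can compare their estimated depths.",
            "action": "estimate_depth",            # hypothetical tool name
            "observation": "mug: 0.8 m; lamp: 2.1 m",
        },
    ],
    "answer": "the mug",
}


def count_tool_calls(example: dict) -> int:
    """Count the steps in a record that invoke a tool."""
    return sum(1 for step in example.get("steps", []) if step.get("action"))


print(count_tool_calls(direct_example))  # no intermediate steps
print(count_tool_calls(cota_example))    # one per tool-calling step
```

Both records end in the same answer; only the second exposes the intermediate reasoning and tool usage that a model could learn to reproduce.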
### Solutions:
To solve these problems, the paper proposes TACO (Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action), a family of multimodal large action models for complex multimodal tasks. Its main features are as follows:
1. **Synthetic Chains-of-Thought-and-Action (CoTA)**: The authors generate over 1M synthetic CoTA traces with GPT-4o and Python programs to serve as training data.
2. **External tools for intermediate steps**: During inference, TACO calls external tools (such as OCR, depth estimation, and a calculator) to carry out intermediate steps, then integrates its thoughts and the tool outputs into the final response.
3. **Data filtering and mixing**: The authors experiment with various filtering and mixing techniques on the generated traces and arrive at a final subset of 293K high-quality CoTA examples.
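The thought-and-action loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not TACO's implementation: the `Step` type, the `run_cota` function, the stub `ocr` and `calculator` tools, and the scripted policy that stands in for the model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    thought: str
    action: Optional[str]  # tool name, or None when giving the final answer
    argument: str          # tool input, or the final answer itself


def ocr(image_id: str) -> str:
    """Stub OCR tool (assumption): returns canned text for a known image."""
    return {"receipt.png": "coffee 4.50\nbagel 3.25"}.get(image_id, "")


def calculator(expression: str) -> str:
    """Toy calculator (assumption): evaluates plain arithmetic only."""
    if not set(expression) <= set("0123456789+-*/(). "):
        raise ValueError("unsupported expression")
    return str(eval(expression, {"__builtins__": {}}, {}))


TOOLS: Dict[str, Callable[[str], str]] = {"ocr": ocr, "calculator": calculator}

History = List[Tuple[Step, str]]


def scripted_policy(question: str, history: History) -> Step:
    """Stand-in for the model: emits a fixed thought/action sequence."""
    if len(history) == 0:
        return Step("I need to read the prices.", "ocr", "receipt.png")
    if len(history) == 1:
        return Step("I should add the two prices.", "calculator", "4.50 + 3.25")
    return Step("The calculator gave the total.", None, history[-1][1])


def run_cota(policy, tools, question: str, max_steps: int = 5):
    """Alternate between policy steps and tool calls until a final answer."""
    history: History = []
    for _ in range(max_steps):
        step = policy(question, history)
        if step.action is None:
            return step.argument, history
        observation = tools[step.action](step.argument)  # execute the action
        history.append((step, observation))              # feed the result back
    raise RuntimeError("no final answer within max_steps")


answer, trace = run_cota(scripted_policy, TOOLS, "What is the receipt total?")
print(answer)  # 7.75
```

Each tool output is appended to the history before the next step is requested, which is how later thoughts can depend on earlier observations. In a real system the scripted policy would be replaced by the model's generation call.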
### Experimental results:
- TACO outperforms the instruction-tuned baseline across 8 benchmarks by 3.6% on average, with gains of up to 15% on MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning.
- Training on high-quality CoTA data improves the model's reasoning and tool-calling ability, so that it surpasses baseline models instruction-tuned on direct answers only.
### Conclusion:
By introducing synthetic Chains-of-Thought-and-Action (CoTA) data, the paper addresses the weaknesses of existing models on complex multimodal tasks and points to structured, multi-step instruction tuning as a direction for future research.