Abstract: Significant advancements have recently been achieved in the field of multi-modal large language models (MLLMs), demonstrating their remarkable capabilities in understanding and reasoning across diverse tasks. However, these models are often trained for specific tasks and rely on task-specific input-output formats, limiting their applicability to a broader range of tasks. This raises a fundamental question: Can we develop a unified approach to represent and handle different multi-modal tasks to maximize the generalizability of MLLMs? In this paper, we propose UnifiedMLLM, a comprehensive model designed to represent various tasks using a unified representation. Our model exhibits strong capabilities in comprehending the implicit intent of user instructions and performing reasoning. In addition to generating textual responses, our model also outputs task tokens and grounding tokens, which serve as indicators of task type and task granularity. These outputs are subsequently routed through the task router and directed to specific expert models for task completion. To train our model, we construct a task-specific dataset and a 100k multi-task dataset encompassing complex scenarios. Employing a three-stage training strategy, we equip our model with robust reasoning and task processing capabilities while preserving its generalization capacity and knowledge reservoir. Extensive experiments showcase the impressive performance of our unified representation approach across various tasks, surpassing existing methodologies. Furthermore, our approach exhibits exceptional scalability and generality. Our code, model, and dataset will be available at https://github.com/lzw-lzw/UnifiedMLLM.
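The routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `<task>` and `<region>` tag formats, the expert names, and the `parse_output` and `route` helpers are all hypothetical, chosen only to show how task tokens and grounding tokens in a model response might be dispatched to an expert model.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical token formats; the abstract does not specify the real ones.
TASK_PATTERN = re.compile(r"<task>(.*?)</task>")
GROUND_PATTERN = re.compile(r"<region>(.*?)</region>")

# An expert takes the grounding target (or None) and returns a result string.
Expert = Callable[[Optional[str]], str]


def parse_output(text: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split a response into plain text, task name, and grounding target."""
    task = TASK_PATTERN.search(text)
    region = GROUND_PATTERN.search(text)
    plain = GROUND_PATTERN.sub("", TASK_PATTERN.sub("", text)).strip()
    return (
        plain,
        task.group(1) if task else None,
        region.group(1) if region else None,
    )


def route(text: str, experts: Dict[str, Expert]) -> str:
    """Send the response to the matching expert, or fall back to plain text."""
    plain, task, region = parse_output(text)
    if task is None or task not in experts:
        return plain
    return experts[task](region)


# Stand-in experts (assumed names); real ones would wrap external models.
experts: Dict[str, Expert] = {
    "edit": lambda region: f"edit expert called on {region}",
    "segment": lambda region: f"segmentation expert called on {region}",
}

response = "Sure, I will remove it. <task>edit</task><region>the red car</region>"
result = route(response, experts)
```

In this sketch, a response without a recognized task token is returned as ordinary text, which mirrors the abstract's point that the model still produces textual responses alongside the routing tokens.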

PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models

Hierarchical and Bidirectional Joint Multi-Task Classifiers for Natural Language Understanding

PolyVoice: Language Models for Speech to Speech Translation

SpeechNet: A Universal Modularized Model for Speech Processing Tasks

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Joint Speech-Text Embeddings for Multitask Speech Processing

Rethinking and Improving Multi-task Learning for End-to-end Speech Translation

A Large-Scale Evaluation of Speech Foundation Models

12-in-1: Multi-task vision and language representation learning

Multispeech: Multi-speaker text to speech with transformer

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

A Multi-Task Semantic Communication System for Natural Language Processing

SADDEL: Joint Speech Separation and Denoising Model based on Multitask Learning

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

End-to-End-Based Tibetan Multitask Speech Recognition

STTATTS: Unified Speech-To-Text And Text-To-Speech Model

Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

Toward Joint Language Modeling for Speech Units and Text

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition