Abstract: Multimodal conversational agents are highly desirable because they offer natural and human-like interaction. However, there is a lack of comprehensive end-to-end solutions to support collaborative development and benchmarking. While proprietary systems like GPT-4o and Gemini demonstrate impressive integration of audio, video, and text with response times of 200-250ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source, end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Augmented Generation, and Large Language Models, along with the ability to integrate customized models. OpenOmni supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking. This flexible framework allows researchers to customize the pipeline, focusing on real bottlenecks and facilitating rapid proof-of-concept development. OpenOmni can significantly enhance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration video is available at https://www.youtube.com/watch?v=zaSiT3clWqY, a demo is available at https://openomni.ai4wa.com, and the code is available at https://github.com/AI4WA/OpenOmniFramework.

What problem does this paper attempt to address?

The main problem this paper addresses is the set of gaps in how multimodal conversational agents are currently developed and benchmarked. Specifically, the paper focuses on the following issues:

1. **Lack of comprehensive end-to-end solutions**:
   - Proprietary systems such as GPT-4o and Gemini have demonstrated the ability to integrate audio, video, and text with response times of 200-250 milliseconds, but balancing latency, accuracy, cost, and data privacy remains a challenge.
   - No comprehensive, open-source, end-to-end implementation of a multimodal conversational agent is available, which restricts research and innovation.
2. **Data privacy and cost issues**:
   - Using proprietary systems usually requires uploading data to a server through a paid API, which raises data privacy concerns. For example, GPT-4 series solutions are closed-source, so users must upload data to the cloud, which brings privacy risks and additional costs.
3. **Lack of performance evaluation and benchmarking tools**:
   - Existing multimodal conversation systems lack effective evaluation and benchmarking tools, making it difficult to validate concepts quickly and to identify bottlenecks. Robust evaluation and benchmarking protocols are crucial to support the rapid development of this field.

To solve these problems, the paper proposes the OpenOmni framework, an open-source, end-to-end multimodal pipeline that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Augmented Generation (RAG), Large Language Models (LLMs), and Text-to-Speech (TTS). OpenOmni supports local and cloud-based deployment, ensures data privacy, and supports latency and accuracy benchmarking. In addition, the framework allows researchers to customize the pipeline, focus on actual bottlenecks, and carry out rapid proof-of-concept development.
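To make the idea of a customizable pipeline with per-stage latency benchmarking more concrete, here is a minimal Python sketch. It is not OpenOmni's actual API: the `run_pipeline` helper and the four placeholder stage functions are hypothetical names introduced only for illustration, and each stage is a trivial stand-in for a real model.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stage type: takes the previous stage's output, returns its own.
Stage = Callable[[Any], Any]


def run_pipeline(
    stages: List[Tuple[str, Stage]], payload: Any
) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order and record the wall-clock seconds spent in each."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


# Placeholder stages standing in for real models (illustrative assumptions only).
def speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")


def detect_emotion(text: str) -> dict:
    return {"text": text, "emotion": "neutral"}


def generate_reply(context: dict) -> str:
    return f"[{context['emotion']}] You said: {context['text']}"


def text_to_speech(reply: str) -> bytes:
    return reply.encode("utf-8")


stages = [
    ("speech_to_text", speech_to_text),
    ("emotion_detection", detect_emotion),
    ("llm", generate_reply),
    ("text_to_speech", text_to_speech),
]

output, timings = run_pipeline(stages, b"where is the door")
# The slowest stage is the first candidate for optimization or replacement.
bottleneck = max(timings, key=timings.get)
```

Because every stage is just a named callable, swapping in a different model means replacing one entry in `stages`, and the `timings` dictionary shows which component dominates the end-to-end response time.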
### Formula examples

When describing latency and accuracy, the following formulas can be used to represent performance metrics:

- **Average Latency**: $\text{Average Latency}=\frac{\sum_{i = 1}^{n}t_i}{n}$, where $t_i$ is the latency of the $i$-th interaction and $n$ is the total number of interactions.
- **Accuracy Score**: $\text{Accuracy Score}=\frac{\sum_{i = 1}^{m}s_i}{m}$, where $s_i$ is the score assigned to the $i$-th scored interaction and $m$ is the total number of scored interactions.

These formulas help to quantify and evaluate the performance of multimodal conversational agents.
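The two formulas above are plain arithmetic means, so they can be computed directly. The following sketch assumes latencies are recorded in milliseconds and scores on an arbitrary numeric scale; the function names and sample values are illustrative and are not taken from the paper.

```python
from typing import Sequence


def average_latency(latencies: Sequence[float]) -> float:
    """Mean of t_1..t_n, the per-interaction latencies."""
    if not latencies:
        raise ValueError("at least one latency measurement is required")
    return sum(latencies) / len(latencies)


def accuracy_score(scores: Sequence[float]) -> float:
    """Mean of s_1..s_m, the per-interaction scores."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


# Made-up sample data: three interactions timed in ms, three scored 1-5.
latencies_ms = [200.0, 250.0, 300.0]
scores = [4, 5, 3]

mean_latency = average_latency(latencies_ms)  # 250.0
mean_score = accuracy_score(scores)           # 4.0
```

Both helpers raise `ValueError` on empty input rather than dividing by zero, which keeps a benchmark run with no recorded interactions from silently reporting a meaningless value.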

OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data

OmniBench: Towards The Future of Universal Omni-Language Models

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Conversational AI Multi-Agent Interoperability, Universal Open APIs for Agentic Natural Language Multimodal Communications

OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

OpenVNA: A Framework for Analyzing the Behavior of Multimodal Language Understanding System under Noisy Scenarios

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

OmniDialog: An Omnipotent Pre-training Model for Task-Oriented Dialogue System

COMMA: A Communicative Multimodal Multi-Agent Benchmark

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance

OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities

VITA: Towards Open-Source Interactive Omni Multimodal LLM