OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents

Qiang Sun, Yuanyi Luo, Sirui Li, Wenxiao Zhang, Wei Liu
2024-08-06
Abstract: Multimodal conversational agents are highly desirable because they offer natural and human-like interaction. However, there is a lack of comprehensive end-to-end solutions to support collaborative development and benchmarking. While proprietary systems like GPT-4o and Gemini demonstrate impressive integration of audio, video, and text with response times of 200-250ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source, end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Augmented Generation, and Large Language Models, along with the ability to integrate customized models. OpenOmni supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking. This flexible framework allows researchers to customize the pipeline, focusing on real bottlenecks and facilitating rapid proof-of-concept development. OpenOmni can significantly enhance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration video is available at https://www.youtube.com/watch?v=zaSiT3clWqY, the demo is available at https://openomni.ai4wa.com, and the code is available at https://github.com/AI4WA/OpenOmniFramework.
Human-Computer Interaction, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is the set of gaps in how multimodal conversational agents are currently developed and benchmarked. Specifically, the paper focuses on the following issues:

1. **Lack of comprehensive end-to-end solutions**:
   - Proprietary systems such as GPT-4o and Gemini have demonstrated the ability to integrate audio, video, and text with response times of 200-250 milliseconds, but challenges remain in balancing latency, accuracy, cost, and data privacy.
   - There is no comprehensive, open-source, end-to-end implementation of a multimodal conversational agent, which restricts research and innovation.
2. **Data privacy and cost issues**:
   - Proprietary systems usually require uploading data to a server through a paid API, which raises data privacy concerns. For example, GPT-4 series solutions are closed-source, so users must upload data to the cloud, which brings privacy risks and additional costs.
3. **Lack of performance evaluation and benchmarking tools**:
   - Existing multimodal conversation systems lack effective evaluation and benchmarking tools, making it difficult to validate concepts quickly and identify bottlenecks. Robust evaluation and benchmarking protocols are crucial to support the rapid development of this field.

To solve these problems, the paper proposes the OpenOmni framework, an open-source, end-to-end multimodal pipeline that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Augmented Generation (RAG), Large Language Models (LLMs), and Text-to-Speech (TTS). OpenOmni supports local and cloud-based deployment, ensures data privacy, and supports latency and accuracy benchmarking. In addition, the framework allows researchers to customize the pipeline, focus on actual bottlenecks, and promote rapid proof-of-concept development.
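To make the idea of finding bottlenecks through latency benchmarking concrete, here is a minimal sketch of timing each stage of a sequential pipeline. It is not OpenOmni's actual API: `run_pipeline` and the four placeholder stage functions are hypothetical names invented for illustration, and each stage simply transforms a string instead of calling a real model.

```python
import time
from typing import Callable, Dict, List, Tuple

# A stage is a (name, function) pair. All names here are hypothetical.
Stage = Tuple[str, Callable[[str], str]]


def run_pipeline(stages: List[Stage], payload: str) -> Tuple[str, Dict[str, float]]:
    """Run stages in order and record the wall-clock seconds each one takes."""
    timings: Dict[str, float] = {}
    for name, stage_fn in stages:
        start = time.perf_counter()
        payload = stage_fn(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


# Placeholder stages standing in for real models (assumptions, not real components).
def speech_to_text(audio: str) -> str:
    return "where is the door"


def emotion_detection(text: str) -> str:
    return text + " [emotion=neutral]"


def llm(text: str) -> str:
    return "reply to: " + text


def text_to_speech(text: str) -> str:
    return "<audio:" + text + ">"


stages: List[Stage] = [
    ("speech_to_text", speech_to_text),
    ("emotion_detection", emotion_detection),
    ("llm", llm),
    ("text_to_speech", text_to_speech),
]

result, timings = run_pipeline(stages, "raw-audio-bytes")
slowest = max(timings, key=timings.get)  # candidate bottleneck stage
print(result)
print(f"slowest stage: {slowest}")
```

Recording a timing per stage, instead of one end-to-end number, is what lets a researcher see which component dominates the response time before swapping in a different model.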
### Formula examples

When describing latency and accuracy, the following formulas can be used to represent performance metrics:

- **Average Latency**: $\text{Average Latency}=\frac{\sum_{i = 1}^{n}t_i}{n}$, where $t_i$ is the latency of the $i$-th interaction and $n$ is the total number of interactions.
- **Accuracy Score**: $\text{Accuracy Score}=\frac{\sum_{i = 1}^{m}s_i}{m}$, where $s_i$ is the score of the $i$-th interaction and $m$ is the total number of scored interactions.

These formulas help to quantify and evaluate the performance of multimodal conversational agents.
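The two formulas above are plain arithmetic means, so they can be sketched in a few lines of Python. The function names and the sample measurements below are hypothetical and chosen only for illustration; they are not taken from the paper or from the OpenOmni codebase.

```python
from statistics import fmean
from typing import Sequence


def average_latency(latencies: Sequence[float]) -> float:
    """Mean of the per-interaction latencies t_i over n interactions."""
    if not latencies:
        raise ValueError("need at least one latency measurement")
    return fmean(latencies)


def accuracy_score(scores: Sequence[float]) -> float:
    """Mean of the per-interaction scores s_i over m scored interactions."""
    if not scores:
        raise ValueError("need at least one score")
    return fmean(scores)


# Hypothetical measurements: latencies in seconds, scores on an arbitrary scale.
latencies_s = [1.0, 2.0, 3.0]
scores = [4, 5, 3]

print(average_latency(latencies_s))  # 2.0
print(accuracy_score(scores))        # 4.0
```

Raising an error on empty input avoids silently dividing by zero when no interactions have been recorded or scored yet.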