uTalk: Bridging the Gap Between Humans and AI

Hussam Azzuni, Sharim Jamal, Abdulmotaleb Elsaddik

2023-12-13

Abstract: Large Language Models (LLMs) have revolutionized various industries by improving productivity and facilitating learning across different fields. One intriguing application combines LLMs with visual models to create a novel approach to Human-Computer Interaction. The core idea of this system is a user-friendly platform that lets people use ChatGPT's features in their everyday lives. uTalk combines Whisper, ChatGPT, Microsoft Speech Services, and the state-of-the-art (SOTA) talking head system SadTalker. Users can engage in human-like conversation with a digital twin and receive answers to their questions. uTalk can also generate content from a submitted image and an input (text or audio). The system is hosted on Streamlit, where users are prompted to provide an image to serve as their AI assistant. Then, once the input (text or audio) is provided, a sequence of operations produces a video of the avatar delivering the response. This paper outlines how SadTalker's run-time has been reduced by 27.69% for videos generated at 25 frames per second (FPS), and by 38.38% when compared against our 20 FPS videos. Furthermore, the integration and parallelization of SadTalker and Streamlit have resulted in a 9.8% improvement over the initial performance of the system.
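The percentages in the abstract are relative runtime reductions, and lowering the frame rate reduces the number of frames that must be rendered. The following is a minimal sketch of that arithmetic, not the authors' measurement code. The function names and the timings in the usage lines are hypothetical and chosen only for illustration.

```python
def runtime_reduction(baseline_s: float, optimized_s: float) -> float:
    """Relative runtime reduction, as a percentage of the baseline runtime."""
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    return (baseline_s - optimized_s) / baseline_s * 100.0


def frame_count(duration_s: float, fps: int) -> int:
    """Number of frames needed for a clip of the given duration."""
    return round(duration_s * fps)


# Hypothetical timings (seconds), NOT taken from the paper.
baseline, optimized = 100.0, 72.31
print(f"reduction: {runtime_reduction(baseline, optimized):.2f}%")

# A 10-second clip needs fewer frames at 20 FPS than at 25 FPS.
print(frame_count(10, 25), frame_count(10, 20))
```

For a fixed clip length, 20 FPS requires 20% fewer frames than 25 FPS, which is one reason a lower frame rate can shorten generation time.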

Human-Computer Interaction

What problem does this paper attempt to address?

The main objective of this paper is to propose a framework called uTalk, which aims to enhance human-computer interaction by integrating an optimized SadTalker system with several other components (the Whisper API, ChatGPT, and text-to-speech implemented through Azure Cognitive Services). Specifically, uTalk aims to address the following issues:

1. **Creating a user-friendly interactive platform**: By building on the capabilities of large language models (LLMs), uTalk lets users converse with digital twins in a natural manner and obtain the information they need.
2. **Optimizing SadTalker's runtime**: Code-level optimization of SadTalker speeds up video generation, reducing the runtime by approximately 27.69%.
3. **Improving user experience**: The FPS (frames per second) is adjusted to balance video smoothness against generation speed. The study found that 20 FPS provides quality close to 25 FPS while significantly shortening the processing time.
4. **Integration and parallelization**: SadTalker is integrated with the Streamlit platform, and a modular design separates initialization from video generation, further improving the system's response speed.

Through these improvements, uTalk not only enhances the smoothness and practicality of human-computer interaction but also opens new possibilities for content creation.
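The separation of initialization from per-request video generation can be illustrated with a small sketch. It is written under stated assumptions and is not the authors' implementation: `AvatarPipeline`, `load_pipeline`, and the four stage callables are hypothetical names, and the stages are trivial stand-ins for the roles played by Whisper, ChatGPT, the text-to-speech service, and SadTalker.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable, Optional


@dataclass(frozen=True)
class AvatarPipeline:
    """Hypothetical wiring of the four stages named in the summary."""
    transcribe: Callable[[bytes], str]        # speech -> text (Whisper's role)
    chat: Callable[[str], str]                # prompt -> reply (ChatGPT's role)
    synthesize: Callable[[str], bytes]        # reply -> audio (text-to-speech role)
    animate: Callable[[bytes, bytes], bytes]  # image + audio -> video (SadTalker's role)

    def respond(self, image: bytes, text: Optional[str] = None,
                audio: Optional[bytes] = None) -> bytes:
        """Run one request: accept text or audio, return the avatar video."""
        if text is None:
            if audio is None:
                raise ValueError("provide either text or audio")
            text = self.transcribe(audio)
        reply = self.chat(text)
        speech = self.synthesize(reply)
        return self.animate(image, speech)


load_count = 0  # counts how often the expensive setup actually runs


@lru_cache(maxsize=1)
def load_pipeline() -> AvatarPipeline:
    """Expensive one-time setup, cached so later requests reuse it."""
    global load_count
    load_count += 1
    # Stand-in stages (assumptions): real ones would load models or call services.
    return AvatarPipeline(
        transcribe=lambda audio: audio.decode("utf-8"),
        chat=lambda prompt: f"echo: {prompt}",
        synthesize=lambda reply: reply.encode("utf-8"),
        animate=lambda image, speech: image + b"|" + speech,
    )


first = load_pipeline().respond(b"img", text="hi")
second = load_pipeline().respond(b"img", audio=b"hello")
print(first, second, load_count)
```

Because the setup is cached, only the first request pays the initialization cost. In a Streamlit app a comparable effect is usually obtained with its resource caching decorator, `st.cache_resource`, though whether uTalk uses it is not stated in the source.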
