uTalk: Bridging the Gap Between Humans and AI

Hussam Azzuni, Sharim Jamal, Abdulmotaleb Elsaddik
2023-12-13
Abstract: Large Language Models (LLMs) have revolutionized various industries by improving productivity and facilitating learning across different fields. One intriguing application combines LLMs with visual models to create a novel approach to Human-Computer Interaction. The core idea of this system is to provide a user-friendly platform that lets people use ChatGPT's features in their everyday lives. uTalk combines technologies such as Whisper, ChatGPT, Microsoft Speech Services, and the state-of-the-art (SOTA) talking head system SadTalker. Users can hold a human-like conversation with a digital twin and receive answers to their questions. uTalk can also generate content from a submitted image and an input (text or audio). The system is hosted on Streamlit, where users are prompted to provide an image that serves as their AI assistant. Once the input (text or audio) is provided, a sequence of operations produces a video of the avatar delivering the response. This paper outlines how SadTalker's run-time has been reduced by 27.69% for videos generated at 25 frames per second (FPS) and by 38.38% when our videos are generated at 20 FPS. Furthermore, the integration and parallelization of SadTalker and Streamlit have resulted in a 9.8% improvement over the initial performance of the system.
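The flow described in the abstract (optional speech input, a language-model reply, speech synthesis, then avatar animation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Stages` and `respond` are hypothetical names, and each stage is passed in as a plain callable so that no real Whisper, ChatGPT, Azure, or SadTalker API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Stages:
    """Hypothetical container for the four stages named in the abstract."""
    transcribe: Callable[[bytes], str]        # speech -> text (Whisper's role)
    answer: Callable[[str], str]              # question -> reply (ChatGPT's role)
    synthesize: Callable[[str], bytes]        # reply -> speech audio (text-to-speech)
    animate: Callable[[bytes, bytes], bytes]  # image + audio -> video (SadTalker's role)


def respond(stages: Stages, image: bytes,
            text: Optional[str] = None,
            audio: Optional[bytes] = None) -> bytes:
    """Run one conversational turn and return the avatar video bytes."""
    if text is None:
        if audio is None:
            raise ValueError("provide either text or audio input")
        # Audio input is transcribed first; text input skips this stage.
        text = stages.transcribe(audio)
    reply = stages.answer(text)
    speech = stages.synthesize(reply)
    return stages.animate(image, speech)
```

Passing the stages in as callables keeps the sketch testable with stand-in functions; a real deployment would wrap the actual services behind the same four signatures.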
Human-Computer Interaction
What problem does this paper attempt to address?
The main objective of this paper is to propose a framework called uTalk, which aims to enhance human-computer interaction by integrating an optimized SadTalker system with several other components (the Whisper API, ChatGPT, and text-to-speech implemented through Azure Cognitive Services). Specifically, uTalk aims to address the following issues:

1. **Creating a user-friendly interactive platform**: By drawing on the capabilities of large language models (LLMs), the platform lets users converse with digital twins in a natural manner and obtain the information they need.
2. **Optimizing SadTalker's runtime**: Code-level optimization of SadTalker speeds up video generation, reducing the runtime by approximately 27.69%.
3. **Improving user experience**: The study adjusts the FPS (frames per second) to balance video smoothness against generation speed, and finds that 20 FPS provides quality close to 25 FPS while significantly shortening the processing time.
4. **Integration and parallelization**: SadTalker is integrated with the Streamlit platform, and a modular design separates initialization from video generation, further improving the system's response speed.

Through these improvements, uTalk not only enhances the smoothness and practicality of human-computer interaction but also opens new possibilities for content creation.
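Two of the ideas above, lowering the frame rate and separating initialization from video generation, can be illustrated with a small sketch. It is an assumption-laden toy, not SadTalker's code: `load_models`, `frames_needed`, and `generate_video` are hypothetical names, and the "models" are a placeholder dictionary. The one-time initialization is cached with `functools.lru_cache`; in a Streamlit app, the `st.cache_resource` decorator plays a similar role.

```python
import math
from functools import lru_cache

load_count = 0  # counts how many times the expensive initialization runs


@lru_cache(maxsize=1)
def load_models() -> dict:
    """Placeholder for one-time model loading (hypothetical)."""
    global load_count
    load_count += 1
    return {"ready": True}


def frames_needed(audio_seconds: float, fps: int) -> int:
    """Number of frames to render for a clip of the given length."""
    return math.ceil(audio_seconds * fps)


def generate_video(audio_seconds: float, fps: int = 20) -> dict:
    """Per-request work: reuses the cached models, renders only the frames."""
    load_models()  # returns immediately after the first call
    return {"fps": fps, "frames": frames_needed(audio_seconds, fps)}


# A 10-second reply needs 250 frames at 25 FPS but 200 frames at 20 FPS.
at_25 = generate_video(10, fps=25)
at_20 = generate_video(10, fps=20)
```

Dropping from 25 to 20 FPS means 20% fewer frames to render for the same audio length. The runtime figures reported in the paper are measured results and are not derived from this frame count alone.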