RITA: A Real-time Interactive Talking Avatars Framework

Wuxinlin Cheng, Cheng Wan, Yupeng Cao, Sihan Chen
2024-06-19
Abstract: RITA presents a high-quality real-time interactive framework built upon generative models, designed with practical applications in mind. Our framework enables the transformation of user-uploaded photos into digital avatars that can engage in real-time dialogue interactions. By leveraging the latest advancements in generative modeling, we have developed a versatile platform that not only enhances the user experience through dynamic conversational avatars but also opens new avenues for applications in virtual reality, online education, and interactive gaming. This work showcases the potential of integrating computer vision and natural language processing technologies to create immersive and interactive digital personas, pushing the boundaries of how we interact with digital content.
Computer Vision and Pattern Recognition, Artificial Intelligence, Human-Computer Interaction
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to generate high-quality talking avatars from static images in real-time interactive scenarios. Existing audio-driven talking avatar generation models synchronize lip movements with audio well, but in real-time applications they suffer from high latency and heavy computational requirements. This limits their practicality in scenarios that require real-time interaction, such as virtual reality, online education, and interactive games.

### Main contributions of the paper

1. **Development of the RITA framework**:
   - RITA is an end-to-end framework that generates real-time interactive talking avatars from static images, using generative models and large language models (LLMs) to achieve natural avatar-user conversations.
2. **New real-time processing pipeline**:
   - By optimizing the generative model and introducing an efficient real-time processing mechanism, RITA significantly reduces the latency of existing talking avatar models, achieving seamless and high-fidelity interaction.
3. **Integration of LLMs for content generation**:
   - Large language models are integrated for content generation, enabling the avatars to hold coherent and context-relevant conversations. This expands the application range of talking avatars to uses such as virtual customer service agents and personalized digital companions.
4. **Empirical evidence of performance improvement**:
   - Comparative analysis and experiments indicate that RITA outperforms existing methods in latency, quality, and applicability, with improvements in speed, interaction quality, and user experience.

### Specific methods to solve the problem

- **Base frame generation**: A flexible audio-driven model generates the initial video frames. Hyperparameters are generated by embedding the audio signal, so that the generated frames stay synchronized with the audio.
- **Dynamic frame matching**: An efficient algorithm matches new audio inputs with pre-generated frames, which avoids continuous video regeneration. An intelligent frame reduction strategy maintains visual coherence.
- **Real-time video interpolation**: Advanced real-time video frame interpolation techniques fill the gaps left by frame reduction and restore the natural smoothness of avatar movements.

Through these innovations, RITA not only addresses the limitations of existing models in real-time applications but also sets a new benchmark for future talking avatar technologies.
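
The matching, frame reduction, and interpolation steps described above could be sketched roughly as follows. This is a minimal illustration and not RITA's actual implementation. The function names (`match_frames`, `keyframe_positions`, `fill_gaps`), the nearest-neighbour matching over audio features, the fixed-stride reduction, and the linear blend are all assumptions made here for clarity. In particular, the linear blend stands in for the learned real-time interpolation model that the summary mentions, and frames are represented as flat lists of floats.

```python
from math import dist


def match_frames(audio_feats, bank_feats):
    """Map each incoming audio feature vector to the index of the closest
    pre-generated frame (assumed here: Euclidean nearest neighbour)."""
    return [
        min(range(len(bank_feats)), key=lambda j: dist(feat, bank_feats[j]))
        for feat in audio_feats
    ]


def keyframe_positions(n, stride):
    """Frame reduction (assumed fixed stride): keep every `stride`-th
    position in a sequence of length n, and always keep the last one."""
    if n == 0:
        return []
    kept = list(range(0, n, stride))
    if kept[-1] != n - 1:
        kept.append(n - 1)
    return kept


def fill_gaps(keyframes, positions):
    """Rebuild the full-length sequence by linearly blending neighbouring
    keyframes. A placeholder for a learned frame interpolation model."""
    if not keyframes:
        return []
    out = []
    pairs = list(zip(positions, keyframes))
    for (p0, f0), (p1, f1) in zip(pairs, pairs[1:]):
        for p in range(p0, p1):
            t = (p - p0) / (p1 - p0)  # 0 at the left keyframe, approaching 1 at the right
            out.append([(1 - t) * a + t * b for a, b in zip(f0, f1)])
    out.append(list(keyframes[-1]))
    return out


# Hypothetical toy data: three pre-generated frames, each stored with the
# audio feature it was generated from.
bank_feats = [[0.0], [1.0], [2.0]]
bank_frames = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]

new_audio = [[0.1], [0.9], [2.2], [1.8], [0.2]]
matched = match_frames(new_audio, bank_feats)
positions = keyframe_positions(len(matched), stride=2)
keyframes = [bank_frames[matched[p]] for p in positions]
video = fill_gaps(keyframes, positions)
```

In this sketch only the keyframes are looked up from the pre-generated bank, and the remaining frames are synthesized by interpolation. That mirrors the idea of avoiding continuous video regeneration, although the real system's matching criterion and interpolation model may differ.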