Abstract: RITA presents a high-quality real-time interactive framework built upon generative models, designed with practical applications in mind. Our framework enables the transformation of user-uploaded photos into digital avatars that can engage in real-time dialogue interactions. By leveraging the latest advancements in generative modeling, we have developed a versatile platform that not only enhances the user experience through dynamic conversational avatars but also opens new avenues for applications in virtual reality, online education, and interactive gaming. This work showcases the potential of integrating computer vision and natural language processing technologies to create immersive and interactive digital personas, pushing the boundaries of how we interact with digital content.

What problem does this paper attempt to address?

This paper addresses how to generate high-quality talking avatars from static images in real-time interactive scenarios. Existing audio-driven talking avatar models synchronize lip movements with audio well, but they suffer from high latency and heavy computational requirements. This limits their practicality in applications that require real-time interaction, such as virtual reality, online education, and interactive games.

### Main contributions of the paper

1. **Development of the RITA framework**:
   - RITA is an end-to-end framework that generates real-time interactive talking avatars from static images, using generative models and large language models (LLMs) to enable natural avatar-user conversations.
2. **New real-time processing pipeline**:
   - By optimizing the generative model and introducing an efficient real-time processing mechanism, RITA significantly reduces the latency of existing talking avatar models, aiming for seamless, high-fidelity interaction.
3. **Integration of LLMs for content generation**:
   - Large language models are integrated for content generation, enabling avatars to hold coherent, context-relevant conversations. This expands the application range of talking avatars to uses such as virtual customer service agents and personalized digital companions.
4. **Empirical evidence of performance improvement**:
   - Comparative analysis and experiments indicate that RITA outperforms existing methods in latency, quality, and applicability, with improvements in speed, interaction quality, and user experience.

### Specific methods to solve the problem

- **Base frame generation**: Use a flexible audio-driven model to generate initial video frames, generating hyperparameters by embedding audio signals to ensure that the generated frames are synchronized with the audio.
- **Dynamic frame matching**: Match new audio inputs with pre-generated frames through an efficient algorithm, which avoids continuous video regeneration, and maintain visual coherence through an intelligent frame reduction strategy.
- **Real-time video interpolation**: Apply real-time video frame interpolation to fill the gaps left by frame reduction and restore the natural smoothness of avatar movements.

Together, these techniques aim to overcome the limitations of existing models in real-time applications and to set a new benchmark for future talking avatar technologies.
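The dynamic frame matching, frame reduction, and interpolation steps described above can be illustrated with a minimal sketch. This is not RITA's implementation: the summary does not specify the matching algorithm, the reduction strategy, or the interpolation model. The sketch assumes cosine-similarity nearest-neighbor matching over audio feature vectors, a fixed "keep every k-th frame" reduction, and plain linear blending as a stand-in for a learned interpolation network. All function names (`match_frames`, `reduce_frames`, `rebuild_sequence`) are hypothetical, and frames are represented as flat lists of pixel values for simplicity.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_frames(new_audio: Sequence[Vector], bank_audio: Sequence[Vector]) -> List[int]:
    """For each new audio feature, pick the index of the most similar
    pre-generated frame (assumed matching rule: nearest neighbor)."""
    if not bank_audio:
        raise ValueError("frame bank is empty")
    return [
        max(range(len(bank_audio)), key=lambda i: cosine_similarity(query, bank_audio[i]))
        for query in new_audio
    ]


def reduce_frames(num_frames: int, keep_every: int = 2) -> List[int]:
    """Assumed reduction rule: keep every `keep_every`-th position plus the last."""
    if num_frames <= 0:
        return []
    kept = list(range(0, num_frames, keep_every))
    if kept[-1] != num_frames - 1:
        kept.append(num_frames - 1)
    return kept


def blend(a: Vector, b: Vector, t: float) -> List[float]:
    """Linear blend between two frames (stand-in for a learned interpolator)."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def rebuild_sequence(
    bank_frames: Sequence[Vector], matched: Sequence[int], kept: Sequence[int]
) -> List[List[float]]:
    """Fill the positions dropped by reduction by blending neighboring kept frames."""
    if not kept:
        return []
    output: List[List[float]] = []
    for left, right in zip(kept, kept[1:]):
        start = bank_frames[matched[left]]
        end = bank_frames[matched[right]]
        span = right - left
        for step in range(span):
            output.append(blend(start, end, step / span))
    output.append(list(bank_frames[matched[kept[-1]]]))
    return output


if __name__ == "__main__":
    # Toy data: three pre-generated frames, each paired with an audio feature.
    bank_audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bank_frames = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]
    new_audio = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0], [0.0, 1.0]]

    matched = match_frames(new_audio, bank_audio)
    kept = reduce_frames(len(matched), keep_every=2)
    print(rebuild_sequence(bank_frames, matched, kept))
```

The rebuilt sequence has the same length as the incoming audio feature sequence, so playback stays aligned with the audio even though only the kept positions reuse pre-generated frames directly. In a real system the linear blend would likely be replaced by a dedicated video frame interpolation model.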

RITA: A Real-time Interactive Talking Avatars Framework
