Abstract: The emergence of commercial tools for real-time performance-based 2D animation has enabled 2D characters to appear on live broadcasts and streaming platforms. A key requirement for live animation is fast and accurate lip sync that allows characters to respond naturally to other actors or the audience through the voice of a human performer. In this work, we present a deep learning based interactive system that automatically generates live lip sync for layered 2D characters using a Long Short Term Memory (LSTM) model. Our system takes streaming audio as input and produces viseme sequences with less than 200ms of latency (including processing time). Our contributions include specific design decisions for our feature definition and LSTM configuration that provide a small but useful amount of lookahead to produce accurate lip sync. We also describe a data augmentation procedure that allows us to achieve good results with a very small amount of hand-animated training data (13-20 minutes). Extensive human judgement experiments show that our results are preferred over several competing methods, including those that only support offline (non-live) processing. Video summary and supplementary results at GitHub link: https://github.com/deepalianeja/CharacterLipSync2D

What problem does this paper attempt to address?

This paper addresses the problem of generating high-quality lip sync for real-time 2D animation. Specifically, the authors propose a deep-learning-based method that uses a Long Short-Term Memory (LSTM) network to convert a live audio stream into a sequence of discrete visemes for a 2D character, achieving lip sync with low latency (less than 200 milliseconds).

### Background of the Paper

The traditional 2D animation production process is very labor-intensive, requiring animators to draw each frame manually or to specify key frames and motion curves. With the emergence of real-time 2D animation, however, human performers can control cartoon characters live, which lets the characters interact directly with other actors and the audience. In this setting, high-quality lip sync becomes particularly important because it is a key factor in making a character appear to respond naturally to other actors or the audience.

### Specific Challenges of the Problem

1. **Real-Time Performance**: Live animation requires the system to operate with very low latency; the paper targets less than 200 milliseconds, including processing time.
2. **No Accurate Script**: Real-time interactive performances usually do not have a strict pre-defined script, so the system cannot rely on accurate Speech-to-Text (STT) algorithms.
3. **Difficulty in Data Collection**: Manually generating lip-sync data is very time-consuming and costly, usually requiring 5 to 7 hours of work per minute of speech.
4. **Style Diversity**: Different animators have different artistic styles in choosing visemes and transition timing, so training a general-purpose model is very challenging.

### Solution

The authors propose an LSTM-based real-time processing pipeline that converts a live audio stream into the corresponding viseme sequence. The specific methods are as follows:

1. **Feature Representation**:
   - Use Mel-Frequency Cepstral Coefficients (MFCC) and their first-order derivatives as input features.
   - Add log-energy and its derivative as additional features.
   - Estimate the derivatives from the MFCC values in the two windows before and after the current one, which provides a small amount of future information.
2. **Time Offset**:
   - Introduce a temporal shift so that the model can access future feature vectors when predicting the current viseme, thereby improving accuracy.
3. **Data Augmentation**:
   - Use recordings of multiple speakers from the TIMIT dataset, and align each speaker's recording to the reference recording through Dynamic Time Warping (DTW), thereby increasing the diversity of the training data.
4. **Filtering Mechanism**:
   - Remove short-term noise through filtering with a small amount of look-ahead.
   - Ensure that each viseme is displayed for at least two frames to avoid the flickering effect of single-frame visemes.

### Experimental Results

The authors compared their method with several baselines, including the online and offline automatic lip sync of commercial 2D animation tools, through human-preference experiments. The results show that viewers preferred their method over the competing methods, including those that only run offline, and that it achieves good results with only a small amount of hand-animated training data (13-20 minutes).

### Conclusion

This paper addresses the problem of high-quality lip sync in real-time 2D animation. The proposed method is technically feasible, performs well in practical applications, and can significantly reduce the time and cost of animation production.
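
As a rough illustration of three steps from the solution summarized above (derivatives estimated from two windows on each side, the time offset, and the two-frame minimum for visemes), the following Python sketch may help. It is not the authors' implementation: the function names, the regression-style derivative formula, the clamping at sequence edges, and the rule of replacing a too-short viseme with the previous one are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def delta(frames: Sequence[Sequence[float]], n: int = 2) -> List[List[float]]:
    """Estimate first derivatives from n frames before and after each frame.

    Assumption: a standard regression-style delta; indices past either end
    of the sequence are clamped to the first or last frame.
    """
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, last)][d]   # future frame (lookahead)
                behind = frames[max(t - k, 0)][d]     # past frame
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def shift_pairs(features: Sequence, visemes: Sequence[str],
                shift: int) -> List[Tuple[object, str]]:
    """Pair the viseme at frame t with the feature vector at frame t + shift.

    Assumption: features and visemes have one entry per frame and equal length.
    """
    return [(features[t + shift], visemes[t])
            for t in range(len(visemes) - shift)]


def enforce_min_duration(visemes: Sequence[str], min_frames: int = 2) -> List[str]:
    """Replace viseme runs shorter than min_frames with the preceding viseme."""
    out: List[str] = []
    i = 0
    while i < len(visemes):
        j = i
        while j < len(visemes) and visemes[j] == visemes[i]:
            j += 1                       # j ends one past the current run
        run = j - i
        if run < min_frames and out:
            out.extend([out[-1]] * run)  # hold the previous viseme instead
        else:
            out.extend(visemes[i:j])     # run is long enough (or is the first run)
        i = j
    return out


if __name__ == "__main__":
    ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    print(delta(ramp)[2])
    print(shift_pairs(ramp, ["a", "b", "c", "d", "e"], 1)[0])
    print(enforce_min_duration(["Ah", "Ah", "M", "Ee", "Ee", "Ee"]))
```

Note that `delta` with `n = 2` needs two future frames and `shift_pairs` needs `shift` more, which is where the lookahead (and hence part of the latency) comes from. The filter here works on a complete sequence; a live version would have to decide with only a frame or so of lookahead.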

Real-Time Lip Sync for Live 2D Animation

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

A Novel Lip Synchronization Approach for Games and Virtual Environments

A Novel Speech-Driven Lip-Sync Model with CNN and LSTM

VisemeNet: Audio-Driven Animator-Centric Speech Animation

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Real-time Lip Synchronization Based on Hidden Markov Models

A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

FlexLip: A Controllable Text-to-Lip System

Real-time speech-driven lip synchronization

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning

Audio2Rig: Artist-oriented deep learning tool for facial animation

LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

MILG: Realistic Lip-Sync Video Generation with Audio-Modulated Image Inpainting

Lip syncing method for realistic expressive 3D face model