Real-Time Lip Sync for Live 2D Animation

Deepali Aneja, Wilmot Li
DOI: https://doi.org/10.48550/arXiv.1910.08685
2019-10-19
Abstract: The emergence of commercial tools for real-time performance-based 2D animation has enabled 2D characters to appear on live broadcasts and streaming platforms. A key requirement for live animation is fast and accurate lip sync that allows characters to respond naturally to other actors or the audience through the voice of a human performer. In this work, we present a deep learning based interactive system that automatically generates live lip sync for layered 2D characters using a Long Short Term Memory (LSTM) model. Our system takes streaming audio as input and produces viseme sequences with less than 200ms of latency (including processing time). Our contributions include specific design decisions for our feature definition and LSTM configuration that provide a small but useful amount of lookahead to produce accurate lip sync. We also describe a data augmentation procedure that allows us to achieve good results with a very small amount of hand-animated training data (13-20 minutes). Extensive human judgement experiments show that our results are preferred over several competing methods, including those that only support offline (non-live) processing. Video summary and supplementary results at GitHub link: https://github.com/deepalianeja/CharacterLipSync2D
Graphics, Computer Vision and Pattern Recognition, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of generating high-quality lip sync for real-time 2D animation. Specifically, the authors propose a deep-learning-based method that uses a Long Short-Term Memory (LSTM) network to convert streaming audio into discrete viseme sequences for 2D characters, with less than 200 milliseconds of latency.

### Background of the Paper

The traditional 2D animation production process is very labor-intensive: animators must draw each frame by hand or specify key frames and motion curves. With the emergence of real-time 2D animation, human performers can control cartoon characters live, so that the characters interact directly with other actors and the audience. In this setting, high-quality lip sync is particularly important because it is a key factor in making a character appear to respond naturally to other actors or the audience.

### Specific Challenges of the Problem

1. **Real-Time Performance**: Live animation requires very low latency; processing usually has to complete within 200 milliseconds.
2. **No Accurate Script**: Live interactive performances usually have no strict pre-defined script, so the system cannot rely on accurate Speech-to-Text (STT) algorithms.
3. **Difficulty in Data Collection**: Hand-animating lip-sync data is time-consuming and costly, usually requiring 5 to 7 hours of work per minute of speech.
4. **Style Diversity**: Animators differ in their choice of visemes and in transition timing, so training a general-purpose model is very challenging.

### Solution

The authors propose an LSTM-based processing pipeline that converts a live audio stream into the corresponding viseme sequence. Its main components are:

1. **Feature Representation**:
   - Use Mel-Frequency Cepstral Coefficients (MFCC) and their first-order derivatives as input features.
   - Add log-energy and its derivative as additional features.
   - Estimate the derivatives from the MFCC values two windows before and after the current one, which provides a small amount of future information.
2. **Time Offset**:
   - Introduce a temporal shift so that the model can access future feature vectors when predicting the current viseme, thereby improving accuracy.
3. **Data Augmentation**:
   - Use recordings of multiple speakers from the TIMIT dataset, and align each speaker's recording to a reference recording with Dynamic Time Warping (DTW), thereby increasing the diversity of the training data.
4. **Filtering Mechanism**:
   - Remove short-term noise through filtering with a small look-ahead.
   - Ensure that each viseme is displayed for at least two frames to avoid the flickering effect of single-frame visemes.

### Experimental Results

The authors compared their method with several baselines, including the online and offline automatic lip sync of commercial 2D animation tools, through human-preference experiments. Participants preferred the proposed method's results over the competing methods, including those that only support offline processing, and the method achieves good results with only a small amount of hand-animated training data (13-20 minutes).

### Conclusion

The paper presents a practical approach to high-quality lip sync for real-time 2D animation. The method runs within the latency budget of live performance, is preferred over competing methods in human-preference experiments, and can reduce the time and cost of producing lip-sync animation.
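
The two-frame minimum in the filtering mechanism described under Solution can be illustrated with a short sketch. This is not the authors' implementation: the function name `enforce_min_hold`, the viseme labels, and the policy of holding the current viseme until it has been shown long enough are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional


def enforce_min_hold(visemes: Iterable[str], min_frames: int = 2) -> List[str]:
    """Hold each displayed viseme for at least `min_frames` frames.

    Hypothetical causal post-filter: a switch to a new viseme is only
    accepted once the viseme currently on screen has been shown for
    `min_frames` frames; until then the current viseme stays on screen.
    """
    out: List[str] = []
    current: Optional[str] = None  # viseme currently displayed
    held = 0                       # frames the current viseme has been shown
    for v in visemes:
        if current is None or (v != current and held >= min_frames):
            # First frame, or the current viseme has been held long enough.
            current, held = v, 1
        else:
            # Same viseme, or too early to switch: keep showing `current`.
            held += 1
        out.append(current)
    return out


if __name__ == "__main__":
    raw = ["Ah", "Ah", "M", "Ah", "Ah", "Oh", "Oh", "Oh"]
    print(enforce_min_hold(raw))
```

In this sketch the single-frame `"M"` is stretched to two frames instead of being dropped, and the following visemes start one frame later as a result. The last viseme of a finite sequence can still be shorter than `min_frames` if the input ends early.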