Looking and Listening: Audio Guided Text Recognition

Wenwen Yu, Mingyu Liu, Biao Yang, Enming Zhang, Deqiang Jiang, Xing Sun, Yuliang Liu, Xiang Bai
2023-06-06
Abstract: Text recognition in the wild is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest vision and language processing are effective for scene text recognition. Yet, solving edit errors such as add, delete, or replace is still the main challenge for existing approaches. In fact, the content of the text and its audio naturally correspond to each other, i.e., a single character error may result in a clearly different pronunciation. In this paper, we propose AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction to guide scene text recognition, which only participates in the training phase and brings no extra cost during the inference stage. The underlying principle of AudioOCR can be easily applied to existing approaches. Experiments using 7 previous scene text recognition methods on 12 existing regular, irregular, and occluded benchmarks demonstrate that our proposed method brings consistent improvement. More importantly, through our experimentation, we show that AudioOCR possesses a generalizability that extends to more challenging scenarios, including recognizing non-English text, out-of-vocabulary words, and text with various accents. Code will be available at https://github.com/wenwenyu/AudioOCR.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses text recognition in the wild, particularly the difficulty of handling edit errors such as added, deleted, or replaced characters. It proposes **AudioOCR**, a method that uses audio information to assist scene text recognition.

#### Main Contributions:

1. **Proposing the AudioOCR Module**: A Transformer-based probabilistic audio decoder predicts Mel spectrogram sequences to guide scene text recognition.
2. **Simple and Effective Plug-in Design**: AudioOCR can be easily integrated into existing recognition methods, introducing minimal computational overhead during the training phase and no additional cost during the inference phase.
3. **Wide Applicability**: Experiments with 7 existing recognition methods on 12 regular, irregular, and occluded benchmarks show consistent improvement. AudioOCR also generalizes to non-English text, out-of-vocabulary words, and text with various accents.

#### Method Overview:

- **Audio Decoder**: Comprises Prenet, Visual-Audio Decoder, and Mel Linear layers, which extract audio-modality information (a Mel spectrogram sequence) from images.
- **Joint Training Strategy**: Combines the recognition loss and the audio loss so that both are optimized together during training.

In this way, the paper targets the limitations of existing text recognition methods in handling edit errors and demonstrates the value of audio information for improving recognition accuracy.
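
The joint training strategy can be illustrated with a minimal sketch. Everything below is an assumption for illustration and is not taken from the paper: the function names, the use of a mean negative log-likelihood for the recognition loss, the use of a mean absolute error (L1) between predicted and reference Mel frames for the audio loss, and the weighting factor `audio_weight`. The paper's actual loss definitions may differ.

```python
import math
from typing import Sequence


def recognition_loss(char_probs: Sequence[Sequence[float]],
                     target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target characters (assumed form)."""
    nll = [-math.log(probs[t]) for probs, t in zip(char_probs, target_ids)]
    return sum(nll) / len(nll)


def audio_loss(pred_mel: Sequence[Sequence[float]],
               target_mel: Sequence[Sequence[float]]) -> float:
    """Mean absolute error over all Mel bins of all frames (assumed L1 form)."""
    diffs = [abs(p - t)
             for pred_frame, target_frame in zip(pred_mel, target_mel)
             for p, t in zip(pred_frame, target_frame)]
    return sum(diffs) / len(diffs)


def joint_loss(rec: float, audio: float, audio_weight: float = 1.0) -> float:
    """Training objective: recognition loss plus a weighted audio loss.

    `audio_weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    return rec + audio_weight * audio


if __name__ == "__main__":
    # Toy example: two characters over a two-symbol alphabet,
    # and two Mel frames with two bins each.
    probs = [[0.5, 0.5], [0.25, 0.75]]
    targets = [0, 1]
    pred_mel = [[1.0, 2.0], [3.0, 4.0]]
    ref_mel = [[1.0, 1.0], [2.0, 4.0]]

    rec = recognition_loss(probs, targets)
    aud = audio_loss(pred_mel, ref_mel)
    print("joint training loss:", joint_loss(rec, aud, audio_weight=0.5))
```

Because the audio term appears only in the training objective, dropping the audio decoder at inference leaves the recognizer's forward pass unchanged, which matches the claim that the module adds no inference cost.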