Abstract: Extracting text from the audio track of a video plays an important role in multimedia editing and processing. Whisper, a popular open-source toolkit, recognizes human speech quickly. However, its recognition performance depends on the available computing resources, which makes Whisper difficult to run in low-memory environments. This paper presents a practical solution for extracting the human voice from a video and generating high-quality text from it. The generated text can be used for video language translation and for simulating the translated voice. To improve the quality of voice extraction and transcription, we present ecVoice, a method that uses idiom similarity computation and analysis to improve the quality of audio text extraction. Experiments show that ecVoice improves the idiom grammar correction rate to 90% on average. The method is simple and fast, so it adds little computing overhead while improving the voice recognition rate. Our method and solution can significantly enhance Whisper's recognition in low-memory settings.
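The extraction step described in the abstract (video to audio, then audio to text) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it assumes the `ffmpeg` command-line tool and the `openai-whisper` Python package are installed, and the helper names `build_ffmpeg_command`, `extract_audio`, and `transcribe` are hypothetical. The small `tiny` Whisper model is chosen here only because the abstract targets low-memory settings.

```python
import subprocess


def build_ffmpeg_command(video_path, audio_path, sample_rate=16000):
    """Build an ffmpeg command that drops the video stream and writes mono audio.

    Assumption: 16 kHz mono is used because Whisper resamples to 16 kHz anyway.
    """
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,        # input video
        "-vn",                   # discard the video stream
        "-ac", "1",              # one audio channel (mono)
        "-ar", str(sample_rate), # audio sample rate in Hz
        audio_path,
    ]


def extract_audio(video_path, audio_path):
    """Run ffmpeg to extract the audio track (requires ffmpeg on PATH)."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)


def transcribe(audio_path, model_name="tiny"):
    """Transcribe audio with Whisper; imported lazily so the sketch loads without it."""
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path)["text"]
```

A typical call sequence would be `extract_audio("talk.mp4", "talk.wav")` followed by `transcribe("talk.wav")`, where both file names are placeholders.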

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the key problem of extracting high-quality audio text from videos, especially improving the accuracy and processing speed of speech recognition when computing resources are limited. Specifically, the paper proposes a method named ecVoice that improves the quality of text extracted from audio through idiom similarity replacement. The main problems the paper attempts to solve are:

1. **Improve the accuracy of speech recognition**:
   - Current speech recognition technologies (such as Whisper) perform poorly when computing resources are limited, especially in low-memory environments.
   - By introducing idiom similarity analysis and replacement, ecVoice can significantly improve the accuracy of speech recognition, especially in terms of grammar correction.
2. **Optimize the quality of speech-to-text extraction**:
   - Traditional speech recognition methods may make mistakes when dealing with complex contexts (such as sentences containing idioms).
   - ecVoice improves the quality of audio text extraction through idiom similarity calculation and analysis, making the generated text more accurate and natural.
3. **Reduce computing resource consumption**:
   - Traditional methods rely on large neural network models, which require a large amount of computing resources.
   - ecVoice adopts a simple but fast method that reduces the consumption of computing resources while improving the speech recognition rate.
4. **Achieve efficient speech translation and simulation**:
   - The extracted high-quality audio text can be used for video language translation and for simulating translated speech.
   - This provides more efficient support for applications such as multimedia editing and game design.

### Summary

This paper mainly addresses how to improve the quality and accuracy of text extracted from video audio through idiom similarity replacement when computing resources are limited. According to the paper's experiments, ecVoice significantly improves speech recognition performance, with the most prominent gains in grammar correction.
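The idea of idiom similarity replacement can be illustrated with a minimal sketch. It assumes a small hypothetical idiom list and uses the character-level ratio from Python's standard `difflib` as a stand-in similarity measure; the paper's actual idiom dictionary, similarity computation, and threshold are not given in this summary, so `IDIOMS`, `similarity`, `correct_idioms`, and the `0.8` threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical idiom list; the dictionary used by ecVoice is not specified here.
IDIOMS = ["piece of cake", "break the ice", "under the weather"]


def similarity(a, b):
    """Character-level similarity in [0, 1] (a stand-in for the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def correct_idioms(text, idioms=IDIOMS, threshold=0.8):
    """Replace word windows that closely resemble a known idiom with that idiom."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        best = None  # (score, idiom, number of words covered)
        for idiom in idioms:
            n = len(idiom.split())
            if i + n > len(words):
                continue  # not enough words left for this idiom
            window = " ".join(words[i:i + n])
            score = similarity(window, idiom)
            if score >= threshold and (best is None or score > best[0]):
                best = (score, idiom, n)
        if best is not None:
            out.append(best[1])  # emit the canonical idiom
            i += best[2]         # skip the words it replaced
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


print(correct_idioms("the exam was a peace of cake"))
# the exam was a piece of cake
```

Because the pass is a plain string comparison over a sliding window, it adds little memory or compute on top of the recognizer, which matches the low-resource motivation described above. The threshold trades missed corrections against false replacements and would need tuning on real transcripts.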

ecVoice: Audio Text Extraction and Optimization of Video Based on Idioms Similarity Replacement
