Abstract: Speech enhancement is crucial in human-computer interaction, especially for ubiquitous devices. Ultrasound-based speech enhancement has emerged as an attractive choice because of its superior ubiquity and performance. However, inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition makes existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human effort. At its core is a two-stage framework that establishes correspondence between visual and ultrasonic modalities by leveraging audible audio as a bridge. This approach overcomes challenges from the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show USpeech achieves remarkable performance using synthetic ultrasound data comparable to physical data, significantly outperforming state-of-the-art ultrasound-based speech enhancement baselines. USpeech is open-sourced at https://github.com/aiot-lab/USpeech/.

Speech Enhancement Using Open-Unmix Music Source Separation Architecture

U-NET: A Supervised Approach for Monaural Source Separation

Speech Intelligibility Based Enhancement System Using Modified Deep Neural Network and Adaptive Multi-band Spectral Subtraction

Benchmarks and leaderboards for sound demixing tasks

Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

Explicit-memory multiresolution adaptive framework for speech and music separation

OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation

A Deep-Learning Based Framework for Source Separation, Analysis, and Synthesis of Choral Ensembles

Music Remixing and Upmixing Using Source Separation

Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

Blind Source Separation and Denoising of Underwater Acoustic Signals

A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

Listen and Look: Audio–Visual Matching Assisted Speech Source Separation

An automatic mixing speech enhancement system for multi-track audio

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

Towards Solving Cocktail-Party: The First Method to Build a Realistic Dataset with Ground Truths for Speech Separation

An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation

Source Separation & Automatic Transcription for Music

End-to-end Music-mixed Speech Recognition

USpeech: Ultrasound-Enhanced Speech with Minimal Human Effort via Cross-Modal Synthesis