Abstract: Hearable devices, equipped with one or more microphones, are commonly used for speech communication. Here, we consider the scenario where a hearable is used to capture the user's own voice in a noisy environment. In this scenario, own voice reconstruction (OVR) is essential for enhancing the quality and intelligibility of the recorded noisy own voice signals. In previous work, we developed a deep learning-based OVR system, aiming to reduce the amount of device-specific recordings for training by using data augmentation with phoneme-dependent models of own voice transfer characteristics. Given the limited computational resources available on hearables, in this paper we propose low-complexity variants of an OVR system based on the FT-JNF architecture and investigate the required amount of device-specific recordings for effective data augmentation and fine-tuning. Simulation results show that the proposed OVR system considerably improves speech quality, even under constraints of low complexity and a limited amount of device-specific recordings.

What problem does this paper attempt to address?

The paper addresses how to improve the quality and intelligibility of a user's own voice, captured by a hearable in a noisy environment, with a low-complexity own voice reconstruction (OVR) system. Specifically:

1. **Problem Background**:
   - Hearables (such as smart earplugs) are commonly used for voice communication, including in noisy environments.
   - The external microphone captures environmental noise along with the user's own voice signal.
   - The in-ear microphone benefits from the occluded ear canal, which reduces environmental noise, but the own voice signal it records suffers from low-frequency amplification, bandwidth limitation, and body-generated noise.
2. **Existing Challenges**:
   - Previous deep learning methods are effective, but their high computational complexity and large number of parameters make them unsuitable for resource-limited hearables.
   - Training an OVR system requires a large number of device-specific own voice signals, which are difficult to obtain in practice.
3. **Research Objectives**:
   - Propose a low-complexity OVR system based on the FT-JNF architecture to reduce the computational resource requirements.
   - Determine how many device-specific own voice signals are required for data augmentation and fine-tuning, so that voice quality still improves when only a small number of device-specific recordings is available.
4. **Solutions**:
   - Design a low-complexity OVR system using the Frequency and Time Joint Nonlinear Filter (FT-JNF) architecture.
   - Generate additional training data through phoneme-dependent data augmentation and fine-tune on a small number of device-specific recordings.
   - Evaluate variants of different complexity and analyze how the number of device-specific recordings affects performance.

In summary, this paper aims to develop a low-complexity own voice reconstruction system suitable for hearables that significantly improves voice quality and intelligibility with limited computational resources and a small number of device-specific recordings.
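
The general idea of augmenting clean speech into simulated in-ear training data can be illustrated with a minimal sketch. Note that this is not the paper's method: the paper uses phoneme-dependent models of own voice transfer characteristics, whereas the example below assumes a single fixed, time-invariant filter that only mimics the two effects named above (low-frequency amplification and bandwidth limitation). The function names `simulate_in_ear` and `add_noise_at_snr`, the 2 kHz cutoff, the 6 dB gain, and the 5 dB SNR are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate_in_ear(clean, fs, cutoff_hz=2000.0, low_gain_db=6.0):
    """Crude, hypothetical stand-in for an own voice transfer characteristic.

    Amplifies everything up to cutoff_hz by low_gain_db and removes everything
    above it. A real model would be estimated from device-specific recordings.
    """
    spectrum = np.fft.rfft(clean)
    freqs = np.fft.rfftfreq(len(clean), d=1.0 / fs)
    gain = np.where(freqs <= cutoff_hz, 10.0 ** (low_gain_db / 20.0), 0.0)
    return np.fft.irfft(spectrum * gain, n=len(clean))

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so that the mixture has the requested SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

# Toy "clean speech": one component below and one above the assumed cutoff.
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 5000 * t)

rng = np.random.default_rng(0)
in_ear = simulate_in_ear(clean, fs)
noisy = add_noise_at_snr(in_ear, rng.standard_normal(fs), snr_db=5.0)

# (noisy, clean) would then form one simulated training pair for an OVR model.
measured_snr_db = 10.0 * np.log10(np.mean(in_ear ** 2) / np.mean((noisy - in_ear) ** 2))
```

In such a setup, many simulated pairs could be generated from a large clean speech corpus, with a small set of real device-specific recordings reserved for estimating the transfer model and for fine-tuning.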

Low Complexity Own Voice Reconstruction for Hearables with an In-ear Microphone

Speech-dependent Data Augmentation for Own Voice Reconstruction with Hearable Microphones in Noisy Environments

Multi-Microphone Noise Data Augmentation for DNN-based Own Voice Reconstruction for Hearables in Noisy Environments

Training Strategies for Own Voice Reconstruction in Hearing Protection Devices using an In-ear Microphone

Speech-dependent Modeling of Own Voice Transfer Characteristics for In-ear Microphones in Hearables

Modeling of Speech-dependent Own Voice Transfer Characteristics for Hearables with In-ear Microphones

Modeling of speech-dependent own voice transfer characteristics for hearables with an in-ear microphone

Wavoice: an Mmwave-Assisted Noise-Resistant Speech Recognition System

Restoring speech intelligibility for hearing aid users with deep learning

Using deep learning to improve the intelligibility of a target speaker in noisy multi-talker environments for people with normal hearing and hearing loss

Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation

EarSpeech: Exploring In-Ear Occlusion Effect on Earphones for Data-efficient Airborne Speech Enhancement

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

Low-latency Real-time Voice Conversion on CPU

Recognizing Voice Spoofing Attacks Via Acoustic Nonlinearity Dissection for Mobile Devices

Neural2Speech: A Transfer Learning Framework for Neural-Driven Speech Reconstruction

FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses

Towards sub-millisecond latency real-time speech enhancement models on hearables

Hear Your Face: Face-based voice conversion with F0 estimation