Abstract: The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry-agnostic manner. While prior work on multi-channel speech SSL was evaluated only in simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses-wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We show that a general-purpose M-BEST-RQ encoder is able to match or surpass supervised models across all tasks. For the conversational ASR task in particular, using only 8 hours of labeled speech, our model outperforms a supervised ASR baseline trained on 2000 hours of labeled data, which demonstrates the effectiveness of our approach.

The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge

The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

The USTC-NELSLIP Systems for CHiME-6 Challenge

The USTC-NERCSLIP Systems for The ICMC-ASR Challenge

Multimedia Simultaneous Translation System for Minority Language Communication with Mandarin

Summary on the Chat-Scenario Chinese Lipreading (ChatCLR) Challenge

The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge

The SJTU System for Multimodal Information Based Speech Processing Challenge 2021

The USTC-iFlytek system for CHiME-4 challenge

Multimodal Dialogue Understanding via Holistic Modeling and Sequence Labeling

The USTC-iFlytek Systems for CHiME-5 Challenge

Multi-Modal Knowledge Transfer for Target Speaker Lipreading with Improved Audio-Visual Pretraining and Cross-Lingual Fine-Tuning

Directional Source Separation for Robust Speech Recognition on Smart Glasses

M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses

Robust Dual-Modal Speech Keyword Spotting for XR Headsets

The USTC-Ximalaya System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) Challenge

The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

The THU-SPMI CHiME-4 system: Lightweight design with advanced multi-channel processing, feature enhancement, and language modeling

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge