Abstract: A key preprocessing step in multimodal interfaces is to detect when a user is speaking to the system. While push-to-talk approaches are effective, their use limits the flexibility of the system. Solutions based on speech activity detection (SAD) offer more intuitive and user-friendly alternatives. A limitation of current SAD solutions is the drop in performance observed in noisy environments or when the speech mode differs from neutral speech (e.g., whisper speech). Emerging audiovisual solutions provide a principled framework to improve the detection of speech boundaries by incorporating lip activity detection. In our previous work, we proposed an unsupervised visual speech activity detection (V-SAD) system that combines temporal and dynamic facial features. The key limitation of that system was the imprecise detection of boundaries between speech and non-speech regions, caused by anticipatory facial movements and the low temporal resolution of the video (29.97 fps). This study builds upon that system by (a) combining speech and facial features to create an unsupervised audiovisual speech activity detection (AV-SAD) system, and (b) refining the decision boundary with the Bayesian information criterion (BIC) algorithm, resulting in improved speech boundary detection. The evaluation considers the challenging case of whisper speech, where the proposed AV-SAD achieves a 10% absolute improvement over a state-of-the-art audio SAD.
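The abstract does not spell out how the BIC refinement is applied, so the following is only a minimal sketch of the generic BIC change-point test, assuming each side of a candidate boundary is modeled by a single full-covariance Gaussian. The function names (`delta_bic`, `refine_boundary`) and the parameters (`search_radius`, `margin`, `penalty_weight`) are hypothetical and are not taken from the paper. The idea is to start from a coarse boundary produced by a detector and move it to the nearby frame where a two-model description of the features beats a one-model description by the largest margin.

```python
import numpy as np


def _logdet_cov(x, eps=1e-6):
    """Log-determinant of the ML covariance of x (frames x dims)."""
    d = x.shape[1]
    # reshape keeps the 1-D feature case as a 1x1 matrix;
    # eps * I is a small regularizer (an assumption of this sketch).
    cov = np.cov(x, rowvar=False, bias=True).reshape(d, d) + eps * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(features, i, penalty_weight=1.0):
    """Delta-BIC for splitting `features` at frame i.

    Positive values favor two Gaussians (a boundary at i)
    over a single Gaussian for the whole window.
    """
    n, d = features.shape
    # Model-complexity penalty: d mean terms + d(d+1)/2 covariance terms.
    penalty = 0.5 * penalty_weight * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * _logdet_cov(features)
            - 0.5 * i * _logdet_cov(features[:i])
            - 0.5 * (n - i) * _logdet_cov(features[i:])
            - penalty)


def refine_boundary(features, coarse_idx, search_radius=10, margin=5,
                    penalty_weight=1.0):
    """Move a coarse boundary to the nearby frame with the largest Delta-BIC.

    Falls back to the coarse boundary if no candidate has Delta-BIC > 0.
    """
    n = features.shape[0]
    # Keep at least `margin` frames on each side so covariances are estimable.
    lo = max(margin, coarse_idx - search_radius)
    hi = min(n - margin, coarse_idx + search_radius + 1)
    if lo >= hi:
        return coarse_idx
    candidates = list(range(lo, hi))
    scores = [delta_bic(features, i, penalty_weight) for i in candidates]
    best = int(np.argmax(scores))
    return candidates[best] if scores[best] > 0 else coarse_idx
```

In this sketch `features` would be a window of frame-level audiovisual features around the coarse boundary, with one row per frame. The `penalty_weight` plays the role of the usual tunable BIC penalty factor; larger values make the test more conservative about moving the boundary.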

Improving Boundary Estimation in Audiovisual Speech Activity Detection Using Bayesian Information Criterion
