Application of Audio Fingerprinting Techniques for Real-Time Scalable Speech Retrieval and Speech Clusterization

Kemal Altwlkany, Sead Delalić, Adis Alihodžić, Elmedin Selmanović, Damir Hasić
2024-10-29
Abstract: Audio fingerprinting techniques have seen great advances in recent years, enabling accurate and fast audio retrieval even in conditions when the queried audio sample has been highly deteriorated or recorded in noisy conditions. Expectedly, most of the existing work is centered around music, with popular music identification services such as Apple's Shazam or Google's Now Playing designed for individual audio recognition on mobile devices. However, the spectral content of speech differs from that of music, necessitating modifications to current audio fingerprinting approaches. This paper offers fresh insights into adapting existing techniques to address the specialized challenge of speech retrieval in telecommunications and cloud communications platforms. The focus is on achieving rapid and accurate audio retrieval in batch processing instead of facilitating single requests, typically on a centralized server. Moreover, the paper demonstrates how this approach can be utilized to support audio clustering based on speech transcripts without undergoing actual speech-to-text conversion. This optimization enables significantly faster processing without the need for GPU computing, a requirement for real-time operation that is typically associated with state-of-the-art speech-to-text tools.
Information Retrieval, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: **How to achieve fast and accurate voice retrieval and voice clustering on telecommunication and cloud communication platforms, especially to classify early-media (such as voicemail, busy tone prompts, etc.) in the absence of accompanying SIP response codes**. Specifically, the paper focuses on:

1. **Differences between voice and music**: Existing audio fingerprinting techniques are mainly designed for music, and the spectral content of voice is different from that of music. Therefore, existing techniques need to be modified to adapt to the characteristics of voice.
2. **Requirements for real-time processing**: The system needs to be able to process voice data quickly in a real-time environment and does not need to rely on expensive GPU computing resources.
3. **No actual speech-to-text conversion required**: By optimizing audio fingerprinting techniques, audio files can be classified and clustered according to voice content without performing actual speech-to-text (STT) conversion.

### Specific problem description

- **Diversity of early-media files**: Early-media files (such as voicemail, busy tone prompts, etc.) vary greatly from country to country and region to region. These files are divided into different clusters, and each cluster represents a specific state (such as busy number, user out of service area, etc.).
- **Missing or incorrect SIP response codes**: Sometimes SIP response codes may be missing or incorrect, resulting in the inability to directly obtain relevant information from network protocols. At this time, analyzing the content of early-media becomes particularly important.

### Solution

The paper proposes a solution based on audio fingerprinting techniques, which can:

- **Real-time identification of early-media**: Quickly identify early-media files by generating audio fingerprints and querying the database.
- **Automatic classification and clustering**: Assign early-media files to the correct clusters according to voice content without performing actual speech-to-text conversion.
- **Database expansion**: For unrecognized audio files, the system will automatically add them to the database and provide identification capabilities for the same files that may appear in the future.

### Main contributions

1. **Audio fingerprinting techniques adapted to voice**: Existing audio fingerprinting techniques have been adjusted to make them more suitable for processing voice signals.
2. **Efficient real-time processing**: By using Locality-Sensitive Hashing (LSH) and batch query techniques, efficient real-time processing has been achieved.
3. **Reduced dependence on GPU**: By optimizing the algorithm, the system can operate without relying on expensive GPU resources, reducing operating costs.
4. **No speech-to-text conversion required**: Classify and cluster voice content directly through audio fingerprinting techniques, avoiding the complex speech-to-text process.

In short, this paper aims to solve the problem of how to classify and cluster early-media files efficiently and accurately in telecommunication and cloud communication platforms, especially in the absence of SIP response codes.
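
The query-then-expand loop described above (look up a fingerprint, return its cluster if found, otherwise add it to the database) can be illustrated with a small sketch. This is not the paper's implementation: the summary does not specify how fingerprints are extracted or which LSH family is used. The sketch assumes each audio file has already been reduced to a fixed-length 64-bit binary fingerprint and uses bit sampling, a classic LSH scheme for Hamming distance, as a stand-in. The class name `FingerprintIndex`, its parameters, and the cluster labels are all hypothetical.

```python
import random
from collections import defaultdict


class FingerprintIndex:
    """Toy LSH index over 64-bit binary fingerprints (bit sampling, Hamming distance).

    Assumption: fingerprints are precomputed integers; audio feature
    extraction is out of scope for this sketch.
    """

    def __init__(self, n_bits=64, n_tables=8, bits_per_key=12, max_distance=6, seed=0):
        rng = random.Random(seed)
        self.max_distance = max_distance
        # Each hash table keys a fingerprint by a random subset of bit positions.
        self.projections = [rng.sample(range(n_bits), bits_per_key) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.fingerprints = []  # item id -> fingerprint
        self.labels = []        # item id -> cluster label

    def _keys(self, fp):
        # One bucket key per table: the sampled bits of the fingerprint.
        for positions in self.projections:
            yield tuple((fp >> p) & 1 for p in positions)

    def add(self, fp, label):
        item_id = len(self.fingerprints)
        self.fingerprints.append(fp)
        self.labels.append(label)
        for table, key in zip(self.tables, self._keys(fp)):
            table[key].append(item_id)
        return item_id

    def query(self, fp):
        """Return (item_id, hamming_distance) of the closest candidate, or None."""
        candidates = set()
        for table, key in zip(self.tables, self._keys(fp)):
            candidates.update(table.get(key, ()))
        best = None
        for item_id in candidates:
            dist = bin(fp ^ self.fingerprints[item_id]).count("1")
            if dist <= self.max_distance and (best is None or dist < best[1]):
                best = (item_id, dist)
        return best

    def identify_batch(self, fps):
        """Label a batch; unrecognized fingerprints are added as new clusters."""
        results = []
        for fp in fps:
            hit = self.query(fp)
            if hit is None:
                label = f"unknown-{len(self.fingerprints)}"
                self.add(fp, label)
                results.append((label, False))
            else:
                results.append((self.labels[hit[0]], True))
        return results


index = FingerprintIndex()
index.add(0x0123456789ABCDEF, "busy")  # hypothetical cluster label
results = index.identify_batch([0x0123456789ABCDEF, 0xFEDCBA9876543210])
print(results)  # [('busy', True), ('unknown-1', False)]
```

The first query is an exact duplicate of a stored fingerprint, so it lands in the same bucket in every table and is assigned to the existing cluster. The second is the bitwise complement, so it shares no bucket with the stored item and is registered as a new entry that later identical queries will match. Noisy near-duplicates are found only probabilistically; the number of tables and bits per key trade recall against candidate-set size.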