A Recurrent Neural Network Approach to the Answering Machine Detection Problem

Kemal Altwlkany, Sead Delalic, Elmedin Selmanovic, Adis Alihodzic, Ivica Lovric
DOI: https://doi.org/10.1109/MIPRO60963.2024.10569812
2024-10-08
Abstract: In the field of telecommunications and cloud communications, accurately and in real-time detecting whether a human or an answering machine has answered an outbound call is of paramount importance. This problem is of particular significance during campaigns as it enhances service quality, efficiency and cost reduction through precise caller identification. Despite the significance of the field, it remains inadequately explored in the existing literature. This paper presents an innovative approach to answering machine detection that leverages transfer learning through the YAMNet model for feature extraction. The YAMNet architecture facilitates the training of a recurrent-based classifier, enabling real-time processing of audio streams, as opposed to fixed-length recordings. The results demonstrate an accuracy of over 96% on the test set. Furthermore, we conduct an in-depth analysis of misclassified samples and reveal that an accuracy exceeding 98% can be achieved with the integration of a silence detection algorithm, such as the one provided by FFmpeg.
Sound, Machine Learning, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the **Answering Machine Detection (AMD) problem**: deciding in real time, during an outbound call, whether the call was answered by a person or by voicemail or an automatic answering machine. The problem matters for telecommunications and cloud communication platforms, especially in marketing campaigns, where accurate detection improves service quality and efficiency and reduces costs.

#### Background and Importance

1. **Real-time Operation and Accuracy**: If a campaign can determine quickly and accurately whether a person or a voice mailbox has answered, it can play advertising or promotional messages only on calls answered by a person and avoid unnecessary call charges.
2. **Limitations of Existing Solutions**: Many proprietary products offer AMD, but little research on the topic has been published, and the proprietary products usually do not disclose their algorithms or technical details. Their results are therefore neither transparent nor independently verifiable.
3. **Application Scenarios**: AMD is useful beyond marketing, for example in call centers, where an automatic dialer can skip calls answered by machines and so free up human agents.

#### Main Contributions of the Paper

1. **Review of Current AMD Solutions**: The paper reviews existing AMD solutions, covering both proprietary software and published research.
2. **A New Deep Learning Method**: The paper proposes a method based on a Recurrent Neural Network (RNN). It uses transfer learning, with the YAMNet model as a feature extractor, so that audio streams can be processed in real time rather than as fixed-length recordings.
3. **Support for Modern AMD Features**: Beyond improving detection accuracy, the method supports features expected of modern AMD systems, such as a silence detection algorithm (for example, the one provided by FFmpeg), which further improves performance.
4. **Flexibility and Scalability**: Users can tune AMD behavior through hyperparameters (timeout, confidence threshold, and minimum detection time) to suit different application scenarios.

#### Experimental Results

- **Accuracy on the Test Set**: The model achieved an accuracy of 96.67% on the test set. With a silence detection module integrated, accuracy rose to 98.10%.
- **Real-time Processing**: The average inference time is 31.63 milliseconds per frame, which is short enough for real-time applications.

#### Conclusions and Future Work

The proposed method meets industry standards for performance and gives users a flexible, easily extensible AMD solution. Future research directions include:

- Analyzing the dataset by language and region.
- Feeding the output of the silence detection module into the classifier as an additional input.
- Exploring data augmentation techniques.
- Comparing different solutions directly on the same representative dataset in terms of accuracy, inference speed, and resource consumption.
- Studying how to deploy stateful models reliably in production.

With these improvements, the method is expected to see wider use in practical applications.
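
The summary names three tunable hyperparameters (timeout, confidence threshold, and minimum detection time) but does not say how they interact. The sketch below shows one plausible decision rule over a stream of per-frame classifier outputs; it is an illustration under stated assumptions, not the authors' implementation. The names `AmdConfig` and `decide`, the default values, and the `"unknown"` fallback label are all hypothetical. The 0.48 s frame hop is borrowed from YAMNet's default framing and is assumed here, not confirmed by the source.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class AmdConfig:
    """Hypothetical hyperparameters; the default values are illustrative only."""
    timeout_s: float = 5.0             # stop listening after this much audio
    confidence_threshold: float = 0.9  # probability required to commit to a label
    min_detection_s: float = 1.0       # never decide before this much audio
    frame_hop_s: float = 0.48          # assumed seconds of new audio per frame


def decide(p_machine_stream: Iterable[float],
           cfg: AmdConfig = AmdConfig()) -> Tuple[str, float]:
    """Consume per-frame P(machine) values and return (label, elapsed seconds).

    The label is "machine", "human", or "unknown" (timeout or stream ended
    before any frame was confident enough).
    """
    elapsed = 0.0
    for index, p_machine in enumerate(p_machine_stream, start=1):
        elapsed = index * cfg.frame_hop_s
        if elapsed < cfg.min_detection_s:
            continue  # too early: keep listening regardless of confidence
        if p_machine >= cfg.confidence_threshold:
            return "machine", elapsed
        if 1.0 - p_machine >= cfg.confidence_threshold:
            return "human", elapsed
        if elapsed >= cfg.timeout_s:
            return "unknown", elapsed  # gave up: caller picks a fallback action
    return "unknown", elapsed


if __name__ == "__main__":
    # The first two frames are ignored because of the minimum detection time.
    print(decide([0.99, 0.99, 0.99]))
    # An undecided stream runs into the timeout.
    print(decide([0.5] * 20))
```

In this rule, a higher confidence threshold or minimum detection time trades latency for fewer premature decisions, while the timeout bounds how long a caller waits before a fallback action is taken.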