Speech Emotion Recognition under Resource Constraints with Data Distillation

Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller
2024-06-21
Abstract: Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performances comparable to those developed using the original full emotional speech dataset.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to make speech emotion recognition (SER) practical on resource-constrained Internet of Things (IoT) devices. It identifies two main challenges:

1. **Resource Constraints**: Edge devices have limited memory and compute, which makes it difficult to build complex deep learning models. A method is therefore needed that reduces the amount of data and computation required for model training.
2. **Privacy Protection**: Speech data contains a large amount of personal information, and using it directly for model training may lead to privacy leakage. A method is therefore needed that trains effectively without exposing the original data.

To address both challenges, the authors propose a **data distillation framework**, which generates a smaller, synthetic dataset by extracting representative information from the large original dataset. The distilled dataset reduces the required storage and computation while keeping performance comparable to training on the full dataset, and it also lowers the risk of privacy leakage.

### Specific Objectives:
- **Reduce Dataset Size**: Shrink the dataset to approximately 15% of the original, saving memory and improving efficiency.
- **Reduce Training Iterations**: Train the model on the distilled dataset in fewer iterations than on the original dataset.
- **Protect User Privacy**: Reduce the possibility of privacy leakage by training on the small synthetic dataset instead of the original recordings.

### Method Overview:
- **Teacher Trajectory**: Train multiple deep learning models (such as CNN-6, ResNet-9, VGG-15) on the original dataset, and record how each model's parameters change during training.
- **Student Trajectory**: Initialize a small synthetic dataset based on the teacher trajectory, and update the samples in the synthetic dataset by matching the teacher trajectory.
- **Loss Function**: Use a normalized squared L2-norm distance as the loss, so that the parameter trajectory of the student model stays as close as possible to that of the teacher model.

With this method, the authors report efficient speech emotion recognition in a resource-constrained setting, and they evaluate both the effectiveness of the method and its privacy-protection ability.
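
The loss in the method overview can be illustrated with a minimal sketch. It assumes the distance is the L2 norm and that it is normalized by how far the teacher itself moved between two recorded checkpoints, as is common in trajectory-matching dataset distillation; the summary does not spell out these details. The function and variable names (`squared_l2`, `trajectory_matching_loss`, `teacher_start`, `teacher_target`, `student_end`, `eps`) are hypothetical, and model parameters are represented as flattened lists of floats for simplicity.

```python
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance between two flattened parameter vectors."""
    if len(a) != len(b):
        raise ValueError("parameter vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trajectory_matching_loss(
    student_end: Sequence[float],
    teacher_start: Sequence[float],
    teacher_target: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Normalized squared L2 distance between student and teacher parameters.

    Numerator: how far the student (trained on the synthetic data, starting
    from teacher_start) ends up from the teacher's later checkpoint.
    Denominator: how far the teacher moved between the two checkpoints.
    eps is an assumed small constant that guards against division by zero.
    """
    numerator = squared_l2(student_end, teacher_target)
    denominator = squared_l2(teacher_start, teacher_target)
    return numerator / (denominator + eps)


# Toy example with two-dimensional "parameters" (illustrative values only).
teacher_start = [0.0, 0.0]
teacher_target = [1.0, 2.0]
student_end = [0.5, 1.0]

loss = trajectory_matching_loss(student_end, teacher_start, teacher_target)
print(f"trajectory matching loss: {loss:.4f}")
```

In a full pipeline, this value would be differentiated with respect to the synthetic samples so that they can be updated by gradient descent; that step needs an automatic differentiation framework and is omitted here. A loss of 0 means the student reached the teacher's checkpoint exactly, while a loss near 1 means the student made no progress from the starting checkpoint.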