Abstract: Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performances comparable to those developed using the original full emotional speech dataset.

What problem does this paper attempt to address?

The paper aims to achieve efficient speech emotion recognition (SER) on resource-constrained Internet of Things (IoT) devices. It identifies two main challenges:

1. **Resource Constraints**: Edge devices usually have limited computing and memory resources, which makes it difficult to build complex deep-learning models. A method is therefore needed to reduce the amount of data and computation required for model training.
2. **Privacy Protection**: Voice data contains a large amount of personal information, and using it directly for model training may lead to privacy leakage. A method is therefore needed that trains effectively without exposing the original data.

To address both problems, the authors propose a **data distillation framework**, which generates a smaller, synthetic dataset by extracting representative information from a large-scale original dataset. The distilled dataset significantly reduces the required storage space and computing resources while maintaining performance, and it also reduces the risk of privacy leakage.

### Specific Objectives:

- **Reduce Dataset Size**: Reduce the dataset to approximately 15% of the original size, saving memory and improving efficiency.
- **Reduce Training Iterations**: Training a model on the distilled dataset requires fewer iterations than training on the original dataset.
- **Protect User Privacy**: Using the small synthetic dataset reduces the possibility of privacy leakage.

### Method Overview:

- **Teacher Trajectory**: Train multiple deep-learning models (such as CNN-6, ResNet-9, VGG-15) on the original dataset and record how each model's parameters change during training.
- **Student Trajectory**: Initialize a small synthetic dataset based on the teacher trajectory, and update the samples in the synthetic dataset by matching the teacher trajectory.
- **Loss Function**: Introduce a normalized squared L2-norm distance loss so that the parameter trajectory of the student model stays as close as possible to that of the teacher model.

With this method, the authors report efficient speech emotion recognition in a resource-constrained environment and verify both the effectiveness and the privacy-protection ability of the approach.
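The loss in the method overview can be illustrated with a minimal sketch. It assumes the normalization commonly used in trajectory-matching dataset distillation: the squared L2 distance between the student's final parameters and a later teacher checkpoint, divided by the squared L2 distance between the teacher's starting and target checkpoints. The summary does not spell out the normalizer, so that choice, the function name `trajectory_matching_loss`, and the flattened-list representation of parameters are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Sequence


def trajectory_matching_loss(
    student_end: Sequence[float],
    teacher_start: Sequence[float],
    teacher_target: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Normalized squared L2 distance between flattened parameter vectors.

    student_end    : student parameters after training on the synthetic data
    teacher_start  : teacher checkpoint the student was initialised from
    teacher_target : later teacher checkpoint the student should reach
    eps            : small constant to avoid division by zero (assumption)
    """
    if not (len(student_end) == len(teacher_start) == len(teacher_target)):
        raise ValueError("parameter vectors must have the same length")

    # Squared L2 distance from the student's end point to the teacher target.
    numerator = sum((s - t) ** 2 for s, t in zip(student_end, teacher_target))
    # Squared L2 distance the teacher itself travelled (assumed normalizer).
    denominator = sum((a - t) ** 2 for a, t in zip(teacher_start, teacher_target))
    return numerator / (denominator + eps)


if __name__ == "__main__":
    start = [0.0, 0.0]
    target = [3.0, 4.0]
    # A student that lands exactly on the teacher target has zero loss.
    print(trajectory_matching_loss(target, start, target))
    # A student that covers only part of the path has a loss between 0 and 1.
    print(trajectory_matching_loss([3.0, 0.0], start, target))
```

Under this assumed normalization, a student that does not move from the teacher's starting checkpoint scores about 1, and one that reaches the target scores 0, which keeps the loss comparable across checkpoints where the teacher moves by different amounts. In a full distillation loop the loss would be differentiated with respect to the synthetic samples, which this sketch leaves out.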

Speech Emotion Recognition under Resource Constraints with Data Distillation
