Speech-based Age and Gender Prediction with Transformers

Felix Burkhardt, Johannes Wagner, Hagen Wierstorf, Florian Eyben, Björn Schuller
2023-06-29
Abstract: We report on the curation of several publicly available datasets for age and gender prediction. Furthermore, we present experiments to predict age and gender with models based on a pre-trained wav2vec 2.0. Depending on the dataset, we achieve an MAE between 7.1 years and 10.8 years for age, and at least 91.1% ACC for gender (female, male, child). Compared to a modelling approach built on handcrafted features, our proposed system shows an improvement of 9% UAR for age and 4% UAR for gender. To make our findings reproducible, we release the best performing model to the community as well as the sample lists of the data splits.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to predict the age and gender of speakers using a transformer architecture based on the pre-trained wav2vec 2.0 model. Specifically, the paper focuses on the following aspects:

1. **Dataset Compilation**: The paper compiles several publicly available datasets for the tasks of age and gender prediction. These datasets include SpeechDat II, CommonVoice, aGender, TIMIT, and VoxCeleb2.
2. **Model Performance Evaluation**: The paper evaluates the model's performance on different datasets, including single-task models (predicting only age or gender) and multi-task models (predicting both age and gender simultaneously).
3. **Cross-Dataset Generalization**: The paper explores the model's generalization ability across different datasets, i.e., how a model trained on one dataset performs on other unseen datasets.
4. **Impact of Model Layers**: The paper studies the impact of the number of transformer layers on model performance to find the optimal balance between accuracy and speed.
5. **Comparison with Traditional Methods**: The paper compares deep learning-based methods with traditional hand-crafted feature-based methods, demonstrating the advantages of deep learning approaches.
6. **Emotion Data Prediction**: The paper also tests the model's performance on emotional speech data, exploring the impact of emotional expression on prediction accuracy.

### Main Contributions

1. **Proposed a New System**: A fine-tuned transformer model to estimate age and gender.
2. **Provided Curated Sample Sets**: Including lists of samples for training, development, and testing, and made them publicly available to the research community.
3. **Compared Single-Task and Multi-Task Models**: Evaluated the performance of single-task and multi-task models.
4. **Reported Cross-Dataset Results**: Examined the model's generalization ability.
5. **Studied the Impact of Transformer Layers**: Determined the number of layers needed to achieve the best balance between accuracy and speed.
6. **Released the Best Performing Model**: Published the best-performing model for public use.

Through these studies, the paper aims to advance the technology in the field of age and gender prediction and provide benchmarks and references for subsequent research.
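
The abstract reports results as MAE (mean absolute error, in years), ACC (accuracy), and UAR (unweighted average recall). The sketch below shows how these three metrics are commonly computed, assuming their standard definitions; it is not the authors' evaluation code, and the function names and toy labels are made up purely for illustration.

```python
from collections import defaultdict


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference (in years when applied to age)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally
    regardless of how many samples it has."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data (hypothetical, not taken from the paper).
ages_true = [25, 40, 63]
ages_pred = [30, 38, 55]
gender_true = ["female", "female", "female", "male", "male", "child"]
gender_pred = ["female", "female", "male", "male", "male", "female"]

age_mae = mean_absolute_error(ages_true, ages_pred)
gender_acc = accuracy(gender_true, gender_pred)
gender_uar = unweighted_average_recall(gender_true, gender_pred)
print(f"MAE={age_mae:.2f} years, ACC={gender_acc:.3f}, UAR={gender_uar:.3f}")
```

On imbalanced label sets, such as a small "child" class next to larger "female" and "male" classes, UAR can fall well below ACC because a poorly recognized minority class pulls down the per-class average. This is a common reason for reporting both.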