Abstract: This paper introduces a structured application of the One-Class approach and the One-Class-One-Network model for supervised classification tasks, specifically addressing a vowel phoneme classification case study within the Automatic Speech Recognition research field. Through pseudo-Neural Architecture Search and Hyper-Parameters Tuning experiments conducted with an informed grid-search methodology, we achieve classification accuracy comparable to today's complex architectures (90.0-93.7%). Despite its simplicity, our model prioritizes generalization of language context and distributed applicability, supported by relevant statistical and performance metrics. The experiment code is openly available on our GitHub.
Audio and Speech Processing, Artificial Intelligence, Computation and Language, Databases, Machine Learning, Sound
What problem does this paper attempt to address?
The paper sets out to design a neural network feature abstraction layer suitable for speech recognition tasks, using a simplified and efficient combination of Neural Architecture Search (NAS) and Hyper-Parameters Tuning (HPs-T). Specifically, it focuses on vowel phoneme classification, an important subtask within Automatic Speech Recognition (ASR) research.
### Research Background and Motivation
1. **Importance of Speech Recognition**:
- Vowel phonemes are the basic building blocks in language expression and play a crucial role in the comprehensibility of the language and the conveyance of emotions.
- Vowel phoneme classification matters in many applications, such as language learning, pronunciation assessment, dialectology, sociology, forensic speech recognition, assistive technology, emotion recognition, and even brain-computer interfaces.
2. **Limitations of Existing Methods**:
- Current state-of-the-art speech recognition technologies rely on complex machine-learning algorithms and signal-processing techniques. Although these methods have made significant progress in accuracy, they are highly complex and resource-intensive.
- Many existing datasets are insufficient in terms of sample size, audio quality, and the complexity of the covered speech, making it difficult to provide robust generalization solutions.
### Proposed Solutions
1. **OCON Model**:
- The OCON (One-Class-One-Network) model is a collection of parallel, distributed binary classifiers, each of which focuses on a simple speech recognition subtask.
- Through pseudo-NAS and hyperparameter tuning experiments, conducted with an informed grid-search methodology, the model achieves a classification accuracy (90.0%-93.7%) comparable to that of current complex architectures.
- The model emphasizes the generalization ability of the language context and the feasibility of distributed applications, and has been verified through relevant statistics and performance indicators.
2. **Feature Processing and Optimization**:
- The researchers used the HGCW dataset, which provides a higher level of speech complexity and contains pre-extracted formant data.
- The formant frequency trajectories were further refined, including linear normalization and min-max scaling, to enhance class separation.
- By introducing techniques such as Dropout, Batch Normalization, and L2 regularization, the model performance was gradually optimized, ultimately increasing the prediction accuracy and shortening the training time.
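The combination of min-max scaling and one binary classifier per class described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the data are synthetic clusters with made-up formant centers, each "network" is a single logistic unit standing in for a small neural network, and the names `min_max_scale`, `train_binary`, `train_ocon`, and `predict_ocon` are hypothetical.

```python
import numpy as np

def min_max_scale(x, lo=None, hi=None):
    """Scale each feature column to [0, 1]; pass lo/hi to reuse training ranges."""
    lo = x.min(axis=0) if lo is None else lo
    hi = x.max(axis=0) if hi is None else hi
    return (x - lo) / (hi - lo), lo, hi

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(x, y, lr=1.0, epochs=2000):
    """Fit one logistic unit (a stand-in for a small network) on 0/1 targets."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        err = sigmoid(x @ w + b) - y          # gradient of the log-loss w.r.t. the logit
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def train_ocon(x, labels, n_classes):
    """Train one independent 'this class vs. all others' classifier per class."""
    return [train_binary(x, (labels == c).astype(float)) for c in range(n_classes)]

def predict_ocon(models, x):
    """Pick the class whose dedicated classifier gives the highest score."""
    scores = np.stack([sigmoid(x @ w + b) for w, b in models], axis=1)
    return scores.argmax(axis=1)

# Synthetic (F1, F2) values in Hz for three hypothetical vowel classes.
rng = np.random.default_rng(0)
centers = np.array([[300.0, 2300.0], [700.0, 1200.0], [350.0, 900.0]])
x_raw = np.vstack([rng.normal(c, [40.0, 100.0], size=(50, 2)) for c in centers])
labels = np.repeat(np.arange(3), 50)

x_scaled, lo, hi = min_max_scale(x_raw)
models = train_ocon(x_scaled, labels, n_classes=3)
accuracy = float(np.mean(predict_ocon(models, x_scaled) == labels))
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

Because each classifier is trained independently of the others, the per-class training runs could in principle be placed on separate workers, which is the sense in which such an ensemble is parallel and distributed.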
### Main Contributions
- Proposed a simplified neural architecture search and hyperparameter tuning method suitable for speech recognition tasks.
- Verified the effectiveness of the OCON model in vowel phoneme classification, achieving an accuracy comparable to that of complex architectures.
- Emphasized the generalization ability and distributed application potential of the model, especially in an environment with limited computing resources.
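A plain grid search over a small hyperparameter grid, the general idea behind the tuning approach mentioned above, might look like the sketch below. The grid values, the synthetic binary task, and the helper names `fit` and `accuracy` are assumptions chosen for illustration; the paper's actual search space and models are not reproduced here.

```python
import itertools
import numpy as np

# Synthetic binary task ("target class" vs. "rest") in a 2-D scaled feature space.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.3, 0.1, (100, 2)), rng.normal(0.7, 0.1, (100, 2))])
y = np.concatenate([np.ones(100), np.zeros(100)])
order = rng.permutation(len(y))
train, val = order[:150], order[150:]          # simple hold-out split

def fit(x, y, lr, l2, epochs=300):
    """Logistic unit trained by gradient descent with an L2 penalty on the weights."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, x, y):
    """Fraction of samples whose thresholded logit matches the 0/1 label."""
    return float(np.mean(((x @ w + b) > 0) == (y == 1)))

# Hypothetical grid: every combination is trained and scored on the validation split.
grid = {"lr": [0.1, 0.5, 1.0], "l2": [0.0, 0.01, 0.1]}
results = []
for lr, l2 in itertools.product(grid["lr"], grid["l2"]):
    w, b = fit(x[train], y[train], lr, l2)
    results.append({"lr": lr, "l2": l2, "val_acc": accuracy(w, b, x[val], y[val])})

best = max(results, key=lambda r: r["val_acc"])
print("best configuration:", best)
```

An "informed" search would narrow or reorder this grid using results from earlier runs instead of evaluating every combination blindly; that refinement is left out of the sketch.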
### Conclusions and Future Work
- The research shows that larger datasets or models do not necessarily bring better accuracy, and simplified methods can also achieve good generalization effects.
- Future research can further explore optimization methods for label selection and consider introducing training-guaranteed scaling coefficients to improve the reliability of output probabilities.
- Future work can also expand the dataset sources to include TIMIT, UCLAPhoneticsSet, and AudioSet, among others, to verify the wide applicability of the model.
Through these efforts, the researchers aim to provide an efficient and easy-to-implement solution for the field of speech recognition and promote broader academic and technical applications.