M. Nanmalar, S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan
Abstract: In ideal human-computer interaction (HCI), the colloquial form of a language would be preferred by most users, since it is the form used in their day-to-day conversations. However, there is also an undeniable necessity to preserve the formal literary form. By embracing the new and preserving the old, both service to the common man (practicality) and service to the language itself (conservation) can be rendered. Hence, it is ideal for computers to have the ability to accept, process, and converse in both forms of the language, as required. To address this, it is first necessary to identify the form of the input speech, which in the current work is between literary and colloquial Tamil speech. Such a front-end system must consist of a simple, effective, and lightweight classifier that is trained on a few effective features that are capable of capturing the underlying patterns of the speech signal. To accomplish this, a one-dimensional convolutional neural network (1D-CNN) that learns the envelope of features across time is proposed. The network is trained on a select number of handcrafted features initially, and then on Mel frequency cepstral coefficients (MFCC) for comparison. The handcrafted features were selected to address various aspects of speech such as the spectral and temporal characteristics, prosody, and voice quality. The features are initially analyzed by considering ten parallel utterances and observing the trend of each feature with respect to time. The proposed 1D-CNN, trained using the handcrafted features, offers an F1 score of 0.9803, while that trained on the MFCC offers an F1 score of 0.9895. In light of this, feature ablation and feature combination are explored. When the best ranked handcrafted features, from the feature ablation study, are combined with the MFCC, they offer the best results with an F1 score of 0.9946.
What problem does this paper attempt to address?
This paper addresses the problem of distinguishing between two forms of spoken Tamil: Literary Tamil (LT) and Colloquial Tamil (CT). The two forms differ significantly in vocabulary and acoustic characteristics, with CT used in daily conversation and LT in formal contexts. To classify them, the authors propose a feature-engineering approach built on a one-dimensional convolutional neural network (1D-CNN).
### Specific Problem Description
1. **Identification of Language Forms**:
- In an ideal human-computer interaction (HCI) system, most users would prefer to communicate in the colloquial form. However, for a language as long-standing and rich as Tamil, preserving the literary form is also important.
- To balance practicality and language preservation, the computer must be able to accept and process both forms. The first step is therefore a front-end system that identifies whether the input speech is Literary Tamil or Colloquial Tamil.
2. **Effective Feature Extraction**:
- To build such a classifier, features that can effectively capture the underlying patterns of speech signals must be selected. These features should be able to reflect the differences between Literary Tamil and Colloquial Tamil.
- The authors propose using handcrafted features and learning how each feature evolves over time (its envelope) with a one-dimensional convolutional neural network (1D-CNN).
3. **Performance Optimization**:
- To improve classifier performance, the authors train the network on handcrafted features, then on Mel frequency cepstral coefficients (MFCC) for comparison, and finally explore combining the two feature sets.
- Through a feature ablation study, the authors quantify and rank the contribution of each handcrafted feature, and use the ranking to choose which features to combine with the MFCC.
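The ablation idea in the last point can be sketched as a leave-one-out loop: score the full feature set, remove one feature at a time, and rank features by how much the score drops. The following is a minimal illustration, not the authors' procedure. The function name `rank_by_ablation`, the `toy_evaluate` scorer, and the numbers in `TOY_WEIGHTS` are all hypothetical; in practice `evaluate` would retrain and test the classifier on the given subset and return its F1 score.

```python
from typing import Callable, List, Sequence, Tuple


def rank_by_ablation(
    features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> List[Tuple[str, float]]:
    """Rank features by the score drop caused by removing each one."""
    baseline = evaluate(list(features))
    drops = {}
    for name in features:
        # Leave one feature out and re-score the remaining subset.
        reduced = [f for f in features if f != name]
        drops[name] = baseline - evaluate(reduced)
    # Largest drop first: removing that feature hurt the most.
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)


# Made-up contributions, used only so the sketch runs without a model.
# These are NOT results from the paper.
TOY_WEIGHTS = {"f0": 0.03, "energy": 0.01, "zcr": 0.02}


def toy_evaluate(subset: Sequence[str]) -> float:
    """Hypothetical stand-in for 'train the classifier, return its F1'."""
    return 0.90 + sum(TOY_WEIGHTS[f] for f in subset)


ranking = rank_by_ablation(list(TOY_WEIGHTS), toy_evaluate)
for name, drop in ranking:
    print(f"{name}: score drop {drop:.3f}")
```

Because each feature is removed in isolation, this kind of ranking ignores interactions between features, which is one reason a separate feature-combination experiment is still informative.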
### Solutions
- **Feature Selection**:
- Handcrafted features include: fundamental frequency (F0), energy, voicing probability, jitter, derivative of jitter, shimmer, harmonic-to-noise ratio (HNR), spectral flux, psychoacoustic sharpness, and zero-crossing rate.
- **Model Architecture**:
- A one-dimensional convolutional neural network (1D-CNN) learns the temporal pattern of these features. A 1D-CNN can learn complex patterns directly from one-dimensional sequences and has low computational complexity, which makes it suitable for real-time applications.
- **Experimental Results**:
- The 1D-CNN trained on handcrafted features achieved an F1 score of 0.9803, and the 1D-CNN trained on MFCC features achieved an F1 score of 0.9895. Combining the best-ranked handcrafted features with the MFCC features improved the F1 score further, to 0.9946.
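To make the model-architecture point concrete, the core operation of a 1D-CNN can be sketched as a single filter sliding along the time axis of a frame-by-feature matrix, followed by a ReLU and global max pooling. This is only an illustration of the operation in plain Python. The summary does not give the paper's layer count, kernel sizes, or pooling choices, so the toy input, the kernel, and the helper names `conv1d_valid`, `relu`, and `global_max_pool` are assumptions made for this example.

```python
from typing import List

Matrix = List[List[float]]  # shape: (time_steps, n_features)


def conv1d_valid(x: Matrix, kernel: Matrix, bias: float = 0.0) -> List[float]:
    """Slide one filter of shape (width, n_features) along the time axis.

    No padding is applied, so the output has len(x) - width + 1 values.
    """
    width = len(kernel)
    out = []
    for t in range(len(x) - width + 1):
        acc = bias
        for dt in range(width):
            for c, w in enumerate(kernel[dt]):
                acc += w * x[t + dt][c]
        out.append(acc)
    return out


def relu(values: List[float]) -> List[float]:
    """Zero out negative activations."""
    return [max(0.0, v) for v in values]


def global_max_pool(values: List[float]) -> float:
    """Summarise a variable-length activation sequence as one number."""
    return max(values)


# Toy input: 5 frames, 2 features per frame (values are illustrative).
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# A width-2 filter that responds when the first feature rises between frames.
kernel = [[-1.0, 0.0], [1.0, 0.0]]

activation = relu(conv1d_valid(frames, kernel))
summary = global_max_pool(activation)
print(activation, summary)
```

A trained network would learn many such filters and stack the resulting layers. The filter here is hand-set only to show how a convolution over time picks up the trend of a feature rather than its value in a single frame.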
With these methods, the paper reports highly accurate classification of Literary Tamil and Colloquial Tamil speech, and offers a useful reference point for future work on identifying language forms.