Comparative analysis of CNN & RNN for voice pathology detection (Preprint)
Sidra Syed
DOI: https://doi.org/10.2196/preprints.27123
2021-01-12
Abstract:
BACKGROUND Background: Diagnosis based on computerized acoustic examination may play an important role in the early detection and monitoring of voice disorders and in making pathological speech diagnostics more effective. Various acoustic metrics are used to assess the health of the voice, and the precision of these parameters also depends on the algorithms used to detect noise in speech. The idea is to detect disease pathology from the voice. First, we apply feature extraction to the SVD dataset. The extracted features are then fed into two 27-layer neural networks, one convolutional and one recurrent. Result: After dividing the dataset into training and testing sets, the accuracies of the CNN and RNN under 10-fold cross-validation were 87.11% and 86.52%, respectively. Software code was written in Python using the TensorFlow module on a Linux workstation with one NVIDIA Titan X GPU. Conclusion: The experiments were carried out on subjects belonging to a few diseases chosen randomly from the standard database. The voices are categorized as normal or pathological. A machine learning approach to detecting voice disorders from acoustic metrics is a promising non-invasive method.
OBJECTIVE To detect voice pathologies through deep learning methods and compare their results.
METHODS SVD stands for Saarbrücken Voice Database. Table 1 presents the characteristics of the SVD dataset. SVD is a publicly available collection of voice recordings from over 2000 people. Each recording session contains the following:
1) Recordings of the vowels [i, a, u] produced at normal, high and low pitch.
2) Recordings of the vowels [i, a, u] with increasing pitch.
3) A recording of the phrase "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?").
The voice signal and the EGG signal for each of these components are stored in individual files [10]. The database also includes a text file with all relevant information about the dataset. These characteristics make it a good choice for experimenters. All SVD recordings were sampled at 50 kHz with 16-bit resolution. Depending on recording quality, some recording sessions do not include every vowel in every version. The database is accessible through the web interface of the "Saarbruecken Voice Server", which consists of several pages for choosing the database query parameters, playing recordings directly, and selecting the recording session files to be exported once the desired parameters have been chosen [11]. From the SVD database we selected the following diseases: 'Balbuties', 'Dysphonie', 'Frontolaterale Teilresektion', 'Funktionelle Dysphonie', 'Vox senilis', 'Zentral-laryngale Bewegungsstörung', 'Reinke Ödem', 'Stimmlippenpolyp', 'Stimmlippenkarzinom', 'Spasmodische Dysphonie', 'Psychogene Dysphonie', and 'Leukoplakie' [10].
Table 1. Characteristics of SVD dataset
3.2. Feature: The features extracted from the samples for this study are 13 MFCC features, pitch, rolloff, ZCR, energy entropy, spectral flux, spectral centroid and energy.
3.2.1. Mel-Frequency Cepstral Coefficients (MFCC): MFCCs, proposed by Davis and Mermelstein in 1980, are among the most widely used features in speech recognition [19]. The extraction procedure for MFCC features consists of windowing the signal, applying the DFT, taking the log of the magnitude, warping the frequencies onto the Mel scale, and then applying the inverse DCT. The cepstral coefficients normally contain information from a single frame only and are therefore considered static features. The first and second derivatives of the cepstral coefficients carry additional information on the temporal dynamics of the signal [20].
3.2.2.
Pitch: Pitch corresponds to the rate at which the vocal cords vibrate during voiced sounds. Standard approaches such as the autocorrelation method and the average magnitude difference function (AMDF) are prone to errors during pitch extraction, resulting in pitch-halving and pitch-doubling errors. The cepstrum method can estimate the pitch by separating the cepstrum of the glottal pulse from that of the vocal tract. At the cost of more complex computation, it achieves high detection accuracy for normal voice signals [18].
3.3. Neural Networks:
3.3.1. CNN architecture: A CNN has several hierarchical levels composed of convolutional layers and pooling layers, each characterized by a set of feature maps. In general, a CNN begins with a convolutional layer that accepts the input data. This layer is responsible for convolution operations with a small number of filter maps of the same dimension. Its output is passed to the sampling (pooling) layer, which reduces the size of the data seen by the following layers. Locally connected CNNs underlie a wide variety of deep learning techniques. These networks are implemented on GPU architectures with several hundred cores. Feature maps are assigned to blocks based on information from the previous layer, depending on the dimensions of the maps [21]. Each block contains many threads, and each thread is bound to a single neuron. The convolution, induction and summation for each neuron are carried out during the remainder of the procedure. Finally, the outputs of these processes are stored in global memory. A backward-propagation model is adopted for efficient processing of the results. However, a single propagation pass would not yield good results, so pulling or pushing operations are used to achieve parallel propagation. In addition, the neurons of a single layer connect to differing numbers of neurons, which gives rise to boundary effects [23].
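The convolution-then-sampling pattern described above can be sketched in a few lines of plain Python. This is only an illustration of the two operations, not the proposed 27-layer model: the input values, the filter and the pool size are hypothetical, and ReLU is assumed as the activation function.

```python
from typing import List


def conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Valid-mode 1-D cross-correlation, the operation CNN layers call convolution."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values: List[float]) -> List[float]:
    """Element-wise rectified linear activation (an assumed choice)."""
    return [max(0.0, v) for v in values]


def max_pool1d(values: List[float], size: int) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]


# Hypothetical input: a short feature vector and a hand-picked difference filter.
features = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 1.0, 4.0]
kernel = [-1.0, 0.0, 1.0]

feature_map = relu(conv1d(features, kernel))  # convolution layer output
pooled = max_pool1d(feature_map, size=2)      # sampling layer halves the length
print(feature_map, pooled)
```

The pooling step is what reduces the size of the data passed to the following layers: six feature-map values become three.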
Figure 1 shows the general architecture of a CNN and illustrates how this deep learning network operates. A deep learning workflow includes input preprocessing, training of the deep learning model, storage of the trained model, and finally deployment of the model. Of these phases, the most computationally (or data) intensive is training the deep learning model (defining and running it). The model is given an input, which passes through the neural network to produce an output at the given step (the forward pass). If the output is inappropriate or inaccurate, the weights are adjusted (the backward pass). At its simplest this resembles a matrix multiplication, in which the inputs (a row of the first matrix) are multiplied by the weights (a column of the second matrix) to produce each output element. Serial (CPU-based) systems are typically not feasible for higher-order matrices (large inputs and weights). Fortunately, general-purpose graphics processing units (GPUs) offer much better options than conventional single-CPU or CPU-cluster systems [22].
Figure 1. Architecture of CNN [21]
3.3.2. RNN architecture: Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. LSTM RNNs have been reported to be more effective than DNNs and standard approaches to acoustic modelling for moderately sized models trained on a single machine. A two-layer deep LSTM RNN in which each LSTM layer has a linear recurrent projection layer has the potential to reach state-of-the-art speech recognition performance. Figure 2 shows the general architecture of the LSTM RNN and the workflow of the model. This architecture makes more effective use of model parameters than the alternatives, converges quickly, and outperforms a deep feed-forward neural network with an order of magnitude more parameters.
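As a rough illustration of the LSTM recurrence discussed above, the sketch below implements one time step of a standard LSTM cell in plain Python. It is a simplification under stated assumptions: biases, peephole connections and the linear recurrent projection layer are omitted, and the weight names (`W_i`, `W_f`, `W_o`, `W_g`), sizes and values are hypothetical rather than taken from the proposed model.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Params = Dict[str, Matrix]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W: Matrix, v: Vector) -> Vector:
    """Dense matrix-vector product for small list-based matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Params) -> Tuple[Vector, Vector]:
    """One time step of a standard LSTM cell (no biases, peepholes or projection)."""
    z = x + h_prev  # concatenate current input and previous hidden state
    i = [sigmoid(a) for a in matvec(params["W_i"], z)]    # input gate
    f = [sigmoid(a) for a in matvec(params["W_f"], z)]    # forget gate
    o = [sigmoid(a) for a in matvec(params["W_o"], z)]    # output gate
    g = [math.tanh(a) for a in matvec(params["W_g"], z)]  # candidate cell values
    # The cell state keeps part of the old memory and adds gated new content.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def constant_params(input_size: int, hidden_size: int, value: float) -> Params:
    """Hypothetical helper: every gate gets the same constant weight matrix."""
    row = [value] * (input_size + hidden_size)
    return {name: [list(row) for _ in range(hidden_size)]
            for name in ("W_i", "W_f", "W_o", "W_g")}


# Hypothetical sequence of three 2-dimensional feature frames.
sequence = [[0.5, -0.1], [0.3, 0.2], [-0.4, 0.9]]
params = constant_params(input_size=2, hidden_size=3, value=0.1)

h, c = [0.0] * 3, [0.0] * 3
for frame in sequence:
    h, c = lstm_step(frame, h, c, params)  # state carries context forward in time
print(h, c)
```

The cell state `c` is the path along which long-range dependencies are carried, since the forget gate can keep it almost unchanged from one step to the next.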
Speech is a complex time-varying signal with complex correlations across a range of timescales. Recurrent neural networks (RNNs) contain cyclic connections that make them more powerful than feed-forward neural networks for modelling such sequence data. RNNs have been very effective in sequence labelling and prediction tasks such as handwriting recognition and language modelling [24].
Figure 2. Architecture of LSTM-RNN [24]
The key difference between a CNN and an RNN is the RNN's ability to process temporal information, that is, data that arrives in sequence, such as the words of a sentence. Convolutional and recurrent neural networks are used for quite different purposes, and the architectures themselves differ to suit these use cases. CNNs use filters within convolutional layers to transform the data, whereas RNNs reuse activations from earlier points in the sequence to produce the next output in the sequence. Although this is a frequently discussed question, the distinction between CNN and RNN becomes clear once one examines the structure of the networks and what they are used for.
4. Experiments and evaluation:
4.1. Proposed Model layer of CNN and RNN: Figures 3 and 4 show the internal layer diagrams of the proposed CNN and RNN models. In the proposed methodology, both the CNN and the RNN are 27-layer architectures with different bias values.
RESULTS The idea is to detect disease pathology from the voice. First, we apply feature extraction to the SVD dataset. The features extracted in the proposed methodology are 13 MFCC features, pitch, rolloff, ZCR, energy entropy, spectral flux, spectral centroid and energy. After feature extraction, the features are fed into the 27-layer convolutional and recurrent neural networks.
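Four of the frame-level features listed above (ZCR, energy, spectral centroid and rolloff) can be sketched as follows. This is a minimal dependency-free illustration, not the extraction code used in the study: the frame length, the 85% rolloff fraction and the synthetic 5 kHz test tone are assumptions, and only the 50 kHz sampling rate comes from the dataset description. MFCC, pitch, energy entropy and spectral flux are not covered here.

```python
import cmath
import math
from typing import List


def zero_crossing_rate(frame: List[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame: List[float]) -> float:
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def magnitude_spectrum(frame: List[float]) -> List[float]:
    """Naive O(N^2) DFT magnitudes for bins 0..N/2 (fine for short frames)."""
    n = len(frame)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(mags: List[float], sample_rate: float, n: int) -> float:
    """Magnitude-weighted mean frequency in Hz."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total


def spectral_rolloff(mags: List[float], sample_rate: float, n: int,
                     fraction: float = 0.85) -> float:
    """Lowest frequency below which `fraction` of the total magnitude lies."""
    threshold = fraction * sum(mags)
    cumulative = 0.0
    for k, m in enumerate(mags):
        cumulative += m
        if cumulative >= threshold:
            return k * sample_rate / n
    return (len(mags) - 1) * sample_rate / n


# Synthetic stand-in for one frame of a recording: a 5 kHz tone at 50 kHz.
sample_rate = 50000
frame = [math.sin(2 * math.pi * 5000 * t / sample_rate) for t in range(100)]

mags = magnitude_spectrum(frame)
zcr = zero_crossing_rate(frame)
energy = short_time_energy(frame)
centroid = spectral_centroid(mags, sample_rate, len(frame))
rolloff = spectral_rolloff(mags, sample_rate, len(frame))
print(zcr, energy, centroid, rolloff)
```

For a pure tone, the centroid and the rolloff both fall on the tone frequency; real voice frames spread their energy over many bins, which is what makes these features informative.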
We divided the dataset into training and testing sets; under 10-fold cross-validation, the accuracies of the CNN and RNN reported in Table 2 are 87.11% and 86.52%, respectively. The convolutional kernels consisted of 7 residual layers, totaling 27 convolutional layers. Dropout with a keep probability of 0.5 and L2 normalization were used. Software code was written in Python using the TensorFlow module on a Linux workstation with one NVIDIA Titan X GPU. Figures 5 and 6 show the accuracy and error curves for the training and testing phases. For the RNN the training and testing curves nearly coincide, which indicates a very small error margin, whereas for the CNN there is a visible gap between the curves, which suggests that the error margin of the proposed CNN is higher than that of the RNN. Figures 7 and 8 show the confusion matrices, whose values give the number of correct diagnoses made by the system.
CONCLUSIONS Using 13 MFCC features, pitch, rolloff, ZCR, energy entropy, spectral flux, spectral centroid and energy extracted from the SVD dataset as input, the 27-layer CNN and RNN reached accuracies of 87.11% and 86.52%, respectively, under 10-fold cross-validation (Table 2).
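The 10-fold cross-validation used for performance evaluation can be sketched as below. This is a generic, dependency-free illustration rather than the evaluation code of the study: the function names, the shuffling seed and the dummy scoring function are hypothetical, and a real run would train and test the CNN or RNN inside `evaluate`.

```python
import random
from typing import Callable, Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples: int,
                   evaluate: Callable[[List[int], List[int]], float],
                   k: int = 10) -> float:
    """Average the score returned by `evaluate` over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


def dummy_evaluate(train: List[int], test: List[int]) -> float:
    """Hypothetical stand-in for 'train on train, return accuracy on test'."""
    return len(train) / (len(train) + len(test))


mean_score = cross_validate(n_samples=200, evaluate=dummy_evaluate, k=10)
print(mean_score)
```

Every sample is used for testing exactly once and for training in the other nine folds, so the averaged score depends less on any single train/test split.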