Abstract: When recorders are used to survey acoustically conspicuous species, identifying calls of the target species in recordings is essential for estimating density and abundance. We investigate how well deep neural networks identify vocalisations consisting of phrases of varying lengths, each containing a variable number of syllables. We use recordings of Hainan gibbon (Nomascus hainanus) vocalisations to develop and test the methods. We propose two methods for exploiting the two-level structure of such data. The first combines convolutional neural network (CNN) models with a hidden Markov model (HMM), and the second uses a convolutional recurrent neural network (CRNN). Both models learn acoustic features of syllables via a CNN and temporal correlations of syllables into phrases via either an HMM or a recurrent network. We compare their performance to that of the commonly used CNNs LeNet and VGGNet, and of a support vector machine (SVM). We also propose a dynamic programming method to evaluate how well phrases are predicted. This is useful for evaluating performance when vocalisations are labelled by phrases, not syllables. Our methods perform substantially better than the commonly used methods when applied to the gibbon acoustic recordings. The CRNN has an F-score of 90% on phrase prediction, which is 18% higher than the best of the SVM, LeNet and VGGNet methods. HMM post-processing raises the F-score of these last three methods to as much as 87%. The number of phrases is overestimated by the CNNs and the SVM, leading to error rates between 49% and 54%. With HMM post-processing, these error rates can be reduced to as low as 0.4%. Similarly, the error rate of the CRNN's predictions is no more than 0.5%. CRNNs are better at identifying phrases of varying lengths composed of a varying number of syllables than simpler CNN or SVM models. We find a CRNN model to be best at this task, with a CNN combined with an HMM performing almost as well.
We recommend that these kinds of models be used for species whose vocalisations are structured into phrases of varying lengths.
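
To illustrate why HMM post-processing can reduce the overestimation of phrase counts, the following is a minimal sketch and not the authors' implementation. It assumes a two-state HMM (background and call), treats hypothetical per-segment classifier probabilities as emission scores, and uses an illustrative self-transition probability `p_stay` that is not taken from the paper. The function names `viterbi_smooth` and `count_phrases` are invented for this example.

```python
import numpy as np


def viterbi_smooth(p_call, p_stay=0.9, prior_call=0.5):
    """Return the most likely 0/1 state path for per-segment call probabilities.

    Assumption: classifier outputs are used directly as emission scores and
    the transition matrix is symmetric with self-transition p_stay.
    """
    p = np.clip(np.asarray(p_call, dtype=float), 1e-6, 1 - 1e-6)
    n = len(p)
    if n == 0:
        return np.zeros(0, dtype=int)

    log_emit = np.log(np.stack([1 - p, p], axis=1))           # shape (n, 2)
    log_trans = np.log(np.array([[p_stay, 1 - p_stay],
                                 [1 - p_stay, p_stay]]))      # [from, to]
    log_prior = np.log(np.array([1 - prior_call, prior_call]))

    score = np.empty((n, 2))
    back = np.zeros((n, 2), dtype=int)
    score[0] = log_prior + log_emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans              # cand[from, to]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]

    # Backtrack from the best final state.
    path = np.empty(n, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


def count_phrases(states):
    """Count maximal runs of 1s, each run being one predicted phrase."""
    s = np.asarray(states, dtype=int)
    if s.size == 0:
        return 0
    starts = (s[1:] == 1) & (s[:-1] == 0)
    return int(s[0] == 1) + int(starts.sum())


# Hypothetical classifier outputs: one call with a brief dip in confidence.
probs = np.array([0.1, 0.2, 0.9, 0.8, 0.3, 0.9, 0.85, 0.1, 0.1])
raw_count = count_phrases(probs > 0.5)
smoothed_count = count_phrases(viterbi_smooth(probs))
```

On this toy sequence, thresholding at 0.5 splits the call into two phrases at the low-confidence segment, whereas the smoothed path keeps it as a single phrase. In practice the transition and emission parameters would be estimated from labelled data rather than fixed by hand.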

Automatic Prosodic Structure Labeling using DNN-BGRU-CRF Hybrid Neural Network.

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

Prosodic Structure Prediction Using Deep Self-attention Neural Network

Multi-Task Learning for Prosodic Structure Generation Using BLSTM RNN with Structured Output Layer.

Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features

Automatic Prosodic Boundary Labeling Based on Fusing the Silence Duration with the Lexical Features

Automated Call Detection for Acoustic Surveys with Structured Calls of Varying Length

Automatic Phrase Boundary Labeling for Mandarin TTS Corpus Using Context-Dependent HMM.

Unsupervised Prosodic Labeling Of Speech Synthesis Databases Using Context-Dependent HMMs

Gated Recurrent Units Based Hybrid Acoustic Models for Robust Speech Recognition

Integrating Prosodic Information Into Recurrent Neural Network Language Model For Speech Recognition

Acoustic and Linguistic Information Based Chinese Prosodic Boundary Labelling

Learning Prosodic Patterns for Mandarin Speech Synthesis

BLSTM-CRF Based End-To-End Prosodic Boundary Prediction With Context Sensitive Embeddings In A Text-To-Speech Front-End

Parsing Hierarchical Prosodic Structure For Mandarin Speech Synthesis

Unsupervised word-level prosody tagging for controllable speech synthesis

HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS

A Maximum Entropy Based Hierarchical Model for Automatic Prosodic Boundary Labeling in Mandarin

Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks

Prosody Model for Mandarin Text-to-Speech System

Advanced Recurrent Network-Based Hybrid Acoustic Models for Low Resource Speech Recognition