Abstract: Audio sentiment analysis is a popular research area that extends conventional text-based sentiment analysis and depends on the effectiveness of acoustic features extracted from speech. However, current work on audio sentiment analysis mainly focuses on extracting homogeneous acoustic features, or does not fuse heterogeneous features effectively. In this paper, we propose an utterance-based deep neural network model that combines a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) based network in parallel to obtain representative features, termed the Audio Sentiment Vector (ASV), that maximally reflect the sentiment information in audio. Specifically, our model is trained with utterance-level labels, and the ASV is extracted and fused from two branches. In the CNN branch, spectrum graphs produced from the signals are fed as inputs, while in the LSTM branch, the inputs include spectral features and cepstral coefficients extracted from dependent utterances in the audio. In addition, a Bidirectional Long Short-Term Memory (BiLSTM) network with an attention mechanism is used for feature fusion. Extensive experiments show that our model recognizes audio sentiment precisely and quickly, and demonstrate that our ASV is better than traditional acoustic features or vectors extracted from other deep learning models. Furthermore, experimental results indicate that the proposed model outperforms the state-of-the-art approach by 9.33% on the Multimodal Opinion-level Sentiment Intensity (MOSI) dataset.
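The fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each utterance already has one vector from the CNN branch and one from the LSTM branch, joins them by simple concatenation, and replaces the BiLSTM-with-attention fusion with plain dot-product attention pooling over utterances. The names `concat_branches`, `attention_pool`, and `query` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def concat_branches(cnn_vec: Sequence[float], lstm_vec: Sequence[float]) -> List[float]:
    """Join the two per-utterance branch outputs (assumption: plain concatenation)."""
    return list(cnn_vec) + list(lstm_vec)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    utterance_vecs: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Weight each utterance vector by softmax(dot(query, vec)) and sum them.

    `query` stands in for a learned attention vector (hypothetical here).
    Returns the pooled vector and the attention weights.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in utterance_vecs]
    weights = softmax(scores)
    dim = len(utterance_vecs[0])
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, utterance_vecs))
        for i in range(dim)
    ]
    return pooled, weights
```

For example, given two utterances with CNN outputs `[[0.2, 0.1], [0.9, 0.4]]` and LSTM outputs `[[0.5], [0.3]]`, each fused vector has three components, and `attention_pool(fused, query=[1.0, 0.0, 0.0])` gives the second utterance the larger weight because its first component is larger. In a trained model the query and the branch outputs would be learned rather than fixed by hand.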

A Combined Feature Approach for Speaker Segmentation Using Convolution Neural Network

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Multi-feature Combination for Speaker Recognition

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Auditory Model Based Speech Feature Extraction and Its Application to Speaker Identification

Multi-speaker Segmentation and Clustering of Telephone Speech

A Feature Integration Network for Multi-Channel Speech Enhancement

Auditory Features For The Close Talk Speech Enhancement With Parameter Masks

Speaker Verification based on Single Channel Speech Separation

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer

Fusion of deep shallow features and models for speaker recognition

Bidirectional Multiscale Feature Aggregation for Speaker Verification

Robust Front-End for Speech Recognition Based on Computational Auditory Scene Analysis and Speaker Model

Multi-level Fusion of Audio and Visual Features for Speaker Identification

Combining Speech Enhancement and Discriminative Feature Extraction for Robust Speaker Recognition

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Using Phoneme Recognition and Text-Dependent Speaker Verification to Improve Speaker Segmentation for Chinese Speech

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

Speaker verification using attentive multi-scale convolutional recurrent network

Combining Information from Multi-Stream Features Using Deep Neural Network in Speech Recognition