Abstract: Audio sentiment analysis is a popular research area that extends conventional text-based sentiment analysis and depends on the effectiveness of acoustic features extracted from speech. However, current work on audio sentiment analysis mainly focuses on extracting homogeneous acoustic features, or does not fuse heterogeneous features effectively. In this paper, we propose an utterance-based deep neural network model that combines a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) based network in parallel to obtain representative features, termed the Audio Sentiment Vector (ASV), that maximally reflect the sentiment information in an audio recording. Specifically, our model is trained with utterance-level labels, and the ASV is extracted and fused from two branches. In the CNN branch, spectrum graphs produced from the signals are fed as inputs, while in the LSTM branch, the inputs include spectral features and cepstrum coefficients extracted from dependent utterances in the audio. In addition, a Bidirectional Long Short-Term Memory (BiLSTM) network with an attention mechanism is used for feature fusion. Extensive experiments show that our model recognizes audio sentiment precisely and quickly, and demonstrate that our ASV is better than traditional acoustic features or vectors extracted from other deep learning models. Furthermore, experimental results indicate that the proposed model outperforms the state-of-the-art approach by 9.33% on the Multimodal Opinion-level Sentiment Intensity (MOSI) dataset.
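
The fusion idea described in the abstract can be illustrated with a small sketch. This is not the authors' model: the abstract gives no dimensions, scoring function, or training details, so everything below is an assumption made for illustration. It assumes each utterance already has one vector from the CNN branch and one from the LSTM branch, concatenates them, and pools the per-utterance vectors with plain dot-product attention against a hypothetical `query` vector. The BiLSTM that the abstract places before the attention step is omitted, and the names `fuse_branches` and `attention_pool` and all numeric values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(cnn_vec, lstm_vec):
    """Concatenate the two branch outputs for one utterance.

    Assumption: simple concatenation stands in for the fusion input.
    """
    return list(cnn_vec) + list(lstm_vec)


def attention_pool(vectors, query):
    """Dot-product attention pooling.

    Each vector is scored by its dot product with `query`; the scores are
    normalized with softmax and used as weights in a weighted sum.
    """
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


# Toy branch outputs for three utterances (hypothetical values).
cnn_out = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.3]]
lstm_out = [[0.5], [0.7], [0.1]]

fused = [fuse_branches(c, l) for c, l in zip(cnn_out, lstm_out)]
query = [1.0, 0.5, 1.0]  # hypothetical; would be a learned parameter in practice
asv, weights = attention_pool(fused, query)
```

Here `asv` plays the role of a single pooled vector and `weights` shows how much each utterance contributed. In a trained network the query and both branch encoders would be learned jointly from the utterance-level labels.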

Audio Recognition using Mel Spectrograms and Convolution Neural Networks

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Spectral and Rhythm Features for Audio Classification with Deep Convolutional Neural Networks

Robust sound event classification using deep neural networks

Temporal Coding of Local Spectrogram Features for Robust Sound Recognition

A CNN Sound Classification Mechanism Using Data Augmentation

Robust Audio Sensing with Multi-Sound Classification

A convolutional neural network approach for acoustic scene classification

Deep Learning Approaches for Understanding Simple Speech Commands

MelCochleaGram-DeepCNN: Sequentially Fused Spectrogram and the DeepCNN Classifiers-based Audio Spoof Detection System

Audio-Based Music Classification with DenseNet And Data Augmentation

Environmental Sound Recognition Based on Double-input Convolutional Neural Network Model

Using audio content and emotional response to predict soundscape perception through machine learning

Spectral images based environmental sound classification using CNN with meaningful data augmentation

Rethinking environmental sound classification using convolutional neural networks: optimized parameter tuning of single feature extraction

Real-Time Vehicle Sound Detection System Based on Depthwise Separable Convolution Neural Network and Spectrogram Augmentation

Advanced Framework for Animal Sound Classification With Features Optimization

High-Level CNN and Machine Learning Methods for Speaker Recognition

Bridging auditory perception and natural language processing with semantically informed deep neural networks

Audio segmentation based on melodic style with hand-crafted features and with convolutional neural networks