Abstract: Audio sentiment analysis is a popular research area that extends text-based sentiment analysis and depends on the effectiveness of acoustic features extracted from speech. However, current work on audio sentiment analysis mainly focuses on extracting homogeneous acoustic features or does not fuse heterogeneous features effectively. In this paper, we propose an utterance-based deep neural network model, built as a parallel combination of a CNN-based and an LSTM-based network, to obtain representative features, termed the Audio Sentiment Vector (ASV), that maximally reflect the sentiment information in an audio clip. Specifically, our model is trained with utterance-level labels, and the ASV is extracted and fused from the two branches. In the CNN branch, spectrum graphs produced from the signals are fed as inputs, while in the LSTM branch, the inputs include spectral centroid, MFCC, and other recognized traditional acoustic features extracted from dependent utterances in an audio clip. In addition, a BiLSTM with an attention mechanism is used for feature fusion. Extensive experiments show that our model recognizes audio sentiment precisely and that the ASV outperforms traditional acoustic features and vectors extracted from other deep learning models. Furthermore, experimental results indicate that the proposed model outperforms state-of-the-art approaches by 9.33% on the MOSI dataset.
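
The fusion idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it leaves out the CNN, LSTM, and BiLSTM layers entirely and only shows the general pattern of combining two per-utterance feature vectors and pooling them with attention weights. The function names (`fuse_branches`, `attention_pool`), the concatenation step, the dot-product scoring, and the fixed `query` vector are all assumptions made for illustration; in a trained model the query would be learned.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(cnn_feat: Sequence[float], lstm_feat: Sequence[float]) -> Vector:
    """Concatenate the two branch outputs for one utterance.

    Concatenation is an assumed stand-in for the paper's fusion step.
    """
    return list(cnn_feat) + list(lstm_feat)


def attention_pool(utterances: Sequence[Vector], query: Sequence[float]) -> Vector:
    """Return the attention-weighted sum of utterance vectors.

    Each utterance is scored by its dot product with `query` (hypothetical;
    a real model would learn this), and the scores are normalised with softmax.
    """
    scores = [sum(u_i * q_i for u_i, q_i in zip(u, query)) for u in utterances]
    weights = softmax(scores)
    dim = len(utterances[0])
    return [sum(w * u[d] for w, u in zip(weights, utterances)) for d in range(dim)]


# Toy per-utterance features standing in for the two branches' outputs.
cnn_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
lstm_feats = [[0.2], [0.4], [0.6]]
fused = [fuse_branches(c, l) for c, l in zip(cnn_feats, lstm_feats)]

# A zero query gives uniform weights, so pooling reduces to a plain average.
uniform_pooled = attention_pool(fused, [0.0, 0.0, 0.0])

# A query aligned with the first dimension emphasises the first utterance.
focused_pooled = attention_pool(fused, [10.0, 0.0, 0.0])
```

With a zero query every utterance receives the same weight, so the pooled vector is the mean of the fused vectors; a non-zero query shifts the weight toward utterances whose features align with it, which is the effect an attention mechanism is meant to provide.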

Weakly Labelled AudioSet Tagging With Attention Neural Networks

Audio Set classification with attention model: A probabilistic perspective

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Speech enhancement with weakly labelled data from AudioSet

Meta learning based audio tagging

Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation

Staged training strategy and multi-activation for audio tagging with noisy and sparse multi-label data

A Joint Detection-Classification Model for Weakly Supervised Sound Event Detection Using Multi-Scale Attention Method

An empirical study of weakly supervised audio tagging embeddings for general audio representations

Audio-Based Music Classification with DenseNet And Data Augmentation

Exploration of Audio Quality Assessment and Anomaly Localisation Using Attention Models

A Closer Look at Weak Label Learning for Audio Events

Multi-label Zero-Shot Audio Classification with Temporal Attention

An Attention-Based Neural Network Approach For Single Channel Speech Enhancement

SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention

A Deep Neural Network for Audio Classification with a Classifier Attention Mechanism

Improving Speech Enhancement Using Audio Tagging Knowledge from Pre-Trained Representations and Multi-Task Learning

Jointly Trained Sequential Labeling and Classification by Sparse Attention Neural Networks

Audio tagging with noisy labels and minimal supervision

Joint Music and Language Attention Models for Zero-shot Music Tagging

An Attention Mechanism for Musical Instrument Recognition