Abstract: Audio sentiment analysis is a popular research area that extends conventional text-based sentiment analysis and depends on the effectiveness of acoustic features extracted from speech. However, current work on audio sentiment analysis mainly focuses on extracting homogeneous acoustic features or does not fuse heterogeneous features effectively. In this paper, we propose an utterance-based deep neural network model that combines a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) based network in parallel to obtain representative features, termed the Audio Sentiment Vector (ASV), that maximally reflect the sentiment information in an audio clip. Specifically, our model is trained with utterance-level labels, and the ASV is extracted and fused from two branches. In the CNN branch, spectrum graphs produced from the signals are fed as inputs, while in the LSTM branch, the inputs include spectral features and cepstrum coefficients extracted from the dependent utterances in the audio. In addition, a Bidirectional Long Short-Term Memory (BiLSTM) network with an attention mechanism is used for feature fusion. Extensive experiments show that our model recognizes audio sentiment precisely and quickly, and that our ASV outperforms traditional acoustic features and vectors extracted from other deep learning models. Furthermore, the experimental results indicate that the proposed model outperforms the state-of-the-art approach by 9.33% on the Multimodal Opinion-level Sentiment Intensity (MOSI) dataset.
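
The two-branch fusion idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each branch has already produced one vector per utterance, and it replaces the BiLSTM with attention by plain attention-weighted pooling over utterances with a fixed query vector. All function names, vector sizes, and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(cnn_vecs, lstm_vecs):
    """Concatenate the per-utterance vectors from the two branches."""
    return [c + l for c, l in zip(cnn_vecs, lstm_vecs)]


def attention_pool(utterance_vecs, query):
    """Score each utterance by its dot product with `query`, then return
    the softmax-weighted sum of the utterance vectors and the weights."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query))
              for v in utterance_vecs]
    weights = softmax(scores)
    dim = len(utterance_vecs[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, utterance_vecs))
              for d in range(dim)]
    return pooled, weights


# Toy inputs (assumed): three utterances, a 2-d CNN-branch vector and a
# 1-d LSTM-branch vector for each.
cnn_vecs = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.3]]
lstm_vecs = [[0.5], [0.7], [0.1]]
query = [1.0, 1.0, 1.0]  # hypothetical stand-in for a learned attention query

fused = fuse_branches(cnn_vecs, lstm_vecs)
pooled, weights = attention_pool(fused, query)
print(weights)
print(pooled)
```

In a trained model the query and the per-utterance vectors would be learned from utterance-level labels; here they are fixed only to show how attention weights decide each utterance's contribution to the fused vector.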

Audio Feature Learning with Triplet-Based Embedding Network

Triplet Convolutional Network for Music Version Identification

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Personalized Music Recommendation with Triplet Network

End-to-End Feature Learning for Text-Independent Speaker Verification

Self-Attention Networks for Text-Independent Speaker Verification

Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Deep Ranking: Triplet MatchNet for Music Metric Learning

Dual Path Embedding Learning for Speaker Verification with Triplet Attention

Triplet Enhanced AutoEncoder: Model-free Discriminative Network Embedding

SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody

Multi-view Audio and Music Classification

Using Deep Belief Network to Capture Temporal Information for Audio Event Classification

Regression-based music emotion prediction using triplet neural networks

Audio-Based Music Classification with DenseNet And Data Augmentation

Audio Embeddings as Teachers for Music Classification

Stereo Feature Enhancement and Temporal Information Extraction Network for Automatic Music Transcription

Triplet-Center Loss Based Deep Embedding Learning Method for Speaker Verification

Improving Triplet-Wise Training Of Convolutional Neural Network For Vehicle Re-Identification

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification