MSAM: A Multi-Layer Bi-LSTM Based Speech to Vector Model with Residual Attention Mechanism

Dongdong Cui, Shouyi Yin, Jiangyuan Gu, Leibo Liu, Shaojun Wei
DOI: https://doi.org/10.1109/edssc.2019.8753946
2019-01-01
Abstract: Word embedding is one of the most popular representations of a document vocabulary. It is capable of capturing the context of words in a document as well as their semantic and syntactic similarity. Word2vec is a well-known technique for learning word embeddings of fixed dimensionality with shallow neural networks, and it can also be used to transform the audio segment of each word into a vector. In this paper, a speech-to-vector model based on a deep neural network is proposed to learn the vector directly from the speech segment, such that the vector represents some semantic information. Unlike previous methods such as speech2vec [1], our proposed model adopts a high-performance parser based on the residual attention mechanism, which uses a multi-layer bi-directional long short-term memory (LSTM) network to learn representations of the audio segment. Finally, our proposed speech-to-vector model is analyzed and evaluated on 12 public datasets that are widely used as word similarity and word analogy benchmarks.
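The abstract does not say how the word similarity and word analogy benchmarks are scored. As a minimal sketch of how such benchmarks are commonly evaluated, the Python below computes cosine similarity between two embeddings and solves an analogy with the vector-offset method. The function names and the three-dimensional toy vectors are hypothetical illustrations, not outputs of the proposed model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def analogy(emb, a, b, c):
    """Answer "a is to b as c is to ?" with the vector-offset method.

    Returns the word whose embedding is closest (by cosine) to
    emb[b] - emb[a] + emb[c], excluding the three query words.
    """
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, emb[w]))


# Hypothetical toy embeddings, chosen by hand so the analogy is easy to follow.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}

similarity = cosine(toy["king"], toy["queen"])
answer = analogy(toy, "man", "woman", "king")
```

In a word similarity benchmark, scores like `similarity` would typically be rank-correlated with human ratings across all word pairs; in an analogy benchmark, the fraction of queries for which `answer` matches the expected word gives the accuracy.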