Towards End-2-end Learning for Predicting Behavior Codes from Spoken Utterances in Psychotherapy Conversations

Karan Singla, Zhuohao Chen, David C Atkins, Shrikanth Narayanan
DOI: https://doi.org/10.18653/v1/2020.acl-main.351
Abstract: Spoken language understanding tasks usually rely on pipelines involving complex processing blocks such as voice activity detection, speaker diarization, and automatic speech recognition (ASR). We propose a novel framework for predicting utterance-level labels directly from speech features, thus removing the dependency on first generating transcripts and enabling transcription-free behavioral coding. Our classifier uses a pretrained Speech-2-Vector encoder as a bottleneck to generate word-level representations from speech features. This pretrained encoder learns to encode speech features for a word using an objective similar to Word2Vec. Our proposed approach uses only speech features and word segmentation information to predict spoken utterance-level target labels. We show that our model achieves results competitive with other state-of-the-art approaches that use transcribed text for the task of predicting psychotherapy-relevant behavior codes.