Abstract: To simulate human interaction in real life, dialog systems are introduced to generate a response to previous chat utterances. Several studies have addressed two-speaker video dialog in the form of question answering. However, more informative semantic cues can be exploited when multiple speakers chat about or discuss the video over multiple rounds, so multi-speaker video dialog is more applicable in real life. In addition, speakers typically chat about one sub-segment of a long video fragment for a period of time. Current video dialog systems must be given the relevant video sub-segment directly, yet in practical applications it is hard to accurately locate the sub-segment that the speakers are discussing. In this paper, we introduce a novel task, Multi-Speaker Video Dialog with frame-level Temporal Localization (MSVD-TL), to make video dialog systems more applicable. Given a long video fragment and a set of chat history utterances, MSVD-TL aims to simultaneously predict the following response and localize the relevant video sub-segment at the frame level. We develop a new multi-task model with a response prediction module and a frame-level temporal localization module. We also focus on the characteristics of the video dialog generation process and exploit the relations among the video fragment, the chat history, and the following response to refine their representations. We evaluate our approach on both the Multi-Speaker Video Dialog without frame-level temporal localization (MSVD w/o TL) task and the MSVD-TL task. The experimental results further demonstrate that MSVD-TL enhances the applicability of video dialog in real life.

Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

From FiLM to Video: Multi-turn Question Answering with Multi-modal Context

Multimodal Analysis for Deep Video Understanding with Video Language Transformer

Video Dialog via Progressive Inference and Cross-Transformer

TAVT: Towards Transferable Audio-Visual Text Generation

VSET: A Multimodal Transformer for Visual Speech Enhancement

Multimodal Dialogue Generation Based on Transformer and Collaborative Attention

Multi-Speaker Video Dialog with Frame-Level Temporal Localization

Some Can Be Better Than All: Multimodal Star Transformer for Visual Dialog

ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog

DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

A Study on Joint Modeling and Data Augmentation of Multi-Modalities for Audio-Visual Scene Classification

Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog

TransAVS: End-to-End Audio-Visual Segmentation with Transformer