Abstract: Temporal moment localization using natural language (TMLNL) is an emerging computer vision task that localizes a specific moment inside a long, untrimmed video. The goal of TMLNL is to output the video moment that is most closely related to the input query. Previous research focused on the visual portion of TMLNL, such as objects, backdrops, and other visual attributes, while relying largely on natural language processing (NLP) techniques for the textual portion. A long query requires sufficient context to properly localize moments within a long untrimmed video. Because query handling was not fully understood, performance deteriorated, especially for longer queries. In this paper, we treat TMLNL as a special variant of video question answering (VQA), giving equal consideration to visual and textual elements through our proposed VQA joint visual-textual framework (JVTF). We also handle complex and long input queries without employing NLP techniques, by refining coarse-grained representations into fine-grained representations of distinct granularity. To this end, we propose a novel bidirectional context predictor network (BCPN), in addition to the JVTF, to address the equal importance of videos and queries. The BCPN uses an approach called the query handler (QH) to search long input queries for insufficient context, and it helps the JVTF find the most relevant moment. Previously, increasing the number of encoding layers in transformers, LSTMs, and other NLP techniques caused words to recur; our QH reduces this repetition of word locations. The output of the BCPN is combined with the JVTF's guided attention to further improve the final result.
Through extensive experiments on three benchmark datasets, we show that the proposed BCPN outperforms the state-of-the-art methods by 2.65% at $IoU = 0.3$, 2.49% at $IoU = 0.5$, and 2.06% at $IoU = 0.7$.
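For readers unfamiliar with the IoU thresholds quoted above, the following is a minimal sketch of how temporal intersection over union and a thresholded recall are commonly computed in moment localization. The abstract does not specify the exact evaluation protocol, so the function names (`temporal_iou`, `recall_at_iou`), the `(start, end)` representation in seconds, and the use of a single top prediction per query are assumptions made for illustration only.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair; this representation is an assumption.
Moment = Tuple[float, float]


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection over union of two 1-D time intervals."""
    # Overlap is clipped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Moment], gts: List[Moment], threshold: float) -> float:
    """Fraction of queries whose top prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Hypothetical predictions and ground-truth moments for three queries.
preds = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0)]
gts = [(5.0, 15.0), (0.0, 10.0), (10.0, 15.0)]

for m in (0.3, 0.5, 0.7):
    print(f"IoU = {m}: recall = {recall_at_iou(preds, gts, m):.3f}")
```

A higher threshold demands tighter temporal overlap, which is why results at $IoU = 0.7$ are typically the hardest to improve.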

ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022

ReLER@ZJU Submission to the Ego4D Moment Queries Challenge 2022

A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge

ObjectNLQ @ Ego4D Episodic Memory Challenge 2024

Action Sensitivity Learning for the Ego4D Episodic Memory Challenge 2023

Temporal Moment Localization via Natural Language by Utilizing Video Question Answers as a Special Variant and Bypassing NLP for Corpora

NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

Language-enhanced object reasoning networks for video moment retrieval with text query

Multi-Level Query Interaction for Temporal Language Grounding

Natural Language Video Localization with Learnable Moment Proposals

RTQ: Rethinking Video-language Understanding Based on Image-text Model

Context-Enhanced Video Moment Retrieval with Large Language Models

ReLaX-VQA: Residual Fragment and Layer Stack Extraction for Enhancing Video Quality Assessment

Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding

Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos

A Simple LLM Framework for Long-Range Video Question-Answering

Single-Stage Visual Query Localization in Egocentric Videos

Multi-stage Aggregated Transformer Network for Temporal Language Localization in Videos

QD-VMR: Query Debiasing with Contextual Understanding Enhancement for Video Moment Retrieval