A Neural ODE and Transformer-based Model for Temporal Understanding and Dense Video Captioning

Sainithin Artham, Soharab Hossain Shaikh
DOI: https://doi.org/10.1007/s11042-023-17809-1
IF: 2.577
2024-01-13
Multimedia Tools and Applications
Abstract: Dense video captioning is a challenging task. Generating detailed and precise captions for every moment in a video requires a deep understanding of both visual and temporal nuances. In this study, we present a method that addresses this challenge by combining the VidSwin transformer with the Liquid Time Constant (LTC) network, a neural ordinary differential equation (ODE) model, with a focus on both efficiency and adaptability. Unlike traditional models such as LSTM and GRU, which often demand thousands of neurons for localization tasks, our approach achieves similar localization performance with fewer than 100 neurons. This compact architecture excels at event localization, enabling us to generate comprehensive and contextually accurate captions from the localized content within the video. It captures spatio-temporal video representations efficiently while adapting to different temporal patterns. Extensive experiments were carried out on the YouCook2 and ActivityNet datasets. The results show that our method outperforms many existing methods in generating both event proposals and video captions.
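The abstract describes the LTC network only as a neural ODE model with fewer than 100 neurons. As a rough illustration of that kind of unit, the sketch below implements one fused (semi-implicit Euler) update of a generic liquid time-constant layer, following the commonly published form dx/dt = -(1/tau + f) x + f A, where f is a sigmoid gate over the input and the current state. It is not the paper's implementation: the function names, the parameter layout, the 64-neuron size, and the 8-dimensional random vectors standing in for per-frame video features are all assumptions made for this example.

```python
import math
import random


def ltc_step(x, inputs, params, dt=0.1):
    """One fused (semi-implicit Euler) update of a generic LTC layer.

    Discretizes dx/dt = -(1/tau + f) * x + f * A, where the gate
    f = sigmoid(W_in . inputs + W_rec . x + b) depends on input and state,
    so the effective time constant of each neuron varies over time.
    """
    new_x = []
    for i in range(len(x)):
        pre = params["b"][i]
        pre += sum(w * u for w, u in zip(params["W_in"][i], inputs))
        pre += sum(w * s for w, s in zip(params["W_rec"][i], x))
        f = 1.0 / (1.0 + math.exp(-pre))           # state-dependent gate
        num = x[i] + dt * f * params["A"][i]        # explicit drive term
        den = 1.0 + dt * (1.0 / params["tau"][i] + f)  # implicit decay term
        new_x.append(num / den)
    return new_x


def init_params(n_neurons, n_inputs, seed=0):
    """Hypothetical random initialization, for illustration only."""
    rng = random.Random(seed)

    def small():
        return rng.uniform(-0.5, 0.5)

    return {
        "W_in": [[small() for _ in range(n_inputs)] for _ in range(n_neurons)],
        "W_rec": [[small() for _ in range(n_neurons)] for _ in range(n_neurons)],
        "b": [small() for _ in range(n_neurons)],
        "A": [rng.uniform(-1.0, 1.0) for _ in range(n_neurons)],
        "tau": [rng.uniform(0.5, 2.0) for _ in range(n_neurons)],
    }


if __name__ == "__main__":
    n_neurons, n_inputs = 64, 8  # assumed sizes, below the 100-neuron figure
    params = init_params(n_neurons, n_inputs)
    rng = random.Random(1)
    state = [0.0] * n_neurons
    for _ in range(20):  # 20 stand-in "frames"
        frame_features = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
        state = ltc_step(state, frame_features, params)
    print(len(state), max(abs(v) for v in state))
```

Because the decay term sits in the denominator, the update keeps each state value bounded by the magnitude of its bias vector entry A, which is one reason this fused form is often preferred over a plain explicit Euler step for stiff dynamics.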
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering