Abstract: In video understanding, a core task is to detect the objects in each frame, and recent state-of-the-art methods aim to detect every possible object exhaustively. Few studies distinguish these objects by the importance of their roles, that is, by which objects the audience is most concerned with and pays the most attention to in a short video clip. This paper represents the audience's attention transfer mechanism through attractive objects (key objects) and proposes a key object detection network based on temporal reinforcement learning (RL). Temporal RL means that our model uses an LSTM to output RL reward values dynamically. In a sense, this avoids a fixed, predefined reward setting, which may cause the RL result to fall short of expectations. To mark the object that attracts the audience's attention across successively switching video clips, our method keeps track of the attractive object by iterating the value of the temporal RL strategy. The proposed model can detect multiple key objects simultaneously with the help of a temporal RL strategy that analyzes the attention transfer. Specifically, the spatial and temporal features extracted by the ensemble convolutional networks are sent into the RL model to represent the objects, their corresponding motions, and the less relevant backgrounds. The Google AVA [24] dataset, which annotates the positions and actions of the main characters among many existing objects, is selected as the experimental dataset. Compared with other models, the results show that the proposed method can continuously focus on the attractive objects in the sequenced fragments. Unlike the general structure of deep reinforcement learning models based on DQN, our temporal RL parameters are mainly generated and calculated in the KOD-attention block. Therefore, in simulating the attention transfer mechanism, our method uses lightweight computation to achieve satisfactory precision and speed in key object detection.
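The idea of an LSTM that outputs RL reward values dynamically can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `LSTMRewardHead`, the feature and hidden sizes, the random weights, and the tanh read-out are all assumptions chosen for illustration. The only point it shows is that the reward for each frame is computed from a recurrent state that summarizes the preceding frames, instead of being given by a fixed rule.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMRewardHead:
    """Hypothetical sketch: one LSTM cell plus a scalar reward read-out."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(hidden_dim)
        in_dim = feat_dim + hidden_dim
        # Stacked weights for the input, forget, candidate, and output gates.
        self.W = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
                  for _ in range(4 * hidden_dim)]
        self.b = [0.0] * (4 * hidden_dim)
        # Linear read-out mapping the hidden state to a scalar (assumed design).
        self.w_r = [rng.uniform(-scale, scale) for _ in range(hidden_dim)]
        self.hidden_dim = hidden_dim

    def step(self, x, h, c):
        """Consume one frame feature vector; return (reward, new_h, new_c)."""
        xh = list(x) + list(h)
        z = [sum(w * v for w, v in zip(row, xh)) + b
             for row, b in zip(self.W, self.b)]
        n = self.hidden_dim
        i, f, g, o = z[:n], z[n:2 * n], z[2 * n:3 * n], z[3 * n:]
        c = [sigmoid(fk) * ck + sigmoid(ik) * math.tanh(gk)
             for ik, fk, gk, ck in zip(i, f, g, c)]
        h = [sigmoid(ok) * math.tanh(ck) for ok, ck in zip(o, c)]
        # tanh keeps the reward bounded in [-1, 1].
        reward = math.tanh(sum(w * hk for w, hk in zip(self.w_r, h)))
        return reward, h, c

    def rewards(self, features):
        """Return one reward per frame for a sequence of feature vectors."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        out = []
        for x in features:
            r, h, c = self.step(x, h, c)
            out.append(r)
        return out


# Stand-in for per-frame features; a real system would take these from CNNs.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
head = LSTMRewardHead(feat_dim=8, hidden_dim=16)
clip_rewards = head.rewards(frames)
```

In a trained system the weights would be learned jointly with the detection policy; here they are random, so the reward values themselves carry no meaning.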

Video Key Object Detection Network via Reinforcement Learning