Tracking based on scale-estimated deep networks with hierarchical correlation ensembling for cross-media understanding

Hanqiao Huang, Yamin Han, Peng Zhang, Wei Huang
DOI: https://doi.org/10.1016/j.displa.2021.102055
IF: 3.074
2021-09-01
Displays
Abstract: In many vision-based cross-media applications, objects of interest within the visual region must be accurately localized and tracked to support more effective understanding and generating image descriptions (UGID), for example in audio-visual lip recognition. Unfortunately, robust tracking in realistic scenarios is challenged by dynamic appearance variations as the object moves. Recent studies on deep neural networks for classification and recognition tasks have inspired great progress in visual tracking, but the intrinsic assumption of scale invariance in target modeling still limits further improvement in tracking performance. Motivated by learning the object appearance together with a scale estimate, this study proposes a scale-estimated deep network (SEN) to predict the object size more accurately during tracking. By incorporating the proposed SEN into a hierarchical correlation ensembling framework, a joint translation-scale tracking scheme is obtained that estimates the position and scale of the target object simultaneously. Extensive experiments on challenging benchmark datasets demonstrate that the proposed tracker achieves competitive results. Additionally, a performance evaluation on lip tracking shows that the proposed work can also support audio-visual recognition tasks in different types of cross-media applications.
engineering, electrical & electronic; instruments & instrumentation; optics; computer science, hardware & architecture