Cues extraction and hierarchical HMM based events inference in soccer video
Guoying Jin,Linmi Tao,Guangyou Xu
DOI: https://doi.org/10.1049/ic.2005.0713
2005-01-01
Abstract:This paper proposed a systematical framework to address the essential problem, the semantic gap between extractable lowlevel features and meaningful high-level semantics, in content-based retrieval. Low-level features, which can be directly extracted from video streams, are color histogram, inter-frame differences, edges, etc. Theoretically, it is possible to detect events from these features based on hidden Markov models (HMMs) or dynamical Bayesian networks (DBNs), but, in practice, the models are too complicate to be built and to be trained. This paper proposed to employ cues as a middle stone to bridge the gap between the low-level features and the highlevel events. Cues have two salient characteristics: they hold causality with the events, and they can be deduced from features or extracted from video streams. Based on this idea, a systematical framework is constructed to analysis soccer videos, which is selected as a test bed for the fact that events (i.e. shoot event, foul event, and normal process) can be clearly defined in soccer game. First of all, the input video stream is segmented to shots based on the features directly extracted from videos; secondly, se-mantic cues, such as slow motion replay, face of player, caption, goalmouth, shot frequency, etc., are deduced or extracted from the shots; thirdly, three HMMs are built and trained to infer the three events from cues. In general video streams contain more than one event, thus an unavoidable problem is shots should be group to sets, in which there is only one event, for HMM-based events inference. In other words, shots should be appropriately grouped into sets of shots, so that the input observation sequences (a set of shots) fed into HMMs fit at least one of the models. Due to this self-enwound control structure, a hierarchical HMM (HHMM) is employed to group shots and to recognize events simultaneously in video stream. The experiments show the system is effective and robust in inferring events from roughly deduced or extracted cues.