Scale-invariant visual language modeling for object categorization

Lei Wu, Yang Hu, Mingjing Li, Nenghai Yu, Xian-Sheng Hua
DOI: https://doi.org/10.1109/TMM.2008.2009692
IF: 7.3
2009-01-01
IEEE Transactions on Multimedia
Abstract: In recent years, "bag-of-words" models, which treat an image as a collection of unordered visual words, have been widely applied in the multimedia and computer vision fields. However, because they disregard the spatial structure among visual words, they cannot discriminate between objects with similar word frequencies but different spatial distributions of words. In this paper, we propose a visual language modeling method (VLM), which incorporates the spatial context of the local appearance features into the statistical language model. To represent the object categories, models with different orders of statistical dependencies have been exploited. In addition, the multilayer extension to the VLM makes it more resistant to scale variations of objects. The model is effective and applicable to large-scale image categorization. We train scale-invariant visual language models on images grouped by Flickr tags, and use these models for object categorization. Experimental results show that they achieve better performance than single-layer visual language models and "bag-of-words" models. They also achieve performance comparable to 2-D MHMM and SVM-based methods, at a much lower computational cost.
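The core idea of the abstract, scoring an image by how likely its visual words are given their spatial neighbors, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the image is taken to be a small grid of visual-word ids, the context is only the left-hand neighbor (a bigram), smoothing is add-one, and the multilayer scale handling is omitted. The names `train_bigram`, `log_likelihood`, and `classify` are hypothetical.

```python
import math
from collections import defaultdict


def train_bigram(grids, vocab_size):
    """Count left-neighbor -> word transitions over grids of visual-word ids.

    Assumption: each image is a 2-D list of integer word ids, and the only
    spatial context used is the word immediately to the left.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for grid in grids:
        for row in grid:
            for left, word in zip(row, row[1:]):
                counts[left][word] += 1
    return {"counts": counts, "vocab_size": vocab_size}


def log_likelihood(model, grid):
    """Sum of log P(word | left neighbor) with add-one smoothing (an assumption)."""
    counts, vocab_size = model["counts"], model["vocab_size"]
    total = 0.0
    for row in grid:
        for left, word in zip(row, row[1:]):
            context = counts.get(left, {})
            numerator = context.get(word, 0) + 1
            denominator = sum(context.values()) + vocab_size
            total += math.log(numerator / denominator)
    return total


def classify(models, grid):
    """Return the category whose model gives the grid the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(models[name], grid))


# Two toy categories with identical word frequencies (four 0s and four 1s)
# but different spatial layouts, so a bag-of-words histogram cannot separate them.
stripes = [[0, 1, 0, 1], [0, 1, 0, 1]]
blocks = [[0, 0, 1, 1], [0, 0, 1, 1]]
models = {
    "stripes": train_bigram([stripes], vocab_size=2),
    "blocks": train_bigram([blocks], vocab_size=2),
}

striped_query = [[1, 0, 1, 0], [1, 0, 1, 0]]
blocky_query = [[1, 1, 0, 0], [1, 1, 0, 0]]
print(classify(models, striped_query))
print(classify(models, blocky_query))
```

Both toy categories share the same word histogram, so only the neighbor statistics distinguish them, which is the limitation of "bag-of-words" models that the abstract points to. A higher-order model would condition on more neighbors in the same way.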