Indexing and Mining Near-Duplicates for Advanced Multimedia Applications

Jia Jun Liu
2012-01-01
Abstract: Multimedia data has gained remarkable significance in the past decades, especially with the popularization of the Internet and its related applications. In computer science research, the term "multimedia data" generally refers to videos, images and audio. In this thesis, we focus on videos and images in particular. The unprecedented and ever-growing scale of today's multimedia data has led to the widespread existence of near-duplicates. This phenomenon jeopardizes the effectiveness and user experience of today's multimedia data-centric applications. Consequently, the task of finding near-duplicates is becoming increasingly significant for the research and industry communities, not only for directly improving the user experience of everyday applications but also for catering to the needs of enterprise applications such as database cleansing, copyright protection, internal data management and video stream monitoring. In this thesis, we discuss four specific research problems for different applications, all of which are largely based on near-duplicate retrieval and mining.
In the first part, we study how to eliminate near-duplicate images in order to present a location with diverse views. The recent explosive growth of geo-tagged photos has enabled many large-scale applications, such as location-based photo browsing and landmark recognition. The existence of massive numbers of near-duplicate geo-tagged photos greatly hampers effective presentation in these applications. In this part, we devise a location presentation framework to efficiently retrieve and present diverse scenes captured within a local proximity. Photos that are novel in terms of capture location and visual content are identified and returned in response to a query location to provide diverse views. For real-time response and good scalability, a new Hybrid Index structure is proposed.
The second part continues the study of near-duplicate images by proposing a novel approach for the discovery of Areas of Interest (AoIs) based on near-duplicate retrieval and location-based mining. By analyzing both geo-tagged images and check-ins, the approach exploits travelers' tastes as well as local residents' preferences in daily-life activities to find AoIs in a city. The proposed approach consists of two major steps. First, we devise a density-based clustering method to discover AoIs, based mainly on image densities but reinforced by the secondary densities of the images' neighboring venues. Then we propose a novel joint authority analysis framework to rank AoIs. The framework simultaneously considers both location-location transitions and user-location relations. Finally, we use an interactive presentation interface that utilizes near-duplicate clusters to present the AoIs.
In the third part, we use canonical correlation analysis to retrieve heavily changed near-duplicate videos. Very often, near-duplicate videos exhibit great content changes while the user perceives little change in information; for example, color features change significantly when a color video is transformed with a blue filter. These feature changes propagate into low-level video similarity computations, making conventional similarity-based near-duplicate video retrieval techniques incapable of accurately capturing the implicit relationship between two near-duplicate videos with fairly large content modifications. The intuition is that near-duplicate videos should preserve strong information correlation in spite of intensive content changes. In the proposed approach, instead of directly computing the similarity between videos, we adopt canonical correlation analysis to find their possible relation by composing transformations that maximize their correlation.
In the last part, we present a new string paradigm called VideoGram for large-scale video sequence indexing, targeting applications such as near-duplicate sequence retrieval. In VideoGram, the feature space is modelled as a set of visual words. Each database video sequence is mapped into a Sequence-of-Visual-Words (SoVW), which is further abstracted as a string. A gram-based indexing structure is then built to tackle the effect of the "curse of dimensionality" and to support video subsequence matching. A high-dimensional query video sequence is first expanded into multiple Sequences-of-Likelihood-Visual-Words (SoLVW), which also capture the spatial closeness between query frames and visual words. Video sequence search is then performed by matching the query SoLVW against candidate SoVW, avoiding high-dimensional similarity computations. A novel query expansion method based on the visual word language model is proposed to offset the quantization effect of mapping a high-dimensional sequence to a string.
The discussion of each of the above problems follows a similar structure: it begins with a brief introduction to the motivation and application background, followed by the technical details of the proposed approach. The experimental settings and results are then given, comparing the proposed approach against competing methods, mainly on real datasets. Synthetic datasets are also used in a few cases.
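To make the gram-based indexing idea in the VideoGram part more concrete, the following is a minimal sketch and not the thesis's actual implementation. It assumes a tiny hand-made codebook of 2-D visual words, hard assignment of each frame to its nearest word, and a plain inverted index over bigrams; all function names (`nearest_word`, `to_sovw`, `build_index`, `search`) and the toy data are hypothetical. The SoLVW expansion and the visual word language model described above are deliberately left out.

```python
from collections import defaultdict

def nearest_word(frame, codebook):
    """Index of the codebook entry closest to `frame` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(frame, codebook[i])),
    )

def to_sovw(frames, codebook):
    """Map a frame sequence to a sequence of visual-word ids (a 'string')."""
    return tuple(nearest_word(f, codebook) for f in frames)

def grams(seq, n):
    """All contiguous n-grams of a sequence, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def build_index(videos, codebook, n=2):
    """Inverted index: n-gram of word ids -> list of (video id, start position)."""
    index = defaultdict(list)
    for vid, frames in videos.items():
        for pos, g in enumerate(grams(to_sovw(frames, codebook), n)):
            index[g].append((vid, pos))
    return index

def search(query_frames, index, codebook, n=2):
    """Rank videos by how many query n-grams they share (most shared first)."""
    votes = defaultdict(int)
    for g in grams(to_sovw(query_frames, codebook), n):
        for vid, _pos in index.get(g, []):
            votes[vid] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data (assumed for illustration): four visual words at the unit-square corners.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
videos = {
    "a": [(0.1, 0.0), (0.9, 0.1), (1.0, 0.9), (0.1, 1.0)],  # words 0, 1, 3, 2
    "b": [(0.0, 0.9), (0.1, 0.8), (0.1, 0.1), (0.8, 0.0)],  # words 2, 2, 0, 1
}
index = build_index(videos, codebook)
print(search([(1.0, 0.2), (0.9, 1.0)], index, codebook))  # [('a', 1)]
```

Because lookups operate on short word-id grams rather than on the original high-dimensional frame features, a query subsequence can be matched anywhere inside a longer database sequence without computing frame-to-frame distances against every candidate.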