Abstract: The recent boom in social media platforms has led people to generate large volumes of social media data. These data touch almost every aspect of life and can carry significant societal and marketing value for a variety of corporations and organizations. The development of effective techniques for gathering and analyzing social media content has therefore attracted much research attention. Because social media data tend to be heterogeneous, conversational, and fast-evolving in content, a recent work proposed a multifaceted approach that gathers comprehensive brand-related data by crawling with evolving keywords, key users, similar image content, and known locations. Although such an approach has been found effective in gathering representative data, it also brings in a lot of noise. This paper aims to develop an accurate classifier that filters out this noise by taking into account the multimedia content and social nature of brand-related data. In particular, we develop a microblog filtering method based on a discriminative, social-aware multiview embedding. Besides the conventional content-based features that form the three key views of a microblog (textual features, low-level visual features, and high-level visual semantic features), we also incorporate the brand and social relations among microblogs to learn a discriminative and social-aware embedding. With the learned embedding, an off-the-shelf classifier, such as an SVM, can then be trained and applied to microblog filtering. We verify the efficacy of our method for noise filtering in the brand data gathering task on the Brand-Social-Net dataset. Our approach achieves significantly better filtering performance and improves the quality of brand data gathering.
