Abstract: Unlike traditional long videos, micro-videos are much shorter and are usually recorded at a specific place with mobile devices. To better understand the semantics of a micro-video and facilitate downstream applications, it is crucial to estimate the venue where the micro-video was recorded, for example, at a concert or on a beach. However, according to our statistics over two million micro-videos, only $1.22\%$ of them were labeled with location information. For the large remainder of micro-videos without location information, we have to rely on their content to estimate their venue categories. This is a highly challenging task, as micro-videos are naturally multi-modal (with textual, visual, and acoustic content), and, more importantly, the quality of each modality varies greatly across micro-videos. In this work, we focus on enhancing the acoustic modality for the venue category estimation task. This is motivated by our finding that although the acoustic signal complements the visual and textual signals well in reflecting a micro-video's venue, its quality is usually lower. As such, simply integrating acoustic features with visual and textual features leads to suboptimal results, or even degrades the overall performance (cf. the barrel theory). To address this, we propose to compensate for the shortest board, the acoustic modality, by harnessing external sound knowledge. We develop a deep transfer model that jointly enhances the concept-level representation of micro-videos and the venue category prediction. To alleviate the sparsity problem of unpopular categories, we further regularize the representation learning of micro-videos of the same venue category. Through extensive experiments on a real-world dataset, we show that, by leveraging external acoustic knowledge, our model significantly outperforms the state-of-the-art method in terms of both Micro-F1 and Macro-F1 scores.

Enhancing Micro-video Understanding by Harnessing External Sounds
