Abstract: Object-oriented micro-video background music recommendation is a complicated task where the matching degree between videos and background music is a major issue. However, music selections in user-generated content (UGC) are prone to selection bias caused by the historical preferences of uploaders. Since historical preferences are not fully reliable and may reflect obsolete behaviors, over-reliance on them should be avoided as knowledge and interests dynamically evolve. In this paper, we propose a Deconfounded Cross-Modal (DecCM) matching model to mitigate such bias. Specifically, uploaders' personal preferences for music genres are identified as confounders that spuriously correlate music embeddings and background music selections, causing the learned system to over-recommend music from majority groups. To resolve such confounders, backdoor adjustment is used to deconfound the spurious correlation between music embeddings and prediction scores. We further use a Monte Carlo (MC) estimator with batch-level averaging as an approximation, which avoids integrating over the entire confounder space required by the adjustment. Furthermore, we design a teacher-student network that exploits the matching in music videos, which are professionally generated content (PGC) with specialized matching, to better recommend content-matching background music. The PGC data is modeled by a teacher network that guides the matching of uploader-selected UGC data in the student network through Kullback-Leibler-based knowledge transfer. Extensive experiments on the TT-150k-genre dataset demonstrate the effectiveness of the proposed method. The code is publicly available at: https://github.com/jing-1/DecCM.


### What problems does this paper attempt to solve?

This paper aims to solve the selection bias problem in micro-video background music recommendation. Specifically:

1. **Selection bias caused by uploaders' historical preferences**
   - Uploaders on user-generated content (UGC) platforms are often influenced by their personal historical preferences when choosing background music. These preferences can lead to over-recommendation of certain music genres, which exacerbates exposure bias and the "echo chamber" effect.
   - For example, users who like hip-hop are more likely to choose hip-hop as background music, so a recommender trained on their choices tends to favor mainstream genres and ignore other, potentially more suitable content.
2. **Inexpert matching**
   - Many amateur uploaders lack expertise in choosing background music, so the video and its music are often poorly matched. Training on such low-quality matches degrades the recommender and makes it harder to recommend well-matched background music for new videos.
3. **Dynamically changing interests**
   - Historical preferences are not always reliable because users' interests and knowledge change over time. Over-relying on historical data may leave the recommender unable to adapt to users' latest interests.

To address these problems, the authors propose a Deconfounded Cross-Modal (DecCM) matching model that reduces the impact of uploaders' historical preferences on background music recommendation and improves the robustness and matching quality of the recommender.
### Key points of the solution

- **Deconfounding**: Uploaders' genre preferences are treated as confounders in a causal graph, and backdoor adjustment is used to remove the spurious correlation they induce between music embeddings and background music selection.
- **Teacher-student network**: High-quality matches from professionally generated content (PGC) are modeled by a teacher network, which guides the student network trained on user-generated content (UGC) through Kullback-Leibler-based knowledge transfer, so that recommended music better matches the video content.
- **Monte Carlo estimation**: The backdoor adjustment is approximated with a Monte Carlo estimator that uses a batch-level average, which avoids integrating over the entire confounder space.

Together, these components are intended to alleviate selection bias and to provide more diverse, better-matched background music recommendations.
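
The two mechanisms summarized above, a batch-level Monte Carlo approximation of the backdoor adjustment and a KL-based teacher-student transfer, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation (that lives in the linked repository). The function names `mc_backdoor_scores` and `kl_distillation_loss`, the additive way a confounder vector enters the score, the sigmoid scoring function, the `temperature` parameter, and the direction of the KL term are all hypothetical choices made for illustration.

```python
import numpy as np


def mc_backdoor_scores(video_emb, music_emb, confounder_emb):
    """Monte Carlo estimate of a backdoor-adjusted matching score.

    Backdoor adjustment in general form:
        P(Y | do(X)) = sum_z P(Y | X, z) P(z)
    Here the sum over z is approximated by averaging over the confounder
    vectors that happen to be in the current batch.

    video_emb:      (N, D) video embeddings
    music_emb:      (N, D) music embeddings, paired row-wise with videos
    confounder_emb: (B, D) confounder samples (e.g. genre-preference vectors)

    ASSUMPTION: score(v, m, z) = sigmoid(v . (m + z)). The paper's actual
    scoring function may differ.
    """
    base = (video_emb * music_emb).sum(axis=-1)   # (N,)   v_i . m_i
    cross = confounder_emb @ video_emb.T          # (B, N) v_i . z_b
    logits = base[None, :] + cross                # (B, N) v_i . (m_i + z_b)
    scores = 1.0 / (1.0 + np.exp(-logits))        # sigmoid per (z_b, pair i)
    return scores.mean(axis=0)                    # batch-level average over z


def _log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over a batch of matching-score rows.

    teacher_logits, student_logits: (N, K) scores of N videos against K
    candidate music tracks. ASSUMPTION: the teacher (PGC) distribution is
    the target and the student (UGC) distribution is pulled toward it.
    """
    log_p_teacher = _log_softmax(teacher_logits / temperature)
    log_p_student = _log_softmax(student_logits / temperature)
    p_teacher = np.exp(log_p_teacher)
    kl_per_row = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(kl_per_row.mean())
```

Averaging over the confounders in the batch keeps the cost at one matrix product per batch rather than a pass over every possible confounder value. In a training loop, a distillation term like this would typically be added to the student's matching loss with a weighting coefficient, which is again an assumption here.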

Deconfounded Cross-modal Matching for Content-based Micro-video Background Music Recommendation

Debiased Cross-modal Matching for Content-based Micro-video Background Music Recommendation

Contrastive Intra- and Inter-Modality Generation for Enhancing Incomplete Multimedia Recommendation

Cross-modal Variational Auto-encoder for Content-based Micro-video Background Music Recommendation

Learning to Embed Music and Metadata for Context-Aware Music Recommendation

Learning Music Embedding with Metadata for Context Aware Recommendation

MIC: Model-agnostic Integrated Cross-channel Recommender

Background Music Recommendation on Short Video Sharing Platforms

Video-Music Retrieval: A Dual-Path Cross-Modal Network

Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval

Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint

Unified Pretraining Target Based Video-music Retrieval With Music Rhythm And Video Optical Flow Information

Audio-Visual Embedding for Cross-Modal MusicVideo Retrieval through Supervised Deep CCA

Improving Micro-video Recommendation via Contrastive Multiple Interests

Enhancing Music Recommendation with Social Media Content: an Attentive Multimodal Autoencoder Approach

MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video

Multimodal Graph Contrastive Learning for Multimedia-Based Recommendation

CAME: Content- and Context-Aware Music Embedding for Recommendation

Personalized Micro-video Recommendation Based on Multi-modal Features and User Interest Evolution

Deep Content-User Embedding Model for Music Recommendation