Deconfounded Cross-modal Matching for Content-based Micro-video Background Music Recommendation

Jing Yi, Zhenzhong Chen
DOI: https://doi.org/10.1145/3650042
IF: 5
2024-03-06
ACM Transactions on Intelligent Systems and Technology
Abstract: Object-oriented micro-video background music recommendation is a complicated task where the matching degree between videos and background music is a major issue. However, music selections in user-generated content (UGC) are prone to selection bias caused by the historical preferences of uploaders. Since historical preferences are not fully reliable and may reflect obsolete behaviors, over-reliance on them should be avoided as knowledge and interests dynamically evolve. In this paper, we propose a Deconfounded Cross-Modal (DecCM) matching model to mitigate such bias. Specifically, uploaders’ personal preferences for music genres are identified as confounders that spuriously correlate music embeddings and background music selections, causing the learned system to over-recommend music from majority groups. To resolve such confounders, backdoor adjustment is utilized to deconfound the spurious correlation between music embeddings and prediction scores. We further use a Monte Carlo (MC) estimator with a batch-level average as the approximation, which avoids integrating over the entire confounder space required by the adjustment. Furthermore, we design a teacher-student network to utilize the matching of music videos, which are professionally generated content (PGC) with specialized matching, to better recommend content-matching background music. The PGC data is modeled by a teacher network that guides the student network's matching of uploader-selected UGC data through Kullback-Leibler-based knowledge transfer. Extensive experiments on the TT-150k-genre dataset demonstrate the effectiveness of the proposed method. The code is publicly available at: https://github.com/jing-1/DecCM.
computer science, information systems, artificial intelligence
What problem does this paper attempt to address?
### What problems does this paper attempt to solve?

This paper aims to solve the selection bias problem in micro-video background music recommendation. Specifically:

1. **Selection bias caused by uploaders' historical preferences**:
   - Uploaders on user-generated content (UGC) platforms are often influenced by their personal historical preferences when choosing background music. These preferences can lead to over-recommendation of certain music genres, exacerbating exposure bias and the "echo chamber" effect.
   - For example, users who like hip-hop are more likely to choose hip-hop tracks as background music, which pushes the recommendation system toward mainstream genres and away from other content that may suit the video better.
2. **Inexpert matching**:
   - Many amateur uploaders lack professional expertise when choosing background music, so the video and its background music are often poorly matched. Training on such low-quality matching data degrades the recommendation system and makes it harder to recommend well-matched background music for new videos.
3. **Dynamically changing interests**:
   - Historical preferences are not always reliable because users' interests and knowledge change over time. Over-reliance on historical data can leave the recommendation system unable to adapt to users' latest interests.

To address these problems, the authors propose the Deconfounded Cross-Modal (DecCM) matching model, which reduces the influence of uploaders' historical preferences on background music recommendation and improves the robustness and matching quality of the recommendation system.
### Key points of the solution

- **Deconfounding techniques**: Treat uploaders' historical genre preferences as confounders in a causal graph, and apply backdoor adjustment to remove the spurious correlation they induce between music embeddings and background music selections.
- **Teacher-student network**: Use the high-quality matching found in professionally generated content (PGC) to guide learning on user-generated content (UGC). A teacher network models the PGC data and transfers knowledge to the student network through a Kullback-Leibler-based loss, so that recommended background music better matches the video content.
- **Monte Carlo estimation**: Approximate the backdoor adjustment with a Monte Carlo estimator that uses a batch-level average, which avoids integrating over the entire confounder space.

With these methods, the model mitigates selection bias and aims to provide more diverse, better-matched background music recommendations.
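
The batch-level Monte Carlo approximation of the backdoor adjustment and the KL-based knowledge transfer can be illustrated with a minimal sketch. It is not the authors' implementation (see their repository for that). The score function `matching_score`, the additive genre-preference term, the toy vectors, and the KL direction (teacher to student) are all assumptions made for illustration. The idea shown is that the adjusted score P(Y | do(M)) = Σ_z P(Y | M, z) P(z) is estimated by averaging the score over the uploader preference vectors z that appear in a batch, rather than conditioning on one uploader's preference.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matching_score(video, music, genre, preference):
    """Hypothetical score f(v, m, z): video-music similarity plus the
    uploader's preference z for the music's genre, squashed by a sigmoid."""
    logit = dot(video, music) + preference[genre]
    return 1.0 / (1.0 + math.exp(-logit))


def deconfounded_score(video, music, genre, batch_preferences):
    """Batch-level Monte Carlo estimate of sum_z f(v, m, z) P(z):
    average the score over the preference vectors seen in the batch."""
    scores = [matching_score(video, music, genre, z) for z in batch_preferences]
    return sum(scores) / len(scores)


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student); eps guards against log(0)."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
    )


random.seed(0)
video = [0.2, -0.1, 0.4]   # toy video embedding (assumed)
music = [0.5, 0.3, -0.2]   # toy music embedding (assumed)
genre = 1                  # index of this track's genre (illustrative)

# Eight uploaders in the batch, each with a preference distribution over 4 genres.
batch_preferences = [softmax([random.gauss(0, 1) for _ in range(4)]) for _ in range(8)]

# Score conditioned on one uploader's preference vs. the batch-averaged score.
biased = matching_score(video, music, genre, batch_preferences[0])
adjusted = deconfounded_score(video, music, genre, batch_preferences)

# Teacher (PGC) and student (UGC) logits over three candidate tracks (toy values).
teacher_probs = softmax([2.0, 0.5, -1.0])
student_probs = softmax([1.0, 0.8, -0.5])
transfer_loss = kl_divergence(teacher_probs, student_probs)

print(f"single-uploader score: {biased:.3f}")
print(f"batch-averaged score:  {adjusted:.3f}")
print(f"KL transfer loss:      {transfer_loss:.3f}")
```

In a real training loop the KL term would typically be added to the matching loss of the student network, and the averaging would run over learned confounder representations rather than random toy vectors. Both details depend on the model design and are left open here.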