Human Pose Driven Object Effects Recommendation

Zhaoxin Fan, Fengxin Li, Hongyan Liu, Jun He, Xiaoyong Du
DOI: https://doi.org/10.48550/arXiv.2209.08353
2022-09-17
Abstract: In this paper, we research the new topic of object effects recommendation in micro-video platforms, which is a challenging but important task for many practical applications such as advertisement insertion. To avoid the problem of introducing background bias caused by directly learning video content from image frames, we propose to utilize the meaningful body language hidden in 3D human pose for recommendation. To this end, in this work, a novel human pose driven object effects recommendation network termed PoseRec is introduced. PoseRec leverages the advantages of 3D human pose detection and learns information from multi-frame 3D human pose for video-item registration, resulting in high quality object effects recommendation performance. Moreover, to solve the inherent ambiguity and sparsity issues that exist in object effects recommendation, we further propose a novel item-aware implicit prototype learning module and a novel pose-aware transductive hard-negative mining module to better learn pose-item relationships. What's more, to benchmark methods for the new research topic, we build a new dataset for object effects recommendation named Pose-OBE. Extensive experiments on Pose-OBE demonstrate that our method can achieve superior performance than strong baselines.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses object effects recommendation on short-video platforms. Given a short video and a set of candidate object effects, the goal is to score and rank the candidates according to the video content, so that the most suitable items are recommended. This supports adding advertisements or special effects intelligently during video editing.

### Main problems

1. **Background bias**: Learning content directly from video frames introduces background bias, because deep learning models tend to capture background information rather than human behavior.
2. **Inherent ambiguity and sparsity**: Associating human poses with items admits multiple valid answers (ambiguity), and positive and negative samples are hard to tell apart (sparsity).

### Solutions

To address these problems, the authors propose the following methods:

#### 1. **PoseRec network**

- **3D human pose driven**: Use 3D human pose to extract the core content of the video instead of learning features from entire frames. The pose in each frame is extracted with BlazePose, the pose sequence is represented as a spatio-temporal graph, and a graph convolutional network (GCN) learns high-level video content from it.
- **Feature mapping**: Map the video and item features to a shared feature space and compute their similarity to obtain the recommendation score.

#### 2. **Item-aware implicit prototype learning module**

- **Solve ambiguity**: An implicit prototype learning module clusters items into prototype groups, so that items with similar characteristics can be correctly associated. Each item is mapped to the prototype space, and the recommendation score is computed according to the item's relevance to each prototype.

#### 3. **Pose-aware transductive hard-negative mining module**

- **Solve sparsity**: 3D human pose information is used to select negative samples dynamically, which avoids treating the positive samples of other videos as negatives. Specifically, the similarity between videos is computed, and negative samples are mined only from dissimilar videos.

### Dataset

To evaluate this task, the authors construct a new dataset, Pose-OBE, which contains 212 short videos and 1,087 object effects. Each video is annotated by professionals with its most suitable special-effect item.

### Experimental results

The experiments show that PoseRec significantly outperforms multiple baseline methods on both instance recommendation and category recommendation.

### Formula summary

- **Feature mapping formula**, where \(g_v\) and \(s_i\) denote the video and item features:
  \[ e_v = W_1 g_v + b_1 \]
  \[ e_i = W_2 s_i + b_2 \]
- **Recommendation score calculation formula**:
  \[ y_{i,v} = \frac{e_i \cdot e_v}{\lVert e_i \rVert \, \lVert e_v \rVert} \]
- **Prototype contribution calculation formula**:
  \[ \omega'_{i,k} = \text{sim}(e_{i,c}, r_k), \quad \omega_{i,:} = \text{softmax}(\omega'_{i,:}) \]
- **Final recommendation score formula**:
  \[ y_{i,v} = \sum_{k = 1}^{K} \omega_{i,k} \cdot \text{sim}(e_i^{(k)}, e_v^{(k)}) \]

With these methods, the authors address the key challenges of object effects recommendation on short-video platforms and demonstrate the potential of 3D human pose for recommendation.
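
The formulas summarized above can be illustrated with a small NumPy sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the embedding sizes, the random weights, and the 0.5 similarity threshold are hypothetical. `sim` is assumed to be cosine similarity, `e_ic` stands for \(e_{i,c}\), and the per-prototype views \(e_i^{(k)}\), \(e_v^{(k)}\) are passed in as precomputed vectors because the summary does not say how they are produced. The negative-selection function shows only the "skip similar videos" filter described in the text.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors (eps guards against zero norms)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def project(W, b, x):
    """Linear map into the shared space: e = W x + b."""
    return W @ x + b

def base_score(e_i, e_v):
    """y_{i,v}: cosine similarity between item and video embeddings."""
    return cosine(e_i, e_v)

def prototype_score(e_ic, prototypes, item_views, video_views):
    """y_{i,v} = sum_k w_k * sim(e_i^(k), e_v^(k)), with w = softmax_k(sim(e_ic, r_k)).

    Returns the score and the prototype weights.
    """
    raw = np.array([cosine(e_ic, r) for r in prototypes])
    weights = softmax(raw)
    sims = np.array([cosine(a, b) for a, b in zip(item_views, video_views)])
    return float(weights @ sims), weights

def mine_negatives(video_feats, anchor, positives, threshold=0.5):
    """Collect negative items only from videos dissimilar to the anchor video.

    positives[j] is the item id annotated for video j (hypothetical layout).
    The threshold is an illustrative assumption, not a value from the paper.
    """
    negatives = []
    for j, feat in enumerate(video_feats):
        if j == anchor:
            continue
        # Similar videos may share suitable items, so their positives are skipped.
        if cosine(video_feats[anchor], feat) < threshold:
            negatives.append(positives[j])
    return negatives

# Example with random features (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
g_v, s_i = rng.normal(size=16), rng.normal(size=12)
W1, b1 = rng.normal(size=(8, 16)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 12)), np.zeros(8)
e_v, e_i = project(W1, b1, g_v), project(W2, b2, s_i)
print("base score:", base_score(e_i, e_v))
```

Passing the per-prototype views in as arguments keeps the sketch independent of how the paper actually builds them; in a trained model they would come from learned layers rather than from fixed vectors.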