Abstract: In this paper, we research the new topic of object effects recommendation in micro-video platforms, which is a challenging but important task for many practical applications such as advertisement insertion. To avoid the background bias introduced by directly learning video content from image frames, we propose to utilize the meaningful body language hidden in 3D human pose for recommendation. To this end, a novel human pose driven object effects recommendation network termed PoseRec is introduced. PoseRec leverages the advantages of 3D human pose detection and learns information from multi-frame 3D human pose for video-item registration, resulting in high-quality object effects recommendation performance. Moreover, to solve the inherent ambiguity and sparsity issues that exist in object effects recommendation, we further propose a novel item-aware implicit prototype learning module and a novel pose-aware transductive hard-negative mining module to better learn pose-item relationships. What's more, to benchmark methods for the new research topic, we build a new dataset for object effects recommendation named Pose-OBE. Extensive experiments on Pose-OBE demonstrate that our method can achieve performance superior to strong baselines.

What problem does this paper attempt to address?

This paper addresses the task of object effects recommendation on micro-video platforms. Given a micro-video and a set of candidate object effects, the goal is to score and rank the candidates according to the video content, so that the items most suitable for the video are recommended. This supports intelligently inserting advertisements or special effects during video editing.

### Main problems

1. **Background bias**: Learning content directly from video frames introduces background bias, because deep-learning models tend to capture background information rather than human behavior.
2. **Inherent ambiguity and sparsity**: When associating human poses with items, a pose can plausibly match multiple items, and it is difficult to distinguish positive samples from negative ones.

### Solutions

To solve the above problems, the authors propose the following methods:

#### 1. **PoseRec network**

- **3D human pose driven**: Use 3D human pose to extract the core content of the video instead of learning features directly from the entire video. The pose of each frame is extracted with BlazePose and represented as a spatio-temporal graph, and a graph convolutional network (GCN) then learns high-level video content.
- **Feature mapping**: Map the video and item features to a shared feature space, and compute their similarity to obtain the recommendation score.

#### 2. **Item-aware implicit prototype learning module**

- **Solve ambiguity**: An implicit prototype learning module clusters items into prototype groups, so that items with similar characteristics can be correctly associated. Each item is mapped to the prototype space, and the recommendation score is computed according to the item's relevance to each prototype.

#### 3. **Pose-aware transductive hard-negative mining module**

- **Solve sparsity**: Use 3D human pose information to select negative samples dynamically, so that positive samples of other videos are not mistaken for negative samples. Specifically, compute the similarity between different videos and mine negative samples only from dissimilar videos.

### Dataset

To evaluate this task, the authors construct a new dataset, Pose-OBE, which contains 212 micro-videos and 1,087 object effects. Each video is annotated by professionals with the most suitable special-effect item.

### Experimental results

The experimental results show that PoseRec significantly outperforms multiple baseline methods, especially in instance recommendation and category recommendation.

### Formula summary

- **Feature mapping formula**:
  \[ e_v = W_1 g_v + b_1 \]
  \[ e_i = W_2 s_i + b_2 \]
- **Recommendation score calculation formula**:
  \[ y_{i,v} = \frac{e_i \cdot e_v}{\|e_i\| \, \|e_v\|} \]
- **Prototype contribution calculation formula**:
  \[ \omega'_{i,k} = \text{sim}(e_{i,c}, r_k), \quad \omega_{i,:} = \text{softmax}(\omega'_{i,:}) \]
- **Final recommendation score formula**:
  \[ y_{i,v} = \sum_{k=1}^{K} \omega_{i,k} \cdot \text{sim}(e_i^{(k)}, e_v^{(k)}) \]

With these methods, the authors address the key challenges of object effects recommendation on micro-video platforms and demonstrate the potential of 3D human pose for personalized recommendation.
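
To make the scoring formulas and the negative-selection idea in the summary above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the list-based vector layout, and the `threshold` cutoff are all hypothetical. `sim` is assumed to be cosine similarity (consistent with the basic score formula), and `item_ctx` stands in for \(e_{i,c}\), which the summary does not define further.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def linear(W, x, b):
    """Affine map W x + b, with W given as a list of rows (e.g. e_v = W_1 g_v + b_1)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_score(item_proj, video_proj, item_ctx, prototypes):
    """Prototype-weighted score: y = sum_k w_k * sim(e_i^(k), e_v^(k)).

    item_proj / video_proj: K per-prototype embeddings of the item and the video.
    item_ctx: assumed stand-in for e_{i,c}; prototypes: the K vectors r_k.
    The weights are w = softmax_k sim(item_ctx, r_k).
    """
    weights = softmax([cosine(item_ctx, r) for r in prototypes])
    sims = [cosine(ei, ev) for ei, ev in zip(item_proj, video_proj)]
    return sum(w * s for w, s in zip(weights, sims))


def candidate_negatives(anchor_feat, others, threshold=0.5):
    """Keep only items whose source video is dissimilar to the anchor video.

    others: list of (video_feature, positive_item_id) pairs from other videos.
    threshold: hypothetical similarity cutoff, not a value from the paper.
    """
    return [item for feat, item in others if cosine(anchor_feat, feat) < threshold]


if __name__ == "__main__":
    # Basic score: map both sides into a shared space, then take cosine similarity.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    e_v = linear(identity, [2.0, 3.0], [0.0, 0.0])
    e_i = linear(identity, [4.0, 6.0], [0.0, 0.0])
    print(cosine(e_i, e_v))
    # Negative selection: items from videos similar to the anchor are skipped.
    print(candidate_negatives([1.0, 0.0], [([1.0, 0.0], "a"), ([0.0, 1.0], "b")]))
```

Filtering negatives by video similarity reflects the stated motivation: an item labeled positive for a similar video may also suit the anchor video, so treating it as a negative would inject label noise.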

Human Pose Driven Object Effects Recommendation

PoseRec: 3D Human Pose Driven Online Advertisement Recommendation for Micro-videos

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation

An Effective 3D Human Pose Estimation Method Based on Dilated Convolutions for Videos.

Deep Dual Consecutive Network for Human Pose Estimation

Towards Fine-Grained Human Pose Transfer With Detail Replenishing Network

3D Human pose estimation from video via multi-scale multi-level spatial temporal features

Motion Capture Research: 3D Human Pose Recovery Based on RGB Video Sequences

APP: Adaptive Pose Pooling for 3D Human Pose Estimation from Videos

Human Motion Transfer from Poses in the Wild

Towards Accurate Human Pose Estimation in Videos of Crowded Scenes

Recent Advances in 3D Human Pose Estimation: From Optimization to Implementation and Beyond

AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation

Hybrid 3D Human Pose Estimation with Monocular Video and Sparse IMUs

3d Pose Detection Of Closely Interactive Humans Using Multi-View Cameras

Relation-Based Associative Joint Location for Human Pose Estimation in Videos

Robust 3D Human Pose Estimation from Single Images or Video Sequences

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

Kinematics-based 3D Human-Object Interaction Reconstruction from Single View