Abstract: The automatic generation of anchor-style product promotion videos presents promising opportunities in online commerce, advertising, and consumer engagement. However, this remains a challenging task despite significant advancements in pose-guided human video generation. In addressing this challenge, we identify the integration of human-object interactions (HOI) into pose-guided human video generation as a core issue. To this end, we introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object, achieving high visual fidelity and controllable interactions. Specifically, we propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives and disentangles object and human appearance, and the HOI-motion injection, which enables complex human-object interactions by overcoming challenges in object trajectory conditioning and inter-occlusion management. Additionally, we introduce the HOI-region reweighting loss, a training objective that enhances the learning of object details. Extensive experiments demonstrate that our proposed system outperforms existing methods in preserving object appearance and shape awareness, while simultaneously maintaining consistency in human appearance and motion. Project page: https://cangcz.github.io/Anchor-Crafter/

What problem does this paper attempt to address?

The paper addresses the technical challenges of generating high-quality, realistic anchor-style product promotion videos. Although pose-guided human video generation has made significant progress, existing methods still cannot handle human-object interactions (HOI) effectively. Specifically, they have the following limitations:

1. **Lack of object perception**: Most methods treat objects as static textures and cannot generate appropriate interaction actions.
2. **Inability to control object trajectories**: Existing methods have difficulty understanding and controlling the motion trajectories of objects.
3. **Complex interaction management**: Handling the interplay between human motion and objects is difficult, especially under object occlusion and across multiple views.

To solve these problems, the paper proposes **AnchorCrafter**, a diffusion-based system that generates high-quality anchor-style product promotion videos by integrating human-object interactions (HOI). Its main innovations are:

1. **HOI-appearance perception**:
   - Improve the accuracy of object appearance through multi-view feature fusion and a decoupling network structure.
   - Use CLIP features to extract global representations and strengthen the separation of object and human appearance.
2. **HOI-motion injection**:
   - Use depth maps and 3D hand mesh inputs to control object trajectories precisely.
   - Reduce interaction artifacts through occlusion-handling strategies, enabling complex interaction management.
3. **HOI-region reweighting loss**:
   - During training, enhance the learning of object details by emphasizing the hand-object interaction regions.

Through these innovations, AnchorCrafter generates high-fidelity videos that preserve the appearance and shape of the object while keeping human appearance and motion consistent. Experimental results show that AnchorCrafter outperforms existing methods on multiple evaluation metrics.
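To make the idea behind the HOI-region reweighting loss more concrete, the following is a minimal sketch of a region-weighted mean squared error, assuming the hand-object interaction region is available as a binary mask. The function name `region_reweighted_mse`, the `region_weight` parameter and its default value, and the normalization by the sum of weights are illustrative assumptions; the summary above does not specify the paper's exact formulation.

```python
import numpy as np


def region_reweighted_mse(pred, target, region_mask, region_weight=2.0):
    """Mean squared error with extra weight on a region of interest.

    pred, target:  arrays of the same shape, e.g. predicted and true noise.
    region_mask:   same shape; 1 inside the hand-object region, 0 elsewhere.
    region_weight: hypothetical scalar > 1 (an assumption, not a value
                   taken from the paper).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mask = np.asarray(region_mask, dtype=np.float64)
    if not (pred.shape == target.shape == mask.shape):
        raise ValueError("pred, target and region_mask must share a shape")

    # Weight is 1 outside the region and region_weight inside it.
    weights = 1.0 + (region_weight - 1.0) * mask
    squared_error = (pred - target) ** 2

    # Normalize by the total weight so the loss scale stays comparable
    # to a plain MSE regardless of how large the region is.
    return float((weights * squared_error).sum() / weights.sum())


if __name__ == "__main__":
    pred = np.zeros((2, 2))
    target = np.array([[1.0, 0.0], [0.0, 0.0]])  # error only at pixel (0, 0)
    mask = np.array([[1, 0], [0, 0]])            # pixel (0, 0) is in the region
    print(region_reweighted_mse(pred, target, mask, region_weight=3.0))
```

With an all-zero mask the function reduces to an ordinary MSE, so an error inside the masked region contributes more to the loss than the same error elsewhere, which is the emphasis the summary describes.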

AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation

OAW-GAN: Occlusion-Aware Warping GAN for Unified Human Video Synthesis

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control

Compositional 3D Human-Object Neural Animation

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Automatic Generation of Interactive Nonlinear Video for Online Apparel Shopping Navigation

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

DreaMoving: A Human Video Generation Framework based on Diffusion Models

Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations

FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability

MotionBooth: Motion-Aware Customized Text-to-Video Generation

AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation

CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention

Action2video: Generating Videos of Human 3D Actions

Interactive Humanoid: Online Full-Body Motion Reaction Synthesis with Social Affordance Canonicalization and Forecasting

Do as I Do: Pose Guided Human Motion Copy

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

AnimateAnything: Consistent and Controllable Animation for Video Generation