HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models

CVPR 2025 Workshop of HuMoGen

Xiaogang Peng¹*, Yiming Xie¹*, Zizhao Wu², Varun Jampani³, Deqing Sun⁴, Huaizu Jiang¹

¹Northeastern University, ²Hangzhou Dianzi University, ³Stability AI, ⁴Google Research
* indicates equal contribution

HOI-Diff can generate realistic motions for 3D human-object interactions given a text prompt and object geometry.

Abstract

We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. To this end, we adopt a modular design and decompose the complex task into simpler sub-tasks. We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text, and encourage coherent motions through a cross-attention communication module between the human and object motion generation branches. We also develop an affordance prediction diffusion model (APDM) to predict the contact area between the human and the object during the interaction described by the textual prompt. The APDM is independent of the HOI-DM's results and can thus correct potential errors made by the latter. Moreover, it generates the contact points stochastically, which diversifies the generated motions. Finally, we incorporate the estimated contact points into classifier guidance to achieve accurate and close contact between humans and objects. To train and evaluate our approach, we annotate the BEHAVE dataset with text descriptions. Experimental results on BEHAVE and OMOMO demonstrate that our approach produces realistic HOIs with various interactions and different types of objects.
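The cross-attention communication between the human and object branches mentioned above can be pictured with a minimal NumPy sketch. It is not the HOI-DM architecture: the single attention head, the residual update, the feature shapes, the random weights, and the names `cross_attention` and `communicate` are all assumptions made for illustration. It only shows the general idea that each branch queries the other branch's features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head scaled dot-product attention (hypothetical layer):
    # `queries` come from one branch, keys/values from the other branch.
    q = queries @ w_q                      # (T_q, d)
    k = context @ w_k                      # (T_c, d)
    v = context @ w_v                      # (T_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v, weights

def communicate(human_feat, object_feat, params):
    # Each branch attends to the other; the residual form is an assumption.
    h_update, _ = cross_attention(human_feat, object_feat, *params["h_from_o"])
    o_update, _ = cross_attention(object_feat, human_feat, *params["o_from_h"])
    return human_feat + h_update, object_feat + o_update

rng = np.random.default_rng(0)
T, d = 8, 16                               # frames and feature width (illustrative)
human_feat = rng.normal(size=(T, d))
object_feat = rng.normal(size=(T, d))
params = {
    name: tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    for name in ("h_from_o", "o_from_h")
}
human_out, object_out = communicate(human_feat, object_feat, params)
print(human_out.shape, object_out.shape)
```

In this sketch the two branches keep separate weights, so the human features are updated from object features and vice versa without merging the branches into one network.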

Method

Our key insight is to decompose the generation task into three modules: (a) coarse 3D HOI generation using a dual-branch diffusion model (HOI-DM), (b) an affordance prediction diffusion model (APDM) that estimates the contact points between humans and objects, and (c) affordance-guided interaction correction, which incorporates the estimated contact information and employs classifier guidance to achieve accurate and close contact between humans and objects, forming coherent HOIs.
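As a rough illustration of step (c), the sketch below nudges one human joint toward one estimated object contact point by following the gradient of a squared-distance objective. This is a simplified stand-in, not the paper's guidance procedure: the objective, the step size `scale`, the choice to move only the human joints, the array shapes, and the function names are all assumptions. In an actual diffusion sampler, a gradient of this kind would typically be applied to the predicted mean at each denoising step rather than to a finished motion.

```python
import numpy as np

def contact_loss(human_joints, object_points, joint_idx, point_idx):
    # Sum over frames of the squared distance between the chosen
    # human joint and the chosen object contact point.
    diff = human_joints[:, joint_idx] - object_points[:, point_idx]  # (T, 3)
    return float(np.sum(diff ** 2))

def contact_grad(human_joints, object_points, joint_idx, point_idx):
    # Analytic gradient of contact_loss with respect to the human joints;
    # only the contacting joint receives a non-zero gradient.
    grad = np.zeros_like(human_joints)
    grad[:, joint_idx] = 2.0 * (
        human_joints[:, joint_idx] - object_points[:, point_idx]
    )
    return grad

def guided_step(human_joints, object_points, joint_idx, point_idx, scale=0.1):
    # One guidance update: move against the gradient of the contact objective.
    grad = contact_grad(human_joints, object_points, joint_idx, point_idx)
    return human_joints - scale * grad

rng = np.random.default_rng(0)
T, J, P = 4, 22, 10                        # frames, joints, object points (illustrative)
human_joints = rng.normal(size=(T, J, 3))
object_points = rng.normal(size=(T, P, 3))
joint_idx, point_idx = 20, 3               # hypothetical contact pair

loss_before = contact_loss(human_joints, object_points, joint_idx, point_idx)
corrected = human_joints.copy()
for _ in range(20):
    corrected = guided_step(corrected, object_points, joint_idx, point_idx)
loss_after = contact_loss(corrected, object_points, joint_idx, point_idx)
print(loss_before, loss_after)
```

With these settings each update shrinks the joint-to-contact offset by a constant factor, so the contact distance decreases steadily while all other joints are left untouched.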

Results

(Our model can generate interactions with both dynamic and static objects)

Interacting with the Same Object using Different Human Body Parts

A person is hoisting a large box using his left hand.

A person is gripping a large box at the front.

A person is hoisting the large box with his right hand.

Interacting with Dynamic Objects

A person is hoisting a trashbin with his left hand.

The individual is lifting the chairwood with his right hand.

A person holds a toolbox aloft with his right hand.

A person is bouncing on a yogaball with two hands.

A person holds a yogamat in front of him.

The person is lifting the suitcase with his right hand.

Someone is holding a long box with his left hand.

The person is transferring the small table by pushing and pulling it on the ground.

An individual raises the backpack with his right hand.

An individual has a firm hold on the toolbox from the front.

Interacting with Static Objects

A person is perching on a stool.

A person is occupying a yogaball as a seat.

A person is sitting on the chair.

Someone is seated on a square table.

Comparison with Other Methods

Previous text-to-motion approaches primarily concentrated on generating human motions in isolation. We therefore adapt two such methods, MDM [Tevet et al. (2023)] and PriorMDM [Shafir et al. (2023)], as baselines for comparative evaluation on human-object interactions.

A person is gripping a large box at the front.

MDM*

PriorMDM*

Ours

A person is sitting on the chair.

MDM*

PriorMDM*

Ours

A person is perching on a stool.

MDM*

PriorMDM*

Ours

A person holds a yogamat in front of him.

MDM*

PriorMDM*

Ours

Visual Results of Affordance Estimation

We provide visual results of the estimated contact points. Our APDM, trained on the BEHAVE dataset, can accurately estimate contact positions on objects from textual descriptions. It also generalizes to unseen objects in the OMOMO dataset, as demonstrated in the last row.

Ablation Studies

Prompt: A person is gripping a large box at the front.

W/o CM & Interaction Correction

Without the Communication Module (CM) and Interaction Correction, the model cannot effectively learn the spatial relations between humans and objects.

W/o Interaction Correction

Without Interaction Correction, the generated HOIs fail to achieve accurate and close human-object contact, yielding unrealistic and incoherent interactions.

Full Method

The visualization shows that our complete model generates more realistic and accurate human-object interactions than the ablated variants.

Related Works

(ICCV 2023) InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion first addressed the task of predicting human-object interactions from past observations. It leverages a diffusion model to encode the distribution of future human-object interactions and introduces a physics-informed predictor to correct denoised HOIs in a diffusion step.

(SIGGRAPH Asia 2023) Object Motion Guided Human Motion Synthesis proposes a conditional diffusion framework that can generate full-body manipulation behaviors from only the object motion. It learns two separate denoising processes to first predict hand positions from object motion and subsequently synthesize full-body poses based on the predicted hand positions.

Concurrent Works

(CVPR 2024) CG-HOI: Contact-Guided 3D Human-Object Interaction Generation learns to model human motion, object motion, and contact in a joint diffusion process, inter-correlated through cross-attention.

(ECCV 2024) Controllable Human-Object Interaction Synthesis generates object motion and human motion simultaneously using a conditional diffusion model given a language description, initial object and human states, and sparse object waypoints.

(arXiv 2023) PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction focuses on interactions with dynamic objects and proposes a physics-driven whole-body approach for human-object interaction imitation without task-specific reward designs.

BibTeX

@article{peng2023hoi,
  title={HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models},
  author={Peng, Xiaogang and Xie, Yiming and Wu, Zizhao and Jampani, Varun and Sun, Deqing and Jiang, Huaizu},
  journal={arXiv preprint arXiv:2312.06553},
  year={2023}
}