OmniControl: Control Any Joint at Any Time for Human Motion Generation

Yiming Xie¹, Varun Jampani², Lei Zhong¹, Deqing Sun³, Huaizu Jiang¹

¹Northeastern University, ²Stability AI, ³Google Research

OmniControl can generate realistic human motions given a text prompt and flexible spatial control signals.

Darker colors indicate later frames in the sequence. The green line or points indicate the input control signals.

Abstract

We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model based on the diffusion process. Unlike previous methods that can only control the pelvis trajectory, OmniControl can incorporate flexible spatial control signals over different joints at different times with only one model. Specifically, we propose analytic spatial guidance that ensures the generated motion can tightly conform to the input control signals. At the same time, realism guidance is introduced to refine all the joints to generate more coherent motion. Both the spatial and realism guidance are essential and they are highly complementary for balancing control accuracy and motion realism. By combining them, OmniControl generates motions that are realistic, coherent, and consistent with the spatial constraints. Experiments on HumanML3D and KIT-ML datasets show that OmniControl not only achieves significant improvement over state-of-the-art methods on pelvis control but also shows promising results when incorporating the constraints over other joints.

Method

Our model generates human motions from a text prompt and a spatial control signal. At each denoising diffusion step, the model takes the text prompt and a noised motion sequence as input and estimates the clean motion. To incorporate flexible spatial control signals into the generation process, a hybrid guidance, consisting of realism and spatial guidance, is used to encourage motions to conform to the control signals while remaining realistic.
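The spatial-guidance idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes the motion is already expressed as global joint positions, uses a summed squared-distance error so the gradient has a closed form, and every name and parameter (`control_error`, `spatial_guidance`, `step_size`, `num_iters`, the 22-joint skeleton) is hypothetical. Realism guidance is a learned component and is not sketched here.

```python
import numpy as np


def control_error(joints, target, mask):
    """Summed squared distance between controlled joints and their targets.

    joints, target: (num_frames, num_joints, 3) global positions (assumption).
    mask: (num_frames, num_joints) bool, True where a control signal exists.
    """
    sq_dist = ((joints - target) ** 2).sum(axis=-1)
    return float((sq_dist * mask).sum())


def control_error_grad(joints, target, mask):
    """Closed-form gradient of control_error with respect to joints."""
    return 2.0 * mask[..., None] * (joints - target)


def spatial_guidance(joints, target, mask, step_size=0.1, num_iters=20):
    """Nudge only the controlled joints toward their targets by gradient descent.

    step_size and num_iters are illustrative values, not taken from the paper.
    """
    guided = joints.copy()
    for _ in range(num_iters):
        guided = guided - step_size * control_error_grad(guided, target, mask)
    return guided


# Example: control the joint at index 0 on two frames of a 4-frame motion.
rng = np.random.default_rng(0)
joints = rng.normal(size=(4, 22, 3))   # 22 joints is an assumption
target = np.zeros_like(joints)
mask = np.zeros((4, 22), dtype=bool)
mask[[0, 3], 0] = True
guided = spatial_guidance(joints, target, mask)
```

In this sketch only the masked entries move, which mirrors the observation in the ablation below that spatial guidance alone does not adjust the remaining joints.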

Results

(all results are generated using one model)

Dense signals on pelvis

Sparse signals on pelvis

Dense/Sparse signals on left/right wrist

Dense/Sparse signals on head

Dense/Sparse signals on left/right foot

Control multiple joints
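One way to picture the signal types listed in this section is as a single target array plus a boolean mask over frames and joints. The sketch below is only an illustration under that assumption: the joint count, the joint indices in `JOINT_INDEX`, the coordinates, and the helper names are all hypothetical and are not taken from the OmniControl code.

```python
import numpy as np

NUM_JOINTS = 22  # assumption about the skeleton size
# Hypothetical joint indices, for illustration only.
JOINT_INDEX = {"pelvis": 0, "head": 15, "left_wrist": 20}


def empty_signal(num_frames):
    """Return a zero target array and an all-False mask."""
    target = np.zeros((num_frames, NUM_JOINTS, 3))
    mask = np.zeros((num_frames, NUM_JOINTS), dtype=bool)
    return target, mask


def add_control(target, mask, joint, frames, positions):
    """Mark `joint` as controlled on `frames` with the given 3D positions."""
    j = JOINT_INDEX[joint]
    frames = np.asarray(frames, dtype=int)
    target[frames, j] = np.asarray(positions, dtype=float)
    mask[frames, j] = True


def density(mask, joint):
    """Fraction of frames on which `joint` carries a control signal."""
    return float(mask[:, JOINT_INDEX[joint]].mean())


num_frames = 60
target, mask = empty_signal(num_frames)

# Dense signal: the pelvis is controlled on every frame along a straight line.
xs = np.linspace(0.0, 3.0, num_frames)
pelvis_path = np.stack(
    [xs, np.full(num_frames, 0.9), np.zeros(num_frames)], axis=1
)
add_control(target, mask, "pelvis", np.arange(num_frames), pelvis_path)

# Sparse signal: the head is controlled on two keyframes only.
add_control(target, mask, "head", [10, 50], [[0.5, 1.6, 0.0], [2.5, 1.2, 0.0]])
```

Under this representation, controlling multiple joints is just a mask with more than one active joint column, so a single model can consume dense, sparse, and multi-joint signals in the same format.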

Comparing to other methods on pelvis control

Previous methods mainly focus on incorporating global spatial control signals on the pelvis, so we compare with GMD [Karunratanakul et al. (2023)] on pelvis control only. We also show results without the spatial control signal for reference.

In the following case, two spatial points are given as constraints. Both are close to the ground, so the human needs to crouch to reach them. The motion generated by GMD fails to do so.

Compared with the results without spatial constraints, controllable motion generation produces motions that follow the given trajectory. The upper-body motion of our method is also more consistent with the input text prompt.

More comparison examples:

Ablation studies

Text prompt: a person plays a violin with their left hand in the air and their right hand holding the bow.

W/o realism guidance

Without realism guidance, the model cannot amend the rest of the joints, yielding unrealistic and incoherent motions.

W/o spatial guidance

Without spatial guidance, the generated motion cannot tightly follow the spatial constraints.

Gradient w.r.t. x_t

We report the performance of a variant that computes the gradient in spatial guidance w.r.t. the input noised motion x_t.
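The relation between the two gradients can be written out with the chain rule. The notation here is assumed rather than taken from this page: $\mu_t$ is the model's estimate computed from the noised motion $x_t$, $c$ is the control signal, and $G$ is a scalar error measuring how far the controlled joints are from $c$.

```latex
\nabla_{x_t} G(\mu_t, c)
  = \left( \frac{\partial \mu_t}{\partial x_t} \right)^{\!\top}
    \nabla_{\mu_t} G(\mu_t, c)
```

Under these assumptions, taking the gradient w.r.t. $x_t$ routes the control error through the Jacobian of the denoising network, whereas a gradient w.r.t. $\mu_t$ acts on the estimated motion directly.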

Full model

The visualization shows that our full model is much more effective in terms of both realism and control accuracy.

Downstream application

Our method can be combined with affordance-learning methods that generate contact areas (spatial control signals) to guide human motion generation with surrounding objects or the environment. Note that our model does not generate the spatial control signals itself; for now, we specify them manually based on the position of objects in the scene. All these 3D object models are licensed under the Creative Commons Attribution 4.0 International License.

To sit exactly in the chair, we control the pelvis position.
The chair model is from here.

To keep the left hand on the armrest and avoid the ceiling fan, we control both head and left wrist positions. The ceiling-fan model is from here. The handrail model is from here.

To exactly touch the cup, we control the left wrist position.
The cup model is from here.

To exactly kick the ball, we control the left foot position.
The football is from here.

With extremely unnatural spatial control signals

We show the model's tolerance for inherently unnatural guidance.

The first case is teleportation. We assign two distant control signals to adjacent frames. The human runs to the second point quickly and naturally.

The second case is walking forward along a spiral line. The human can follow the spiral even though this kind of spatial control signal is extremely rare.

Condition on only one modality (text or spatial control signal only)

We show the results with only one modality as the condition. In this case, the spatial control signal is used on the pelvis while the text is used to describe the hand activity. These results show that the model can effectively comprehend and follow instructions in one modality.

BibTeX

@inproceedings{xie2024omnicontrol,
      title={OmniControl: Control Any Joint at Any Time for Human Motion Generation},
      author={Yiming Xie and Varun Jampani and Lei Zhong and Deqing Sun and Huaizu Jiang},
      booktitle={The Twelfth International Conference on Learning Representations},
      year={2024}
}