Rethinking Diffusion for Text-Driven Human Motion Generation

Northeastern University

Abstract

Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly dominated human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations: representing continuous motion data with a limited set of discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous-space generation nature of diffusion-based methods makes them well-suited to address these limitations, with additional potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model that performs bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we propose more robust evaluation methods to fairly assess methods based on different generation paradigms. Extensive experiments on benchmark human motion generation datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art performance.
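
To make the idea of bidirectional masked autoregression in continuous space more concrete, below is a minimal, hypothetical sketch. It is not the released implementation: all module names, shapes (e.g., a 263-dimensional motion feature per frame), the round schedule, and the single-call denoising step are simplifying assumptions. The point it illustrates is that frames are filled in over several rounds, each round attending bidirectionally to the frames generated so far and to the text condition, entirely in continuous space rather than over discrete tokens.

    # Minimal, illustrative sketch (not the authors' released code) of bidirectional
    # masked autoregressive generation in continuous space with a diffusion-style
    # denoiser. All names, dimensions, and schedules below are assumptions.
    import torch
    import torch.nn as nn


    class MaskedMotionDenoiser(nn.Module):
        """Toy bidirectional transformer that predicts clean motion frames for
        masked positions, conditioned on a text embedding (stand-in component)."""

        def __init__(self, motion_dim=263, text_dim=512, width=256, layers=4):
            super().__init__()
            self.in_proj = nn.Linear(motion_dim, width)
            self.text_proj = nn.Linear(text_dim, width)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=width, nhead=4, batch_first=True)
            self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=layers)
            self.out_proj = nn.Linear(width, motion_dim)

        def forward(self, noisy_motion, text_emb):
            # Prepend the text condition as an extra token; bidirectional attention
            # lets every frame attend to it and to all other frames.
            h = torch.cat([self.text_proj(text_emb).unsqueeze(1),
                           self.in_proj(noisy_motion)], dim=1)
            h = self.backbone(h)
            return self.out_proj(h[:, 1:])  # drop the text token


    @torch.no_grad()
    def generate(model, text_emb, num_frames=60, motion_dim=263, rounds=4):
        """Bidirectional masked autoregression: over several rounds, pick a subset
        of still-masked frames and fill them in continuous space, conditioning on
        the frames generated in earlier rounds."""
        motion = torch.zeros(1, num_frames, motion_dim)
        known = torch.zeros(num_frames, dtype=torch.bool)  # which frames are filled
        for r in range(rounds):
            remaining = (~known).nonzero(as_tuple=True)[0]
            if remaining.numel() == 0:
                break
            # Reveal a fraction of the remaining frames each round (a uniform split
            # here for brevity; practical schedules are often cosine-shaped).
            k = max(1, remaining.numel() // (rounds - r))
            chosen = remaining[torch.randperm(remaining.numel())[:k]]
            # Masked positions start from Gaussian noise; known frames stay fixed,
            # which is what lets the model serve as a prior for temporal editing.
            noisy = motion.clone()
            noisy[:, chosen] = torch.randn(1, k, motion_dim)
            pred = model(noisy, text_emb)  # one denoising call; real models iterate
            motion[:, chosen] = pred[:, chosen]
            known[chosen] = True
        return motion


    if __name__ == "__main__":
        model = MaskedMotionDenoiser()
        text_emb = torch.randn(1, 512)  # placeholder for a real text encoder output
        print(generate(model, text_emb).shape)  # torch.Size([1, 60, 263])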

Architecture Overview


Results Gallery

(Our method is capable of generating high-quality 3D human motions that follow textual instructions)


A person waves with both arms above head.

A man slowly walking forward.

The toon is standing, swaying a bit, then raising their left wrist as to check the time on a watch.


A man walks forward before stumbling backwards and then continues walking forward.

The person fell down and is crawling away from someone.

The sim reaches to their left and right, grabbing an object and appearing to clean it.

The man takes 4 steps backwards.

She jumps up and down, kicking her heels in the air.

A person who is standing lifts his hands and claps them four times.

A person who is running, stops, bends over, and looks down while taking small steps, then resumes running.

A person walks slowly forward holding handrail with left hand.

The person kicks his left foot up and both hands up in a counterclockwise circle and stops.

A person steps to the left sideways.

A person is walking across a narrow beam.

A person does a drumming movement with both hands.


Comparison to Other Methods

(Our method generates motion that is more realistic and more accurately follows the fine details of the textual condition)

We visually compare our method with three SOTA baseline methods: T2M-GPT [Zhang et al. (2023)], ReMoDiffuse [Zhang et al. (2023)], and MoMask [Guo et al. (2024)]. For a fair comparison, we show ReMoDiffuse animations both from the raw generation and after their additional temporal-filtering smoothing post-process.







A man steps forward, swings his leg, and turns all the way around.







ReMoDiffuse with Temporal Filter Postprocess

Ours

ReMoDiffuse

T2M-GPT

MoMask









A person doing a forward kick with each leg.







ReMoDiffuse with Temporal Filter Postprocess

Ours

ReMoDiffuse

T2M-GPT

MoMask








A man walks forward and then trips towards the right.







ReMoDiffuse with Temporal Filter Postprocess

Ours

ReMoDiffuse

T2M-GPT

MoMask







A person walks forward, stepping up with their right leg and down with their left, then turns to their left and walks, then turns to their left and starts stepping up.






ReMoDiffuse with Temporal Filter Postprocess

Ours

ReMoDiffuse

T2M-GPT

MoMask








A person fastly swimming forward.








ReMoDiffuse with Temporal Filter Postprocess

Ours

ReMoDiffuse

T2M-GPT

MoMask


Temporal Editing


Our method can be applied beyond standard text-to-motion generation to temporal editing. Here we present temporal editing results (prefix, in-between, and suffix) produced by our method. The input motion clips are rendered without coloring (greyscale), and the edited content is shown in full color. A minimal sketch of how such editing masks can be constructed follows the examples below.

Prefix

Original: "A person walks in a circular counterclockwise direction one time before returning back to his/her original position."
+ Prefix: "A person dances around."
Original: "The man takes 4 steps backwards."
+ Prefix: "The man waves both hands."

In-Between

Original: "The person fell down and is crawling away from someone."
+ In-Between: "The person jumps up and down."

Original: "A person walks ina curved line."
+ In-Between: "The person takes a small jumps."

Suffix

Original: "A person is walking across a narrow beam."
+ Suffix: "A person raises his hands."

Original: "A man rises from the ground, walks in a circle and
- sits back down on the ground."

+ Suffix: "A man starts to run."
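
Because generation fills masked frames conditioned on unmasked ones, temporal editing falls out naturally: the frames of the input clip are treated as already known, and only the chosen span is re-generated under the editing prompt. The sketch below is a hypothetical illustration continuing the assumptions of the earlier example (not the released code); it only shows how prefix, in-between, and suffix edits can be expressed as boolean frame masks.

    # Sketch of temporal-editing masks (assumptions continued from the earlier
    # example): True marks frames to (re)generate, False marks frames kept from
    # the input clip.
    import torch


    def editing_mask(num_frames, mode, span):
        """mode: 'prefix' | 'in_between' | 'suffix'; span: number of edited frames,
        or a (start, end) pair for 'in_between'."""
        mask = torch.zeros(num_frames, dtype=torch.bool)
        if mode == "prefix":
            mask[:span] = True
        elif mode == "suffix":
            mask[-span:] = True
        elif mode == "in_between":
            start, end = span
            mask[start:end] = True
        return mask


    # Usage idea: set known = ~editing_mask(...), copy the input clip into the
    # known frames, then run the same round-based filling loop as above with the
    # editing prompt's text embedding; only the masked span is synthesized.
    mask = editing_mask(196, "in_between", (60, 120))  # 196 frames is an assumed length
    print(mask.sum().item())  # 60 frames will be re-generated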


Generation Diversity

(Our method can generate diverse motions while maintaining high quality)

The person was pushed but did not fall.

A person walks around.

A person jumps up and then lands.


Ablation Studies

Prompt: A person walks in a circular counterclockwise direction one time before returning back to his/her original position.


Full Method

The visualization shows that our full method generates more realistic motion and follows the finer details of the textual instruction.

W/o Autoregressive Modeling

Without autoregressive modeling, the generated motion fails to fully align with the textual instructions.

W/o Motion Representation Reformation

Without the motion representation reformation, the model outputs shaky and inaccurate motion.

BibTeX

@article{meng2024rethinking,
      title={Rethinking Diffusion for Text-Driven Human Motion Generation},
      author={Meng, Zichong and Xie, Yiming and Peng, Xiaogang and Han, Zeyu and Jiang, Huaizu},
      journal={arXiv preprint arXiv:2411.16575},
      year={2024}
    }