Full Method
Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly come to dominate human motion generation, largely surpassing diffusion-based continuous generation methods on standard performance metrics. However, VQ-based methods have inherent limitations: representing continuous motion data with a limited set of discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to serve as motion priors or to provide generation guidance. In contrast, the continuous-space nature of diffusion-based methods makes them well suited to address these limitations and even offers potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and examine the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Building on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and progressively optimize it with techniques inspired by VQ-based approaches. Our approach introduces a human motion diffusion model capable of bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we propose more robust evaluation methods to fairly assess methods built on different generation paradigms. Extensive experiments on benchmark human motion generation datasets demonstrate that our method outperforms previous approaches and achieves state-of-the-art performance.
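To give a rough sense of how bidirectional masked autoregression can be combined with continuous diffusion-based prediction, the sketch below outlines one possible sampling loop: motion-latent positions are decoded over several masked-prediction rounds, and each newly chosen position is filled by a small reverse-diffusion chain conditioned on bidirectional context. This is an illustrative assumption of the general idea, not the released implementation; `backbone`, `diffusion_head.denoise_step`, `text_emb`, and the scheduling details are hypothetical.

```python
import torch

@torch.no_grad()
def generate(backbone, diffusion_head, text_emb, seq_len, dim,
             num_ar_steps=8, num_denoise_steps=50):
    """Hypothetical masked-autoregressive sampling with a diffusion head."""
    latents = torch.zeros(seq_len, dim)             # continuous motion latents
    known = torch.zeros(seq_len, dtype=torch.bool)  # positions decoded so far

    for step in range(num_ar_steps):
        # Bidirectional pass: every position attends to all currently known
        # tokens and the text condition (interfaces are assumptions).
        ctx = backbone(latents, known_mask=known, cond=text_emb)

        # Pick a subset of still-masked positions to decode this round
        # (an even split for brevity; a cosine schedule is also common).
        remaining = (~known).nonzero(as_tuple=True)[0]
        n_pick = max(1, len(remaining) // (num_ar_steps - step))
        picked = remaining[torch.randperm(len(remaining))[:n_pick]]

        # Each picked position gets a short reverse-diffusion chain that
        # denoises Gaussian noise into a continuous motion latent,
        # conditioned on the backbone feature at that position.
        for i in picked.tolist():
            x = torch.randn(dim)
            for t in reversed(range(num_denoise_steps)):
                x = diffusion_head.denoise_step(x, t, cond=ctx[i])
            latents[i] = x
            known[i] = True

    return latents  # later decoded into joint positions/rotations
```

Because the per-token prediction stays in continuous space, this kind of loop avoids the quantization step of VQ-based pipelines while still decoding tokens progressively rather than all at once.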
A man steps forward, swings his leg, and turns all the way around.
A person doing a forward kick with each leg.
A man walks forward and then trips towards the right.
A person walks forward, stepping up with their right leg and down with their left, then turns to their left and walks, then turns to their left and starts stepping up.
A person fastly swimming forward.
The visualization validates that our full method generates more realistic motion and follows finer details of the textual instruction.
Without autoregressive modeling, the generated motion fails to fully align with the textual instructions.
Without Motion Representation Reformation, the model outputs shaky and inaccurate motion.
@article{meng2024rethinking,
  title={Rethinking Diffusion for Text-Driven Human Motion Generation},
  author={Meng, Zichong and Xie, Yiming and Peng, Xiaogang and Han, Zeyu and Jiang, Huaizu},
  journal={arXiv preprint arXiv:2411.16575},
  year={2024}
}