We introduce a novel Stylized Motion Diffusion model, dubbed SMooDi, to generate stylized motion driven by content texts and style motion sequences. Unlike existing methods that either generate motions of diverse content or transfer style from one motion sequence to another, SMooDi can rapidly generate motion across a broad range of content and diverse styles. To this end, we tailor a pre-trained text-to-motion model for stylization. Specifically, we propose style guidance to ensure that the generated motion closely matches the reference style, alongside a lightweight style adaptor that directs the motion towards the desired style while preserving realism. Experiments across various applications demonstrate that our proposed framework outperforms existing methods in stylized motion generation.
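As a rough illustration of the style guidance idea, the sketch below combines unconditional, content-conditioned, and style-conditioned noise predictions in a classifier-free-guidance fashion. The function name `denoiser` and the guidance weights `w_c` and `w_s` are hypothetical placeholders, not the paper's released implementation.

```python
import torch

def guided_noise(denoiser, z_t, t, content_emb, style_emb, w_c=7.5, w_s=2.5):
    """Hypothetical classifier-free-style guidance mixing a content text
    condition and a style motion condition. `denoiser` predicts noise given
    a noisy latent z_t, a timestep t, and (possibly null) conditions."""
    eps_uncond = denoiser(z_t, t, content=None, style=None)            # unconditional
    eps_content = denoiser(z_t, t, content=content_emb, style=None)    # content only
    eps_style = denoiser(z_t, t, content=content_emb, style=style_emb) # content + style
    # Push the prediction towards the content text, then further towards the style.
    return (eps_uncond
            + w_c * (eps_content - eps_uncond)
            + w_s * (eps_style - eps_content))
```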
Our model generates stylized human motions from a content text and a style motion sequence. At each denoising step, the model takes the content text, the style motion, and the noisy latent as input and predicts the noise, which is then used to compute the noisy latent for the next step. This denoising step is repeated T times to obtain a noise-free motion latent, which is fed into a motion decoder to produce the final stylized motion.
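To make this pipeline concrete, here is a minimal sketch of the described denoising loop in PyTorch-style code. The module names (`denoiser`, `scheduler`, `motion_decoder`), the latent shape, and the scheduler update are assumptions for illustration only, not the actual SMooDi implementation.

```python
import torch

@torch.no_grad()
def generate_stylized_motion(denoiser, scheduler, motion_decoder,
                             content_emb, style_motion,
                             T=1000, latent_shape=(1, 256)):
    """Run T denoising steps conditioned on a content text embedding and a
    style motion sequence, then decode the clean latent into stylized motion."""
    z_t = torch.randn(latent_shape)                          # start from Gaussian noise
    for t in reversed(range(T)):                             # steps T-1, ..., 0
        eps = denoiser(z_t, t, content_emb, style_motion)    # predict the noise
        z_t = scheduler.step(eps, t, z_t)                    # hypothetical update to next latent
    return motion_decoder(z_t)                               # noise-free latent -> motion
```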
@inproceedings{zhong2024smoodi,
  title={SMooDi: Stylized Motion Diffusion Model},
  author={Zhong, Lei and Xie, Yiming and Jampani, Varun and Sun, Deqing and Jiang, Huaizu},
  booktitle={ECCV},
  year={2024}
}