DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions

Hengyuan Zhang*1, Zhe Li*1, Xingqun Qi2, Mengze Li2, Muyi Sun3, Mang Zhang3, Sirui Han2

1 Peking University, 2 The Hong Kong University of Science and Technology, 3 Beijing University of Posts and Telecommunications

Paper (PDF) Code Dataset
DanceEditor concept figure

Our DanceEditor framework, pre-trained on a large-scale dataset, enables iterative and editable dance generation coherently aligned with the provided music signals. The highlighted text and avatar shadow effects indicate edits related to body movements.

Abstract

Generating coherent and diverse human dances from music signals has made tremendous progress in animating virtual avatars. However, while existing methods synthesize dances directly, they overlook that affording users editable dance movements is more practical in real choreography scenes. Moreover, the lack of high-quality dance datasets that incorporate iterative editing further limits progress on this challenge. To address this, we first construct DanceRemix, a large-scale multi-turn editable dance dataset comprising over 25.3M dance frames and 84.5K editing-prompt pairs. In addition, we propose DanceEditor, a novel framework for iterative and editable dance generation coherently aligned with given music signals. Since dance motion should be both musically rhythmic and iteratively editable through user descriptions, our framework is built on a prediction-then-editing paradigm that unifies multi-modal conditions. In the initial prediction stage, the framework improves the fidelity of generated results by directly modeling dance movements from the aligned music. In the subsequent iterative editing stages, we incorporate text descriptions as conditioning information and produce the edited results through a specifically designed Cross-modality Editing Module (CEM). Specifically, CEM adaptively integrates the initial prediction with music and text prompts as temporal motion cues to guide the synthesized sequences, so the results remain harmonized with the music while preserving fine-grained semantic alignment with the text descriptions. Extensive experiments demonstrate that our method outperforms state-of-the-art models on our newly collected DanceRemix dataset.

Framework

Prediction-then-Editing Paradigm

Prediction-then-Editing Paradigm

In the initial prediction stage, a diffusion-transformer-based Generating Branch takes music signals as input and synthesizes vivid dance motions. In the second stage, the Editing Branch, which contains a Cross-modality Editing Module (CEM), adaptively combines the initial dance predictions with both the music and the text prompts to guide the generation of edited dance sequences.
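The paper does not release CEM's internal equations here, but the description (adaptively integrating the initial prediction with music and text prompts as temporal cues) suggests a cross-attention-style fusion. Below is a minimal numpy sketch of that idea under assumed shapes and projection weights; the function name `cem_fuse` and all tensor layouts are hypothetical, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cem_fuse(motion, music, text, Wq, Wk, Wv):
    """Hypothetical cross-modality fusion sketch.

    motion: (T, d)  tokens of the initial dance prediction
    music:  (Tm, d) music condition tokens
    text:   (Tt, d) text-prompt condition tokens
    Wq/Wk/Wv: (d, d) assumed projection matrices

    Queries come from the initial motion; keys/values from the
    concatenated music and text tokens, so each motion frame can
    attend to whichever condition cue is most relevant.
    """
    cond = np.concatenate([music, text], axis=0)        # (Tm+Tt, d)
    q = motion @ Wq
    k = cond @ Wk
    v = cond @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))      # (T, Tm+Tt)
    return motion + attn @ v                            # residual update
```

A residual update (adding the attended cues back onto the initial prediction) matches the stated goal of editing the first-stage output rather than regenerating it from scratch.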

Dataset Workflow

Dataset Workflow figure

The workflow of DanceRemix dataset construction. First, we perform motion-to-motion retrieval to obtain pairs of similar dance motions. We then align the motion beats of the edited dance with the music beats. For the aligned dance pairs, we use Gemini to generate dense captions for the dance videos, and, based on these captions, leverage ChatGPT to generate edit instructions. After several rounds of motion-pair retrieval, we obtain music, a seed dance, a series of edit prompts, and the corresponding edited dance motions.
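The beat-alignment step above can be illustrated with a toy sketch: given estimated beat times for the edited motion and for the music, snap each motion beat to its nearest music beat. This is a deliberate simplification of whatever alignment the dataset pipeline actually uses; the function and its interface are assumptions:

```python
import numpy as np

def align_beats(motion_beats, music_beats):
    """Snap each motion beat time to its nearest music beat time.

    motion_beats: 1-D array of beat times (seconds) from the dance motion
    music_beats:  1-D array of beat times (seconds) from the music track
    Returns the music beat time closest to each motion beat.
    """
    motion_beats = np.asarray(motion_beats, dtype=float)
    music_beats = np.asarray(music_beats, dtype=float)
    # pairwise distances between every motion beat and every music beat
    idx = np.abs(motion_beats[:, None] - music_beats[None, :]).argmin(axis=1)
    return music_beats[idx]

# e.g. align_beats([0.1, 1.05, 2.2], [0.0, 1.0, 2.0]) -> [0.0, 1.0, 2.0]
```

In practice the motion frames would then be retimed (warped) so that these snapped beats land exactly on the music beats.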

Results

Results Figure

Iterative Editable Dance Generation

Our framework generates coherent and editable dance sequences, improving both motion quality and user controllability over existing methods.

Demo

Citation

If you use our work in your research, please cite:

@article{
}