URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model

Zhe Li*1, Xiang Bai*1, Jieyu Zhang2, Zhuangzhe Wu1, Che Xu1, Ying Li1, Chengkai Hou1, Shanghang Zhang1

1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, 2 University of Washington

Arxiv | Code | Dataset
URDF-Anything concept figure

URDF-Anything: Generating Functional URDF Digital Twins from Visual Observations (single- or multi-view images). Our framework, built on a 3D Multimodal Large Language Model and guided by instructions (e.g., "Segment parts and predict parameters"), processes the point cloud to jointly infer geometric part segmentation and kinematic structure. The output is a segmented 3D model with defined joints (represented here by different part colors), forming a functional URDF digital twin directly usable in physics simulators.

Abstract

Constructing accurate digital twins of articulated objects is essential for robotic simulation training and embodied AI world-model building, yet it has historically required painstaking manual modeling or multi-stage pipelines. In this work, we propose URDF-Anything, an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything uses an autoregressive prediction framework over point-cloud and text inputs to jointly optimize geometric segmentation and kinematic parameter prediction. It implements a specialized [SEG] token mechanism that interacts directly with point cloud features, enabling fine-grained part-level segmentation while remaining consistent with the predicted kinematic parameters. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches in geometric segmentation (a 17% improvement in mIoU), kinematic parameter prediction (a 29% average error reduction), and physical executability (surpassing baselines by 50%). Notably, our method exhibits strong generalization, performing well even on objects outside the training set. This work provides an efficient solution for constructing digital twins for robotic simulation and significantly enhances sim-to-real transfer.
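For intuition, the sketch below shows one way the [SEG]-token interaction could be realized: the LLM hidden state at each generated [SEG] position is projected and matched against per-point features from a 3D encoder to produce a per-part mask. This is a minimal conceptual sketch, not the released implementation; the module names, dimensions, and projection layers are assumptions.

```python
# Conceptual sketch of the [SEG]-token segmentation idea (not the released code).
# Assumed/hypothetical: tensor names, dimensions, and the projection layers.
import torch
import torch.nn as nn

class SegTokenDecoder(nn.Module):
    """Turn each [SEG] token embedding into a per-point mask over the point cloud."""

    def __init__(self, llm_dim: int = 4096, point_dim: int = 256):
        super().__init__()
        self.seg_proj = nn.Linear(llm_dim, point_dim)       # project LLM hidden states
        self.point_proj = nn.Linear(point_dim, point_dim)   # project point features

    def forward(self, seg_hidden: torch.Tensor, point_feats: torch.Tensor) -> torch.Tensor:
        # seg_hidden:  (num_seg_tokens, llm_dim)  hidden states at [SEG] positions
        # point_feats: (num_points, point_dim)    features from a 3D point encoder
        q = self.seg_proj(seg_hidden)       # (num_seg_tokens, point_dim)
        k = self.point_proj(point_feats)    # (num_points, point_dim)
        logits = q @ k.t()                  # (num_seg_tokens, num_points)
        return logits.sigmoid()             # per-part, per-point mask probabilities
```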

Framework

Prediction-then-Editing Paradigm


Overview of the URDF-Anything Framework. The pipeline takes a 3D point cloud (derived from images) and a structured language instruction as input. The 3D MLLM (fine-tuned with LoRA) autoregressively generates symbolic output (kinematic parameters) and [SEG] tokens. The embeddings corresponding to the generated [SEG] tokens then interact with the point cloud features via a 3D Decoder to perform fine-grained geometric segmentation of the point cloud into individual links. Finally, the jointly predicted kinematic parameters and the segmented geometry are integrated into a functional URDF file, resulting in a complete articulated 3D model ready for physics simulation.
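As an illustration of the final assembly step, the following sketch builds a URDF file from segmented part meshes and predicted joint parameters using Python's standard xml.etree.ElementTree. It is a simplified, hypothetical example: the dictionary schema, mesh filenames, and default limit values are assumptions, not the paper's exact format.

```python
# Minimal sketch of assembling a URDF from predicted parts and joints
# (illustrative only; field names, mesh paths, and defaults are assumptions).
import xml.etree.ElementTree as ET

def build_urdf(object_name, links, joints):
    """links:  list of {"name", "mesh"};
    joints: list of {"name", "type", "parent", "child",
                     "origin_xyz", "axis", "lower", "upper"}."""
    robot = ET.Element("robot", name=object_name)
    for link in links:
        link_el = ET.SubElement(robot, "link", name=link["name"])
        visual = ET.SubElement(link_el, "visual")
        geom = ET.SubElement(visual, "geometry")
        ET.SubElement(geom, "mesh", filename=link["mesh"])
    for j in joints:
        joint_el = ET.SubElement(robot, "joint", name=j["name"], type=j["type"])
        ET.SubElement(joint_el, "parent", link=j["parent"])
        ET.SubElement(joint_el, "child", link=j["child"])
        ET.SubElement(joint_el, "origin", xyz=" ".join(map(str, j["origin_xyz"])))
        ET.SubElement(joint_el, "axis", xyz=" ".join(map(str, j["axis"])))
        if j["type"] in ("revolute", "prismatic"):
            ET.SubElement(joint_el, "limit", lower=str(j["lower"]),
                          upper=str(j["upper"]), effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")
```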

Experiments

WIP

Physical Executability Rate (%, ↑) across methods on in-distribution (ID) and out-of-distribution (OOD) subsets.

URDF-Anything achieves a high physical executability rate, significantly surpassing baseline methods, particularly on OOD objects, which demonstrates the robustness of the overall pipeline. Prior methods such as Real2Code rely on complex sequential pipelines in which errors in one step cascade and require manual intervention to fix, while Articulate-Anything may depend on iterative refinement for parameter estimation. In contrast, our framework uses a unified, end-to-end MLLM that jointly reasons about geometry and kinematics. This integrated approach minimizes error propagation and lets the model leverage rich multimodal context to predict a consistent geometric and kinematic structure in a single pass.
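For context, physical executability can be probed by loading the generated URDF in a physics engine and actuating its joints. The sketch below uses PyBullet for such a check; it is only an assumed illustration of this kind of test, not necessarily the evaluation protocol used in the paper.

```python
# Illustrative executability check (assumed protocol, not necessarily the paper's):
# load the generated URDF in PyBullet and try to actuate each movable joint.
import pybullet as p

def urdf_is_executable(urdf_path: str) -> bool:
    cid = p.connect(p.DIRECT)  # headless physics server
    try:
        body = p.loadURDF(urdf_path)  # raises pybullet.error if the URDF is malformed
        for j in range(p.getNumJoints(body)):
            info = p.getJointInfo(body, j)
            if info[2] in (p.JOINT_REVOLUTE, p.JOINT_PRISMATIC):
                upper = info[9]  # joint upper limit
                p.setJointMotorControl2(body, j, p.POSITION_CONTROL,
                                        targetPosition=upper)
        for _ in range(240):  # simulate roughly one second at 240 Hz
            p.stepSimulation()
        return True
    except p.error:
        return False
    finally:
        p.disconnect(cid)
```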

WIP

Ablation Study on Input Modality for Joint Parameter Prediction.

This ablation study shows that high-fidelity 3D kinematic inference requires combining detailed 3D geometry with language guidance. 2D images alone are insufficient for this task, and simplified 3D representations such as oriented bounding boxes (OBBs) discard crucial geometric detail. A detailed point cloud alone already improves performance, but the best results come from integrating it with language instructions inside a 3D Multimodal Large Language Model (MLLM), validating this combination as the optimal design.

WIP

Qualitative Comparison of Articulated Object Reconstruction Results.

The top row displays the input image for each articulated object instance (one object per column). Baseline methods frequently predict incorrect object types, generate distorted geometry, or misplace links, leading to misaligned or incorrect structures.

Results

Reconstruction results for example objects: instance 31249, a drawer, and a laptop.

Demo

Citation

If you use our work in your research, please cite:

@article{
}