URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model

Zhe Li*1, Xiang Bai*1, Jieyu Zhang2, Zhuangzhe Wu1, Che Xu1, Ying Li1, Chengkai Hou1, Shanghang Zhang1

1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, 2 University of Washington

Arxiv | Code | Dataset
URDF-Anything concept figure

URDF-Anything: Generating Functional URDF Digital Twins from Visual Observations (single- or multi-view images). Our framework, built on a 3D Multimodal Large Language Model and guided by instructions (e.g., "Segment parts and predict parameters"), processes the point cloud to jointly infer geometric part segmentation and kinematic structure. The output is a segmented 3D model with defined joints (represented here by different part colors), forming a functional URDF digital twin directly usable in physics simulators.

Abstract

Constructing accurate digital twins of articulated objects is essential for robotic simulation training and embodied AI world-model building, yet it has historically required painstaking manual modeling or multi-stage pipelines. In this work, we propose URDF-Anything, an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything uses an autoregressive prediction framework over point-cloud and text inputs to jointly optimize geometric segmentation and kinematic parameter prediction. It implements a specialized [SEG] token mechanism that interacts directly with point cloud features, enabling fine-grained part-level segmentation while remaining consistent with the predicted kinematic parameters. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches in geometric segmentation (a 17% improvement in mIoU), kinematic parameter prediction (a 29% average error reduction), and physical executability (surpassing baselines by 50%). Notably, our method exhibits strong generalization, performing well even on objects outside the training set. This work provides an efficient solution for constructing digital twins for robotic simulation and significantly enhances sim-to-real transfer.
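For intuition, the sketch below shows one way the [SEG]-token interaction could be realized: the LLM hidden state at each generated [SEG] position is projected and matched against per-point features from a 3D encoder to produce a per-part mask. This is a minimal conceptual sketch, not the released implementation; the module names, dimensions, and projection layers are assumptions.

```python
# Conceptual sketch of the [SEG]-token segmentation idea (not the released code).
# Assumed/hypothetical: tensor names, dimensions, and the projection layers.
import torch
import torch.nn as nn

class SegTokenDecoder(nn.Module):
    """Turn each [SEG] token embedding into a per-point mask over the point cloud."""

    def __init__(self, llm_dim: int = 4096, point_dim: int = 256):
        super().__init__()
        self.seg_proj = nn.Linear(llm_dim, point_dim)       # project LLM hidden states
        self.point_proj = nn.Linear(point_dim, point_dim)   # project point features

    def forward(self, seg_hidden: torch.Tensor, point_feats: torch.Tensor) -> torch.Tensor:
        # seg_hidden:  (num_seg_tokens, llm_dim)  hidden states at [SEG] positions
        # point_feats: (num_points, point_dim)    features from a 3D point encoder
        q = self.seg_proj(seg_hidden)       # (num_seg_tokens, point_dim)
        k = self.point_proj(point_feats)    # (num_points, point_dim)
        logits = q @ k.t()                  # (num_seg_tokens, num_points)
        return logits.sigmoid()             # per-part, per-point mask probabilities
```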

Framework

Prediction-then-Editing Paradigm


Overview of the URDF-Anything Framework. The pipeline takes a 3D point cloud (derived from images) and a structured language instruction as input. The 3D MLLM (fine-tuned with LoRA) autoregressively generates symbolic output (kinematic parameters) and [SEG] tokens. The embeddings corresponding to the generated [SEG] tokens then interact with the point cloud features via a 3D Decoder to perform fine-grained geometric segmentation of the point cloud into individual links. Finally, the jointly predicted kinematic parameters and the segmented geometry are integrated into a functional URDF file, resulting in a complete articulated 3D model ready for physics simulation.
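As an illustration of the final assembly step, the following sketch builds a URDF file from segmented part meshes and predicted joint parameters using Python's standard xml.etree.ElementTree. It is a simplified, hypothetical example: the dictionary schema, mesh filenames, and default limit values are assumptions, not the paper's exact format.

```python
# Minimal sketch of assembling a URDF from predicted parts and joints
# (illustrative only; field names, mesh paths, and defaults are assumptions).
import xml.etree.ElementTree as ET

def build_urdf(object_name, links, joints):
    """links:  list of {"name", "mesh"};
    joints: list of {"name", "type", "parent", "child",
                     "origin_xyz", "axis", "lower", "upper"}."""
    robot = ET.Element("robot", name=object_name)
    for link in links:
        link_el = ET.SubElement(robot, "link", name=link["name"])
        visual = ET.SubElement(link_el, "visual")
        geom = ET.SubElement(visual, "geometry")
        ET.SubElement(geom, "mesh", filename=link["mesh"])
    for j in joints:
        joint_el = ET.SubElement(robot, "joint", name=j["name"], type=j["type"])
        ET.SubElement(joint_el, "parent", link=j["parent"])
        ET.SubElement(joint_el, "child", link=j["child"])
        ET.SubElement(joint_el, "origin", xyz=" ".join(map(str, j["origin_xyz"])))
        ET.SubElement(joint_el, "axis", xyz=" ".join(map(str, j["axis"])))
        if j["type"] in ("revolute", "prismatic"):
            ET.SubElement(joint_el, "limit", lower=str(j["lower"]),
                          upper=str(j["upper"]), effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")
```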

Experiments

WIP

Physical Executability Rate (%, ↑) across methods on in-distribution (ID) and out-of-distribution (OOD) subsets.

URDF-Anything achieves a high physical executability rate, significantly surpassing baseline methods, particularly on OOD objects, which demonstrates the robustness of the overall pipeline. Prior methods such as Real2Code rely on complex sequential pipelines in which errors in one step cascade and require manual intervention to fix, while Articulate-Anything may depend on iterative refinement for parameter estimation. In contrast, our framework uses a unified, end-to-end MLLM that jointly reasons about geometry and kinematics. This integrated approach minimizes error propagation and lets the model leverage rich multimodal context to predict a consistent geometric and kinematic structure in a single pass.
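For context, physical executability can be probed by loading the generated URDF in a physics engine and actuating its joints. The sketch below uses PyBullet for such a check; it is only an assumed illustration of this kind of test, not necessarily the evaluation protocol used in the paper.

```python
# Illustrative executability check (assumed protocol, not necessarily the paper's):
# load the generated URDF in PyBullet and try to actuate each movable joint.
import pybullet as p

def urdf_is_executable(urdf_path: str) -> bool:
    cid = p.connect(p.DIRECT)  # headless physics server
    try:
        body = p.loadURDF(urdf_path)  # raises pybullet.error if the URDF is malformed
        for j in range(p.getNumJoints(body)):
            info = p.getJointInfo(body, j)
            if info[2] in (p.JOINT_REVOLUTE, p.JOINT_PRISMATIC):
                upper = info[9]  # joint upper limit
                p.setJointMotorControl2(body, j, p.POSITION_CONTROL,
                                        targetPosition=upper)
        for _ in range(240):  # simulate roughly one second at 240 Hz
            p.stepSimulation()
        return True
    except p.error:
        return False
    finally:
        p.disconnect(cid)
```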

WIP

Ablation Study on Input Modality for Joint Parameter Prediction.

This ablation study shows that high-fidelity 3D kinematic inference requires combining detailed 3D geometry with language guidance. 2D images alone are insufficient for this task, and simplified 3D representations such as oriented bounding boxes (OBBs) discard crucial geometric detail. A detailed point cloud alone already improves performance, but the best results come from integrating it with language instructions inside a 3D Multimodal Large Language Model (MLLM), validating this combination as the optimal design.

WIP

Qualitative Comparison of Articulated Object Reconstruction Results.

The top row displays the input image for each articulated object instance (one object per column). Baseline methods frequently predict incorrect object types, generate distorted geometry, or misplace links, leading to misaligned or incorrect structures.

Results

Reconstruction results for example objects: instance 31249, a drawer, and a laptop.

Demo

Citation

If you use our work in your research, please cite:

@article{
}