ERNSI Workshop 2025
In-context learning has emerged as a key capability of modern neural architectures. While its impact has been significant in domains such as natural language processing, computer vision, and policy generation, its potential for system identification remains underexplored in robotics. Building upon prior work on meta-learnable dynamical modeling with Transformers, we propose a methodology for predicting end-effector poses and joint positions from torque signals for each robotic joint, without requiring prior knowledge of the system’s physical parameters, using diffusion models. In the first part of this work, we enhance the RoboMorph framework by improving dataset generation through large-scale simulation with NVIDIA Isaac Gym, which we also adopt as a baseline for comparison. We then train and compare two complementary in-context learning architectures: a Transformer-based model and a Diffuser-based model, applied to the dynamic behavior of the Franka Emika Panda and KUKA Allegro robotic platforms. To explore different configurations of the system identification problem using Diffusers, we frame it from multiple perspectives, leveraging classifier guidance, trajectory inpainting, and receding horizon approaches for improved trajectory estimation. Our aim is to investigate the implications of this approach for online control. We demonstrate that our meta-learned models can perform fast online inference, making them suitable for real-time applications. Furthermore, we exploit the inherent flexibility of diffusion models to condition on external signals at inference time, such as controller parameters, enabling enhanced in-context system identification. We conduct extensive benchmarking across a variety of Cartesian and joint space tasks generated in Isaac Gym. Code and datasets will be released to foster reproducibility and further research.
Considering the simplified system identification problem presented in the work of Forgione et al., the identification problem for any dynamical system may be reduced to a horizon planning problem. For context and horizon lengths m and N and the related action-observation pairs (u, y), the meta-learning black-box system identification model is:
which is learnable with support and query datasets, as in the generic meta-learning framework. Physically, the action trajectory represents either the feedforward controller torque outputs or the feedback controller torque outputs, reference trajectories, and gains. The corresponding observed trajectory comprises the complete kinematics in the 14D joint and Cartesian spaces, as adopted by Bazzi et al. The proposed input-output paradigm is shown below.
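The input-output paradigm can be illustrated by splitting a single rollout into a context part and a query part. This is a minimal sketch under assumptions that are not stated in the text: the model is taken to have the multi-step form $\hat{y}_{m+1:m+N} = \mathcal{M}_\phi(u_{1:m}, y_{1:m}, u_{m+1:m+N})$, N is read as the number of query steps that follow the m context steps, and the toy rollout uses 7 torque channels (the Panda arm has seven joints) with a 14D observation. The helper `split_context_query` is a hypothetical name, not part of RoboMorph or any other library.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Pair = Tuple[Vector, Vector]


def split_context_query(
    u: Sequence[Vector], y: Sequence[Vector], m: int, n: int
) -> Tuple[List[Pair], List[Vector], List[Vector]]:
    """Split one rollout into context pairs, query inputs and query targets.

    Assumption: the first m steps form the context (u, y) pairs and the
    next n steps form the query, whose outputs the model must predict.
    """
    if len(u) != len(y):
        raise ValueError("u and y must have the same number of time steps")
    if m < 1 or n < 1 or m + n > len(u):
        raise ValueError("need m >= 1, n >= 1 and m + n <= len(u)")
    context = [(list(u[t]), list(y[t])) for t in range(m)]
    query_inputs = [list(u[t]) for t in range(m, m + n)]
    query_targets = [list(y[t]) for t in range(m, m + n)]
    return context, query_inputs, query_targets


# Toy rollout with made-up values: 7 torque channels in, 14-D kinematics out.
T = 10
u = [[float(t)] * 7 for t in range(T)]
y = [[float(2 * t)] * 14 for t in range(T)]

context, u_query, y_query = split_context_query(u, y, m=6, n=4)
print(len(context), len(u_query), len(y_query))
```

A learned model would receive `context` and `u_query` and be trained to reproduce `y_query`; only the data layout is shown here, not the network.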
Dynamical models are learnable with
For system identification, Transformer-based autoregressive models are superior to the Diffuser-based alternatives.
Context matters: query set size is critical in meta-learning. However, Transformer-based and Diffuser-based approaches differ in their sensitivity to query set size.
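One way to measure this sensitivity is to hold the context fixed and sweep the query size, recording the prediction error for each setting. The sketch below is only an evaluation harness under stated assumptions: `rmse_by_query_size` and `gain_predictor` are hypothetical names, the signals are scalar for brevity, and `gain_predictor` is a toy least-squares stand-in, not the Transformer or Diffuser model. Any callable with the same `(u_ctx, y_ctx, u_query)` signature could be plugged in instead.

```python
import math
from typing import Callable, Dict, List, Sequence

Predictor = Callable[
    [Sequence[float], Sequence[float], Sequence[float]], List[float]
]


def gain_predictor(
    u_ctx: Sequence[float], y_ctx: Sequence[float], u_query: Sequence[float]
) -> List[float]:
    """Toy stand-in model: fit y = k * u on the context by least squares."""
    num = sum(ui * yi for ui, yi in zip(u_ctx, y_ctx))
    den = sum(ui * ui for ui in u_ctx)
    k = num / den if den > 0.0 else 0.0
    return [k * ui for ui in u_query]


def rmse_by_query_size(
    predict: Predictor,
    u: Sequence[float],
    y: Sequence[float],
    m: int,
    query_sizes: Sequence[int],
) -> Dict[int, float]:
    """RMSE on the first n query steps after a fixed context of length m."""
    results: Dict[int, float] = {}
    for n in query_sizes:
        if n < 1 or m + n > len(u):
            raise ValueError(f"query size {n} does not fit the rollout")
        y_hat = predict(u[:m], y[:m], u[m:m + n])
        y_true = y[m:m + n]
        mse = sum((a - b) ** 2 for a, b in zip(y_hat, y_true)) / n
        results[n] = math.sqrt(mse)
    return results


# Made-up scalar system y = 2 * u, used only to exercise the harness.
u = [float(t % 5 + 1) for t in range(40)]
y = [2.0 * ui for ui in u]
scores = rmse_by_query_size(gain_predictor, u, y, m=10, query_sizes=[5, 10, 20])
print(scores)
```

Running the same sweep for each architecture on the same rollouts gives one error curve per model, which makes the difference in sensitivity directly comparable.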


Transformer-based approaches also perform well on basic sim2real trajectories. A major drawback is their computational requirements, which are constrained by memory limitations.




Diffusion-based alternatives gain spatial flexibility at inference time from their convolutional layers. This comes at a slight cost in performance, but overall extrapolation becomes easier, especially for periodic signals.
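The flexibility attributed to convolutional layers comes from the fact that a convolution kernel has a fixed number of weights regardless of how long the input is. The sketch below illustrates only that property with a hand-written 1-D convolution: `conv1d_same` is a hypothetical helper and the kernel values are made up, so this is not the Diffuser network itself.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D cross-correlation with zero padding; output length equals input length."""
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd for symmetric padding")
    half = len(kernel) // 2
    out: List[float] = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            # Samples outside the signal are treated as zeros.
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


# The same three weights are reused on trajectories of different lengths.
kernel = [0.25, 0.5, 0.25]
short_out = conv1d_same([1.0] * 16, kernel)
long_out = conv1d_same([1.0] * 64, kernel)
print(len(short_out), len(long_out))
```

Because nothing in the kernel depends on the sequence length, a model built from such layers can be trained on one horizon and applied to a longer one at inference time.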

