Advancing Identification Methodologies Through In-Context Dynamical Meta-Learning: A Comparison Between Different Neural Architectures

ERNSI Workshop 2025

In-context learning has emerged as a key capability of modern neural architectures. While its impact has been significant in domains such as natural language processing, computer vision, and policy generation, its potential for system identification remains underexplored in robotics. Building upon prior work on meta-learnable dynamical modeling with Transformers, we propose a methodology for predicting end-effector poses and joint positions from torque signals for each robotic joint — without requiring prior knowledge of the system’s physical parameters — using diffusion models. In the first part of this work, we enhance the RoboMorph framework by improving dataset generation through large-scale simulation with NVIDIA Isaac Gym, which we also adopt as a baseline for comparison. We then train and compare two complementary in-context learning architectures: a Transformer-based model and a Diffuser-based model, applied to the dynamic behavior of the Franka Emika Panda and KUKA Allegro robotic platforms. To explore different configurations of the system identification problem using Diffusers, we frame it from multiple perspectives, leveraging classifier guidance, trajectory inpainting, and receding horizon approaches for improved trajectory estimation. Our aim is to investigate the implications of this approach for online control. We demonstrate that our meta-learned models can perform fast online inference, making them suitable for real-time applications. Furthermore, we exploit the inherent flexibility of diffusion models to condition on external signals at inference time — such as controller parameters — enabling enhanced in-context system identification. We conduct extensive benchmarking across a variety of Cartesian and joint space tasks generated in Isaac Gym. Code and datasets will be released to foster reproducibility and further research.

system identification pipeline

Considering the simplified system identification setting presented in the work of Forgione et al., the dynamic identification problem of any dynamical system may be reduced to a horizon planning problem. For context and horizon lengths m and N and the related action-observation pairs (u, y), the meta-learning black-box system identification model is:

$\hat{y}_{m+1:m+N} = f_\theta\left(u_{1:m},\, y_{1:m},\, u_{m+1:m+N}\right)$

which is learnable with support and query datasets, as in the generic meta-learning framework. Physically, the action trajectory represents either the feedforward controller torque outputs or the feedback controller torque outputs together with reference trajectories and gains. The corresponding observed trajectory stands for the complete kinematics in the 14-D joint and Cartesian spaces, as adopted by Bazzi et al. The proposed input-output paradigm is shown below.
[Figure: input-output paradigm for query and support sets in the meta-learning identification framework]
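Under the formulation above, each simulated trajectory is split into a context segment the model conditions on and a query horizon it must predict. A minimal sketch of that split, assuming NumPy arrays of torques `u` and observations `y` (the function name is hypothetical):

```python
import numpy as np

def split_context_query(u, y, m, N):
    """Split an action-observation trajectory into the context
    (u_1:m, y_1:m) the model conditions on and the query horizon
    (u_m+1:m+N, y_m+1:m+N) it must predict."""
    assert len(u) >= m + N and len(y) >= m + N
    ctx = (u[:m], y[:m])        # in-context examples
    qry_in = u[m:m + N]         # future torques (known inputs)
    qry_out = y[m:m + N]        # future kinematics (targets)
    return ctx, qry_in, qry_out

# toy trajectory: scalar torque signal and a linear observation
u = np.arange(10.0)
y = 2.0 * u
(ctx_u, ctx_y), qu, qy = split_context_query(u, y, m=6, N=4)
print(len(ctx_u), len(qu))  # → 6 4
```

In the actual datasets, `u` and `y` would be multi-dimensional arrays (torques per joint, 14-D kinematics), but the slicing pattern is identical along the time axis.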

Dynamical models are learnable with different neural architectures.
Autoregressive architectures: we train two Transformer-based autoregressive architectures (sequential Transformers and conditioned diffusion Transformers) following the meta-learning paradigm suggested by Forgione et al. Transformers process information sequentially, consuming one input/output pair per time instant at each pass through the encoder/decoder stack.
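The autoregressive inference pattern can be sketched as a rollout loop in which each prediction is fed back as the next observation. This is a minimal NumPy illustration; `step_model` and `dummy_step` are hypothetical stand-ins for the trained Transformer's one-step prediction, not the actual architecture:

```python
import numpy as np

def autoregressive_rollout(step_model, ctx_u, ctx_y, future_u):
    """Roll a one-step model over the query horizon, feeding each
    prediction back in as the next observation, mirroring the
    Transformer's sequential encoder/decoder passes."""
    u_hist = list(ctx_u)
    y_hist = list(ctx_y)
    preds = []
    for u_next in future_u:
        y_next = step_model(np.array(u_hist), np.array(y_hist), u_next)
        preds.append(y_next)
        u_hist.append(u_next)   # history grows by one action-
        y_hist.append(y_next)   # observation pair per step
    return np.array(preds)

# dummy stand-in model: estimate a static gain from the context
def dummy_step(u_hist, y_hist, u_next):
    gain = np.mean(y_hist[1:] / u_hist[1:])  # skip u=0 to avoid /0
    return gain * u_next

u = np.arange(8.0)
y = 3.0 * u
preds = autoregressive_rollout(dummy_step, u[:5], y[:5], u[5:])
# preds → [15., 18., 21.]
```

The key property shown here is that errors can compound: every predicted `y_next` becomes part of the history used for the following step.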

[Figures: base Transformer sketch; conditional diffusion Transformer sketch]
Non-autoregressive architectures: we adapt two diffusion-based non-autoregressive architectures (inpainted diffusers and conditional diffusers) following the same meta-learning paradigm. Diffusion-based probabilistic estimation is not inherently sequential; temporal connectivity is mainly inferred from local properties.
[Figures: inpainted diffusion CNN sketch; conditional diffusion CNN sketch]
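The inpainting variant can be illustrated with a toy reverse-diffusion loop: the whole trajectory is generated at once, and after each denoising step the observed context samples are written back in, conditioning the sample without any gradient guidance. This is a sketch under strong simplifications; `denoise` is a hypothetical stand-in for the trained noise-prediction CNN, not a real DDPM schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

def inpainted_sampling(denoise, x_known, known_mask, T=50, length=16):
    """Toy reverse-diffusion loop with trajectory inpainting: after
    every denoising step, the known context samples are clamped back
    into the trajectory, so generation stays consistent with them."""
    x = rng.standard_normal(length)            # start from pure noise
    for t in range(T, 0, -1):
        x = denoise(x, t)                      # one denoising step
        x = np.where(known_mask, x_known, x)   # clamp observed context
    return x

# dummy denoiser that simply contracts samples toward zero
denoise = lambda x, t: 0.8 * x

known = np.zeros(16)
known[:4] = 1.0                        # first 4 samples are observed
mask = np.zeros(16, dtype=bool)
mask[:4] = True
traj = inpainted_sampling(denoise, known, mask)
```

Because the full trajectory is denoised jointly, there is no left-to-right dependency: the model can condition on context placed anywhere in the sequence, which is what gives the diffuser its spatial flexibility.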

For system identification, Transformer-based autoregressive models are superior.

[Figure: comparison of the 4 neural architectures on the system identification task using MS and CH signals]
[Figures: estimation of Cartesian position with Transformer models; errors on Cartesian position with Transformer models]

Context matters: query set size is critical in meta-learning. However, Transformer- and diffuser-based approaches show different sensitivities to query set size.

[Figures: RMSE errors in contextual tests for 20% MS input, Transformer; RMSE errors in contextual tests for 20% CH input, Transformer]
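The sensitivity study behind these plots amounts to sweeping the context fraction and scoring the held-out horizon by RMSE. A minimal sketch, where `predict` is a hypothetical stand-in for either architecture's inference call (here a trivial least-squares gain, not the real models):

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def context_sweep(predict, u, y, fractions, N):
    """Score one model at several context sizes: for each fraction,
    the first m samples serve as context and the next N samples are
    predicted and compared against the ground truth."""
    scores = {}
    for frac in fractions:
        m = int(frac * len(u))
        y_hat = predict(u[:m], y[:m], u[m:m + N])
        scores[frac] = rmse(y[m:m + N], y_hat)
    return scores

# stand-in predictor: least-squares static gain fit on the context
predict = lambda cu, cy, qu: (cu @ cy / (cu @ cu)) * qu

t = np.linspace(0, 1, 100)
u = np.sin(2 * np.pi * t) + 0.1
y = 2.0 * u + 0.01 * np.cos(40 * t)   # near-linear plant plus ripple
scores = context_sweep(predict, u, y, fractions=[0.2, 0.5], N=10)
```

With the real Transformer and diffuser models, the same loop produces the per-fraction RMSE curves reported above.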

Transformer-based approaches also perform well on basic sim-to-real trajectories. A major drawback is their computational requirements, due to memory limitations.

[Figures: RMSE errors in real Transformer tests, 20% and 50% context; position prediction errors in the Transformer, 20% and 50% context]

Diffusion-based alternatives offer spatial flexibility at inference time thanks to their convolutional layers. This comes at a slight cost in performance, but overall, especially for periodic signals, extrapolation is facilitated.
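Extrapolation beyond the training horizon uses the receding-horizon scheme mentioned in the abstract: plan a full horizon, commit only the first sample, then replan from there. A toy sketch, where `plan` is a hypothetical stand-in for one conditional diffusion sampling call:

```python
import numpy as np

def receding_horizon(plan, y0, H, steps):
    """Receding-horizon rollout: at every step the planner generates
    an H-sample trajectory from the current state, only the first
    sample is committed, and planning restarts from that sample."""
    state, executed = y0, []
    for _ in range(steps):
        traj = plan(state, H)   # full H-step plan
        state = traj[0]         # commit only the first sample
        executed.append(state)
    return np.array(executed)

# toy planner: exponential decay toward zero over the horizon
plan = lambda s, H: s * 0.9 ** np.arange(1, H + 1)

out = receding_horizon(plan, y0=1.0, H=5, steps=3)
# out → [0.9, 0.81, 0.729]
```

Replanning at every step keeps the generated trajectory anchored to the most recent state, which is what allows the diffuser to extend periodic signals well past the horizon length it was trained on.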

[Figures: RMSE errors in real diffuser tests, 20% context; position prediction errors in the diffuser, 20% context, 10x horizon]