Evaluating Real-World Trained Policies in Simulation

Abstract

Evaluating generalist robot policies in the real world is slow and expensive, making it a major bottleneck for progress. To address this, we build a simulation-based evaluation pipeline that runs policies inside reconstructions of real-world scenes. We use the MuJoCo simulator to model the robot and its dynamics, and the objects in the scene are simulated within the same physics environment. The static background is reconstructed using 3D Gaussian Splatting (3DGS) to achieve photorealistic rendering.

Within this reconstructed environment, we evaluate several representative policy architectures, including ACT, Diffusion Policy, and π_0.5. Our goal is to identify the key parameters and design choices that must be carefully controlled to ensure that policy evaluations in simulation correlate faithfully with real-world performance.
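One common way to quantify how well simulated evaluation tracks real-world performance is the Pearson correlation between per-policy success rates measured in both settings. The sketch below is only an illustration of that idea, not our evaluation code: the function name `pearson` and the success-rate values are hypothetical placeholders, not measured results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder success rates for three hypothetical policies (NOT real data).
    real_success = [0.2, 0.5, 0.8]
    sim_success = [0.1, 0.4, 0.9]
    print(f"real/sim correlation: {pearson(real_success, sim_success):.3f}")
```

A value near 1 would indicate that policies that do better in simulation also do better on the real robot. With only a handful of policies the estimate is noisy, so it is best read alongside the raw success rates.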

First, we match the dynamics of the simulation to the real world. As the video below shows, the simulated motion closely tracks the real robot, apart from a constant offset caused by inaccurate camera extrinsics between real and sim. We found velocity control to be more reliable than position control for this matching.
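When the same scalar quantity is logged in both settings, a constant offset can be separated from genuine dynamics mismatch by subtracting the mean difference before computing the error. The following is a minimal sketch under that assumption. The function name and the sample values are hypothetical, and it assumes the real and sim signals are already time-aligned and sampled at the same rate.

```python
import math


def offset_and_residual_rmse(real, sim):
    """Return (constant offset, RMSE that remains after removing the offset).

    `real` and `sim` are equal-length sequences of one tracked scalar,
    for example a joint position, sampled at the same timesteps.
    """
    if len(real) != len(sim) or len(real) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [r - s for r, s in zip(real, sim)]
    offset = sum(diffs) / len(diffs)  # mean difference = constant bias
    residual = math.sqrt(sum((d - offset) ** 2 for d in diffs) / len(diffs))
    return offset, residual


if __name__ == "__main__":
    # Placeholder signals (NOT real data): sim lags real by a fixed 0.5.
    real_signal = [1.0, 2.0, 3.0, 4.0]
    sim_signal = [0.5, 1.5, 2.5, 3.5]
    offset, residual = offset_and_residual_rmse(real_signal, sim_signal)
    print(f"offset={offset:.3f}, residual RMSE={residual:.3f}")
```

A large offset with a small residual would suggest a calibration issue rather than a dynamics mismatch, while a large residual points at the dynamics themselves.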

We aim for trajectory-level alignment between real and sim, meaning agreement at each state along a rollout, but this has proven hard and we have not yet achieved it. The video below is a side-by-side comparison of the ACT policy in real and sim.
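One simple way to make "alignment at each state" measurable is to compute the distance between the real and simulated state at every timestep and report the first step at which it exceeds a tolerance. The sketch below illustrates this and is not our actual metric: the helper names, the choice of Euclidean distance, and the tolerance are all assumptions, and the states are placeholder values.

```python
import math


def per_step_error(real_states, sim_states):
    """Euclidean distance between real and sim state at each timestep."""
    if len(real_states) != len(sim_states):
        raise ValueError("trajectories must have the same number of steps")
    return [math.dist(r, s) for r, s in zip(real_states, sim_states)]


def first_divergence(errors, tol):
    """Index of the first timestep whose error exceeds `tol`, or None."""
    for step, err in enumerate(errors):
        if err > tol:
            return step
    return None


if __name__ == "__main__":
    # Placeholder 2-D states (NOT real data); sim drifts away at the last step.
    real_traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    sim_traj = [(0.0, 0.0), (1.0, 0.1), (2.0, 1.0)]
    errors = per_step_error(real_traj, sim_traj)
    print("per-step error:", [round(e, 3) for e in errors])
    print("first divergence step:", first_divergence(errors, tol=0.5))
```

Because a closed-loop policy reacts to its own observations, small early errors can compound, so the first divergence step is often more informative than an error averaged over the whole rollout.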