DGN: Reinforcement Learning via Implicit Imitation Guidance

Perry Dong*, Alec M. Lessing*, Annie S. Chen*, Chelsea Finn
IRIS Lab
Stanford University

Abstract

We study the problem of efficient reinforcement learning, where prior data such as demonstrations are provided for initialization in lieu of a dense reward signal. A natural approach is to incorporate an imitation learning objective, either as regularization during training or to acquire a reference policy. However, imitation learning objectives can ultimately degrade long-term performance, as they do not directly align with reward maximization. In this work, we propose to use prior data solely for guiding exploration via noise added to the policy, sidestepping the need for explicit behavior cloning constraints. The key insight in our framework, Data-Guided Noise (DGN), is that demonstrations are most useful for identifying which actions should be explored, rather than for forcing the policy to take certain actions. Our approach achieves up to a 2-3x improvement over prior methods for RL from offline data across seven simulated continuous control tasks.







Method Overview. DGN learns a state-conditioned noise distribution using the difference between expert actions and RL policy actions. DGN then uses this learned distribution to provide implicit imitation guidance for exploration.
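To make this concrete, below is a minimal sketch, assuming a deterministic policy and a zero-mean diagonal Gaussian noise model: it fits the state-conditioned noise distribution to the residuals between demonstration actions and the current policy's actions, then samples exploration noise from it. The PyTorch code, network sizes, and names such as NoiseModel, noise_loss, and explore_action are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class NoiseModel(nn.Module):
    """Predicts a per-dimension standard deviation for exploration noise, conditioned on state."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),  # per-dimension log standard deviations
        )

    def forward(self, state):
        return self.net(state).clamp(-5.0, 2.0).exp()

def noise_loss(noise_model, policy, states, expert_actions):
    """Negative log-likelihood of expert-minus-policy action residuals under a
    zero-mean diagonal Gaussian with the predicted standard deviations."""
    with torch.no_grad():
        policy_actions = policy(states)  # policy(s) -> action (assumed deterministic)
    residuals = expert_actions - policy_actions
    std = noise_model(states)
    dist = torch.distributions.Normal(torch.zeros_like(std), std)
    return -dist.log_prob(residuals).sum(-1).mean()

def explore_action(noise_model, policy, state):
    """Exploration: perturb the policy's action with noise drawn from the learned distribution."""
    with torch.no_grad():
        action = policy(state)
        std = noise_model(state)
    return action + std * torch.randn_like(action)

During online RL, explore_action would stand in for the usual fixed or unconditioned exploration noise, so exploration is concentrated along the directions in which the policy still deviates from the demonstrations.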




Experiments

Performance on Robomimic. On all four tested Robomimic tasks, DGN consistently matches or exceeds the performance of the best baseline, even as the best-performing baseline varies by task. The relative benefit of DGN over RLPD and the other baselines is largest on the more difficult tasks, square and tool hang.


Performance on Adroit. DGN matches or exceeds the performance of all baselines on tasks from the Adroit suite. DGN's relative advantage is largest on the hardest, longest-horizon task: relocate.



Comparison to IBRL
DGN performs similarly to IBRL under favorable conditions, but DGN is more robust: it is less susceptible to suboptimal training of the BC reference policy and better able to utilize suboptimal, multimodal demonstration data.
IBRL Comparison with good (ph) demos. DGN nearly matches the performance of IBRL when IBRL uses a well-trained BC reference policy, and outperforms IBRL when the BC reference policy is undertrained.
IBRL Comparison with worse, multimodal (mh) demos. When using a lower-quality, multimodal (mh) dataset, DGN has higher sample efficiency than IBRL.



Full Residual Policy Learning Ablation. Learning a full residual policy (the policy mean in addition to the covariance) via imitation performs similarly to learning the covariance alone.
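For reference, a hypothetical sketch of the full-residual variant described in this ablation is shown below: in addition to the per-dimension scale, it predicts a mean shift, so exploration noise would be drawn from N(mu(s), diag(sigma(s)^2)) rather than a zero-mean Gaussian. The class and function names are assumptions made for illustration, not the authors' code.

import torch
import torch.nn as nn

class ResidualNoiseModel(nn.Module):
    """Predicts a mean shift and per-dimension scale for exploration noise, conditioned on state."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.trunk(state)
        return self.mean_head(h), self.log_std_head(h).clamp(-5.0, 2.0).exp()

def residual_noise_loss(model, policy, states, expert_actions):
    """Negative log-likelihood of expert-minus-policy residuals under N(mean, std^2)."""
    with torch.no_grad():
        policy_actions = policy(states)  # policy(s) -> action (assumed deterministic)
    mean, std = model(states)
    dist = torch.distributions.Normal(mean, std)
    return -dist.log_prob(expert_actions - policy_actions).sum(-1).mean()

Per the ablation result above, adding the learned mean offers no clear benefit over the covariance-only model.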


BibTeX

@misc{dataguidednoise2025,
      title={Reinforcement Learning via Implicit Imitation Guidance}, 
      author={Perry Dong and Alec M. Lessing and Annie S. Chen and Chelsea Finn},
      url={https://arxiv.org/abs/2506.07505}, 
      year={2025},
      primaryClass={cs.LG},
      archivePrefix={arXiv},
      eprint={2506.07505},
}