

Poster

Learning a Diffusion Model Policy from Rewards via Q-Score Matching

Michael Psenka · Alejandro Escontrela · Pieter Abbeel · Yi Ma


Abstract:

Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning. This is due to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models, and instead utilize a simple behavior cloning term to train the actor, limiting their ability in the actor-critic setting. In this paper, we present a theoretical framework linking the structure of diffusion model policies to a learned Q-function, by relating the score of the policy to the action gradient of the Q-function. We focus on off-policy reinforcement learning and propose a new policy update method from this theory, which we denote Q-score matching. We conduct experiments in simulated environments to demonstrate the viability of our proposed method and compare it to popular baselines.
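The core idea described above is to train the diffusion policy's score network so that, at sampled actions, it points in the direction of the Q-function's action gradient. The snippet below is a minimal, hypothetical sketch of such a Q-score-matching-style update in PyTorch; the network names (score_net, q_net), shapes, the scale factor, and the exact squared-error form of the loss are illustrative assumptions, not the paper's precise algorithm.

```python
# Hypothetical sketch of a Q-score-matching-style policy update.
# Network names, shapes, and the exact loss form are assumptions for illustration.
import torch
import torch.nn as nn

state_dim, action_dim, hidden = 8, 2, 64

# Denoising/score network of the diffusion policy: maps (state, noisy action, t) to a score.
score_net = nn.Sequential(nn.Linear(state_dim + action_dim + 1, hidden),
                          nn.ReLU(), nn.Linear(hidden, action_dim))
# Learned Q-function (critic), assumed trained separately by a standard TD objective.
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                      nn.ReLU(), nn.Linear(hidden, 1))
opt = torch.optim.Adam(score_net.parameters(), lr=3e-4)

def qsm_policy_loss(states, actions, t, scale=1.0):
    """Align the policy's predicted score at (state, action, t) with the
    action gradient of Q, treated here as a fixed target for the actor."""
    actions = actions.detach().requires_grad_(True)
    q = q_net(torch.cat([states, actions], dim=-1)).sum()
    dq_da = torch.autograd.grad(q, actions)[0].detach()  # grad_a Q(s, a)
    pred_score = score_net(torch.cat([states, actions, t], dim=-1))
    return ((pred_score - scale * dq_da) ** 2).mean()

# Example update on placeholder data.
states, actions, t = torch.randn(32, state_dim), torch.randn(32, action_dim), torch.rand(32, 1)
loss = qsm_policy_loss(states, actions, t)
opt.zero_grad(); loss.backward(); opt.step()
```

In practice the noisy actions and diffusion timesteps would come from the policy's own denoising process and the critic from an off-policy TD update; this sketch only illustrates the score-to-action-gradient matching term.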
