

Poster

Reflective Policy Optimization

Yaozhong Gan · Yan Renye · Zhe Wu · Junliang Xing


Abstract:

On-policy reinforcement learning methods such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) often require large amounts of data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that combines past and future state-action information for policy optimization. This approach enables the agent to reflect on and revise its actions in the current state. Theoretical analysis shows that the resulting policy improves monotonically and that the solution space contracts, which accelerates training. Empirical results on standard reinforcement learning benchmarks demonstrate RPO's feasibility and efficacy, yielding superior sample efficiency. The source code is available in the supplementary material.
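As a rough illustration of the idea described above, the sketch below shows a PPO-style clipped surrogate loss that also incorporates the importance ratio of the subsequent state-action pair, so that the update on the current state is weighted by information from the future step. This is a minimal, hypothetical sketch under that assumption; the function name `rpo_style_loss`, its arguments, and the exact way the two ratios are combined are illustrative and not taken from the paper.

```python
# Hypothetical sketch: a clipped surrogate over consecutive state-action pairs,
# coupling the current-step ratio with the next-step ratio. Names and the exact
# objective are illustrative assumptions, not the paper's definition of RPO.
import torch


def rpo_style_loss(logp_new, logp_old,            # log-probs for (s_t, a_t)
                   logp_new_next, logp_old_next,  # log-probs for (s_{t+1}, a_{t+1})
                   advantages, clip_eps=0.2):
    """PPO-like clipped loss weighted by the next state-action's ratio (illustrative)."""
    ratio = torch.exp(logp_new - logp_old)                  # current-step importance ratio
    ratio_next = torch.exp(logp_new_next - logp_old_next)   # next-step importance ratio
    joint = ratio * ratio_next                               # combine current and future information
    unclipped = joint * advantages
    clipped = torch.clamp(joint, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # Dummy data just to show the call signature.
    n = 8
    logp_old = torch.randn(n)
    logp_new = logp_old + 0.1 * torch.randn(n)
    logp_old_next = torch.randn(n)
    logp_new_next = logp_old_next + 0.1 * torch.randn(n)
    adv = torch.randn(n)
    print(rpo_style_loss(logp_new, logp_old, logp_new_next, logp_old_next, adv))
```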
