Policy Improvement Reinforcement Learning

Huaiyang Wang*,1, Xiaojie Li*,1, Deqing Wang1, Haoyi Zhou1,
Zixuan Huang1, Yaodong Yang2, Jianxin Li1, Yikun Ban†,1
1Beihang University 2Peking University *Equal Contribution †Corresponding Author

TL;DR: Stabilizing RLVR with a closed-loop algorithm that retrospectively verifies and maximizes policy improvement.

PIRL Framework

Overview of the Policy Improvement Reinforcement Learning (PIRL) framework. Left: Traditional RLVR methods follow an open-loop paradigm, updating policies from instantaneous rewards without verifying actual improvement. Middle: PIRL introduces a verification stage, forming a closed-loop optimization driven by policy improvement signals. Right: During verification, positive signals are amplified, while negative signals trigger rectification to stabilize training.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. Because this open-loop design updates in isolation at each step, guided only by within-group (batch) reward signals, optimization can drift or collapse with no mechanism to detect and correct these failures.

We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance.

Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones — transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.

The Vulnerability of Group-Based Policy Optimization

RLVR methods relying on group-relative advantages (like GRPO) share a critical blind spot: they optimize blindly based on intra-batch statistics. In sparse, binary reward environments, this open-loop design harbors a severe structural vulnerability.

The Empirical Symptom: Late-Stage Collapse

Practitioners often observe that standard GRPO training becomes unstable as the model's capabilities improve. Common failure modes include:

  • Inexplicable gradient norm explosions.
  • Sudden, severe collapses in Pass@1 accuracy.
  • Irrecoverable training instability requiring aggressive early stopping.
GRPO Training Crash and PIPO Stabilization

Theoretical distortion and empirical instability of GRPO. Left: As predicted by Corollary 1, standard GRPO exhibits a severe sensitivity explosion at the boundaries ($p_t \to 0, 1$). Right: Standard GRPO suffers from drastic gradient norm spikes and severe Pass@1 collapse. Incorporating PIPO effectively stabilizes training.

Crucially, this instability is not an implementation bug—it is a mathematical inevitability. We formally prove that the expected gradient of GRPO is systematically distorted by the policy's current success rate.

Theorem 1: Gradient Distortion

Let $p(q;\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid q)}[R(q,y)]$ denote the policy success rate on prompt $q$, and write $p_t = p(q;\theta_t)$ for its value at iteration $t$ (abbreviated to $p$ in the formula below). Under standard non-degenerate conditions, the expected GRPO update direction with group size $G$ is merely a scaled version of the ideal RLVR gradient $g_{\text{ideal}}$, where the scaling factor $\eta(p_t)$ is intrinsically tied to $p_t$:

$$g_{\text{GRPO}} = \underbrace{\frac{\sum_{k=1}^{G-1}\sqrt{k(G-k)}\binom{G}{k}p^k (1-p)^{G-k}}{G\,p(1-p)\left(1 - p^G - (1-p)^G\right)}}_{\eta(p_t)} \cdot g_{\text{ideal}}$$

This scaling factor exposes a fatal flaw. When a reasoning prompt is either extremely difficult ($p \to 0$) or extremely easy ($p \to 1$), the intra-group advantages trigger a mathematical singularity in $\eta$.

Corollary 1: Sensitivity Explosion

At the probability boundaries, the scaling factor $\eta(p_t)$ exhibits a symmetric singularity and diverges to infinity:

$$ \eta(p) \sim \frac{\sqrt{G-1}}{G \cdot p(1 - p)} \longrightarrow \infty. $$

The Takeaway: As the model masters a task ($p \to 1$) or struggles with a hard one ($p \to 0$), the update magnitude uncontrollably diverges. This violently amplifies stochastic noise over genuine learning signals, inevitably driving the policy into training instability.
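The boundary behaviour can be checked numerically. The sketch below evaluates $\eta(p)$ directly from the expression in Theorem 1 and compares it with the leading-order form in Corollary 1. The function names and the group size `G = 8` are illustrative assumptions, not values taken from the paper.

```python
import math


def eta(p: float, G: int) -> float:
    """Scaling factor eta(p) from Theorem 1 for success rate p and group size G."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Sum over mixed groups only (k = 1 .. G-1 successes out of G samples).
    numerator = sum(
        math.sqrt(k * (G - k)) * math.comb(G, k) * p**k * (1 - p) ** (G - k)
        for k in range(1, G)
    )
    denominator = G * p * (1 - p) * (1 - p**G - (1 - p) ** G)
    return numerator / denominator


def eta_boundary(p: float, G: int) -> float:
    """Leading-order boundary form of eta(p) from Corollary 1."""
    return math.sqrt(G - 1) / (G * p * (1 - p))


if __name__ == "__main__":
    G = 8  # illustrative group size (assumption)
    for p in (0.5, 0.1, 0.01, 0.001):
        print(f"p={p:<6} eta={eta(p, G):12.3f} boundary form={eta_boundary(p, G):12.3f}")
```

The two columns should approach each other as $p$ shrinks, and by symmetry the same holds for $p \to 1$.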

Core Insight: Policy Improvement Reinforcement Learning (PIRL)

As established in our analysis, standard group-relative methods are inherently misaligned with RLVR. By relying solely on instantaneous, intra-batch statistics, this myopic and open-loop design leaves the gradient estimator highly vulnerable to noise and boundary singularities. To break this limitation, we propose Policy Improvement Reinforcement Learning (PIRL)—a framework that introduces policy improvement as explicit, closed-loop feedback. This drives a fundamental shift in how we optimize:

Paradigm Shift:

Instead of maximizing proxy rewards within an isolated batch, we directly optimize inter-iteration policy improvement.

Definition 1: Policy Improvement

We formally define the policy improvement at step $t$ as the difference in the ideal RLVR objective between the current policy $\theta_t$ and the previous policy $\theta_{t-1}$:

$$ \Delta J_t := J_{\mathrm{RLVR}}(\theta_t) - J_{\mathrm{RLVR}}(\theta_{t-1}) $$

By focusing on $\Delta J_t$, PIRL anchors the optimization to a historical baseline, transforming the problem into actively verifying whether the latest update actually yielded a better policy. This closed-loop shift fundamentally stabilizes the optimization trajectory without altering the ultimate goal.

Theorem 2: Objective Alignment

Maximizing the cumulative expected policy improvements over $T$ steps is mathematically equivalent to maximizing the final RLVR objective:

$$ \arg\max_{\{\theta_t\}_{t=1}^T} \sum_{t=1}^T \mathbb{E}\!\left[ \Delta J_t \right] \;=\; \arg\max_{\theta_T} J_{\mathrm{RLVR}}(\theta_T) $$

Significance: This theorem ensures that optimizing the step-by-step improvement $\Delta J_t$ leaves the globally optimal policy unchanged. Only the optimization trajectory toward that optimum is stabilized.
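The intuition behind this alignment is presumably a telescoping sum. As a sketch, assuming $\theta_0$ denotes the fixed initial policy (a symbol not introduced in this section), the intermediate terms of Definition 1 cancel:

$$ \sum_{t=1}^{T} \mathbb{E}\!\left[\Delta J_t\right] \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \big( J_{\mathrm{RLVR}}(\theta_t) - J_{\mathrm{RLVR}}(\theta_{t-1}) \big)\right] \;=\; \mathbb{E}\!\left[J_{\mathrm{RLVR}}(\theta_T)\right] - J_{\mathrm{RLVR}}(\theta_0) $$

Since $J_{\mathrm{RLVR}}(\theta_0)$ is a constant that does not depend on the updates, maximizing the left-hand side amounts to maximizing the final objective.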

PIPO: A Closed-Loop Realization of PIRL

To practically instantiate the PIRL paradigm, we introduce Policy Improvement Policy Optimization (PIPO). At its core is the Policy Improvement Reward ($\hat{r}^{\mathrm{PI}}$), which translates the temporal performance gain into a direct optimization signal.

We maintain a sliding-window memory of the batch means $\mu_{t-k}$ from the past $K$ iterations and use it to construct a robust historical baseline, which gives a stable approximation of the policy improvement objective:

$$\mu_{\mathrm{his}} = \frac{1}{K} \sum_{k=1}^K \mu_{t-k}, \qquad \sigma_{\mathrm{his}} = \sqrt{ \frac{1}{K-1} \sum_{k=1}^K (\mu_{t-k} - \mu_{\mathrm{his}})^2 }$$

The difference between the current batch mean and this historical baseline ($\mu_t - \mu_{\mathrm{his}}$) directly measures empirical Policy Improvement $\Delta J$. Normalizing it yields a standardized improvement signal:

$$ \xi_t := \frac{\mu_t - \mu_{\mathrm{his}}}{\sigma_{\mathrm{his}}} $$
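As a minimal sketch, the baseline and the standardized signal can be computed from a list of past batch means. The function name `improvement_signal` and the `eps` guard against a zero $\sigma_{\mathrm{his}}$ are assumptions added for illustration; the formulas above do not specify how a constant history is handled.

```python
from statistics import mean, stdev
from typing import Sequence


def improvement_signal(mu_t: float, past_means: Sequence[float], eps: float = 1e-8) -> float:
    """Standardized improvement xi_t = (mu_t - mu_his) / sigma_his.

    past_means holds the batch means of the previous K iterations.
    eps is a hypothetical guard for a constant history (sigma_his == 0).
    """
    if len(past_means) < 2:
        raise ValueError("need at least two past batch means for a sample std")
    mu_his = mean(past_means)
    sigma_his = stdev(past_means)  # sample std, i.e. 1/(K-1) normalization
    return (mu_t - mu_his) / max(sigma_his, eps)


if __name__ == "__main__":
    history = [0.40, 0.42, 0.44, 0.46]  # made-up batch means, K = 4
    print(improvement_signal(0.48, history))  # above the baseline: positive
    print(improvement_signal(0.38, history))  # below the baseline: negative
```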

This signal acts as a clear verdict on the previous update: $\xi_t > 0$ indicates a genuine performance gain that should be reinforced, while $\xi_t \le 0$ indicates that the update failed to improve on the historical baseline, for instance because it was driven by noise, and triggers immediate rectification. By fusing this global verification signal with local sample attribution, we construct the core reward mechanism of PIPO:

Definition 2: Policy Improvement Reward

For each historical sample $y_{t-1,i}$ generated during the previous iteration $t-1$, the Policy Improvement (PI) Reward evaluated at iteration $t$ is constructed as:

$$ \hat{r}^{\mathrm{PI}}_{t,i} := \underbrace{ \frac{G \cdot A(y_{t-1,i})} {\sum_{j=1}^G \lvert A(y_{t-1,j}) \rvert} }_{\text{Local Attribution}} \;\cdot\; \underbrace{\phi_\lambda \left(\frac{\mu_t - \mu_{\mathrm{his}}}{\sigma_{\mathrm{his}}} \right)}_{\text{Global Verification}} $$
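The structure of this reward can be sketched in a few lines. The local attribution term rescales the advantages so that their absolute values sum to $G$, and the result is multiplied by a single gate value shared across the group. Because the form of $\phi_\lambda$ is not given in this section, the sketch takes it as a caller-supplied function `gate`. The identity gate in the usage example is a placeholder, not the paper's choice, and the zero-advantage fallback is likewise an assumption.

```python
from typing import Callable, List, Sequence


def local_attribution(advantages: Sequence[float]) -> List[float]:
    """Per-sample weights G * A_i / sum_j |A_j| for one group of G samples."""
    G = len(advantages)
    total = sum(abs(a) for a in advantages)
    if total == 0.0:
        # Assumption: a group with all-zero advantages carries no signal.
        return [0.0] * G
    return [G * a / total for a in advantages]


def pi_rewards(
    advantages: Sequence[float],
    xi: float,
    gate: Callable[[float], float],
) -> List[float]:
    """Policy Improvement rewards: local attribution times gate(xi).

    gate stands in for phi_lambda, whose exact form is not specified here.
    """
    g = gate(xi)
    return [w * g for w in local_attribution(advantages)]


if __name__ == "__main__":
    adv = [1.0, -1.0, 0.5, -0.5]      # made-up advantages from iteration t-1
    identity = lambda x: x            # placeholder gate (assumption)
    print(pi_rewards(adv, 2.0, identity))   # verified gain: signs preserved
    print(pi_rewards(adv, -2.0, identity))  # regression: signs flipped
```

With any gate that preserves the sign of $\xi_t$, a positive signal keeps the direction of the original advantages and a negative signal reverses it, which is the rectification behaviour described above.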

Dual-Stage Optimization Process

To implement the temporal feedback loop, PIPO weaves exploration and verification across consecutive iterations. We illustrate this closed-loop mechanism by tracing the chronological lifecycle of an optimization step spanning iterations $t$ and $t+1$:

Phase 1: Forward Exploration
(at Iteration $t$)

Given the current verified policy $\theta_t$, we generate a fresh batch $\mathcal{B}_t$ and perform a standard exploratory update to obtain the next policy $\theta_{t+1}$:

$$ \theta_{t+1} \leftarrow \theta_t + \alpha_{\mathrm{std}} \cdot \nabla_\theta \mathcal{J}_{\mathrm{group}}(\theta_t;\mathcal{B}_t) $$

Here, $\mathcal{J}_{\mathrm{group}}$ denotes a standard spatial optimization objective. Since PIPO serves as a model-agnostic meta-framework, $\mathcal{J}_{\mathrm{group}}$ can be flexibly instantiated with standard GRPO, or seamlessly integrated with variants like GSPO and DAPO.

Phase 2: Retrospective Verification
(at Iteration $t+1$)

The new policy $\theta_{t+1}$ evaluates a new batch $\mathcal{B}_{t+1}$. Its mean $\mu_{t+1}$ acts as empirical evidence. Using the PI reward derived from $\mu_{t+1}$, we retrospectively verify the historical batch $\mathcal{B}_t$:

$$ \begin{aligned} \mathcal{J}_{\mathrm{PI}} &= \frac{1}{G} \sum_{i=1}^G \min\!\Big( \nu_i \hat{r}^{\mathrm{PI}}_{t+1,i},\; \mathrm{clip}(\dots) \Big) \\ \theta'_{t+1} &\leftarrow \theta_{t+1} + \alpha_{\mathrm{PI}} \cdot \nabla_\theta \mathcal{J}_{\mathrm{PI}}(\theta_{t+1};\mathcal{B}_{t}) \end{aligned} $$

This acts as a corrective gate: updates from step $t$ verified to improve performance ($\xi_{t+1}>0$) are reinforced, while regressions ($\xi_{t+1}<0$) are actively rectified.
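The interleaving of the two phases can be sketched as a plain control-flow skeleton. Everything model-specific is passed in as a hypothetical callable (`sample_batch`, `batch_mean`, `explore_update`, `verify_update`, `signal`), so this shows only the ordering of steps, not the authors' implementation. Two choices here are assumptions: the batch sampled at each iteration is reused both as verification evidence and for the next exploratory update, and verification is skipped during a warm-up of `min_history` iterations.

```python
from typing import Any, Callable, List, Sequence


def pipo_loop(
    theta: Any,
    num_iters: int,
    sample_batch: Callable[[Any], Any],
    batch_mean: Callable[[Any], float],
    explore_update: Callable[[Any, Any], Any],
    verify_update: Callable[[Any, Any, float], Any],
    signal: Callable[[float, Sequence[float]], float],
    window: int = 8,
    min_history: int = 2,
) -> Any:
    """Alternate retrospective verification and forward exploration."""
    past_means: List[float] = []
    prev_batch = None
    for _ in range(num_iters):
        batch = sample_batch(theta)   # fresh batch from the current policy
        mu = batch_mean(batch)        # empirical evidence for the last update
        if prev_batch is not None and len(past_means) >= min_history:
            # Phase 2: verify the previous update on the previous batch.
            xi = signal(mu, past_means[-window:])
            theta = verify_update(theta, prev_batch, xi)
        # Phase 1: standard exploratory update on the fresh batch.
        theta = explore_update(theta, batch)
        past_means.append(mu)
        prev_batch = batch
    return theta
```

Note that the current batch mean is appended to the history only after the signal is computed, matching the baseline definition over $\mu_{t-k}$ for $k \ge 1$.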

Why PIPO Works: Theoretical Guarantees

Together, the two phases form a self-sustaining loop that explicitly verifies policy improvement. We theoretically prove that this dual-stage mechanism guarantees three critical properties:

  • Objective Alignment: The retrospective PI-update performs first-order ascent on the ideal expected policy improvement, ensuring optimization is driven by genuine capability gains rather than batch noise.
  • Geometric Rectification: By gating updates with the global verification signal $\phi$, PIPO effectively neutralizes the gradient sensitivity explosion at extreme probability boundaries.
  • Dual-Mode Regulation: The combined update dynamically adapts. It acts as a momentum booster when $\Delta J > 0$, accelerating beneficial directions, and as a braking mechanism when $\Delta J < 0$, actively rectifying detrimental ones.
PIPO Framework Algorithm

PIPO Framework Algorithm. The continuous interleaving of retrospective verification and forward exploration forms a self-sustaining loop that effectively balances aggressive discovery with verifiably stable improvement.

Experiments

Main Results

PIPO is a plug-and-play framework. We integrate it atop standard group-based RL baselines (GRPO, GSPO, and DAPO) across Qwen3-4B-Base and Qwen3-8B-Base. Our math benchmarks include MATH500, AIME2025, AMC2023, Minerva and OlympiadBench.

Main Results on Mathematical Reasoning

Main Results on Mathematical Reasoning. PIPO consistently improves the performance of group-based RL algorithms across all benchmarks and model scales.

Training Dynamics

Training Dynamics. PIPO outperforms on accuracy and response length across training steps on Qwen3-4B-Base.

Robust Multi-Sample Reasoning

Standard open-loop methods are highly sensitive to sampling noise. By actively detecting and suppressing detrimental updates, PIPO dynamically rectifies the optimization trajectory. This retrospective filtering maintains a robust exploration space, consistently elevating multi-sample reasoning (Pass@8) across all baselines.

Pass@8 Results

Pass@8 Results for Qwen3-4B-Base. PIPO consistently enhances multi-sample reasoning performance across all baselines, demonstrating its robustness to sampling noise and its ability to maintain a stable optimization trajectory.

Scientific Reasoning Capability

We evaluate PIPO with Qwen3-4B-Base on SciKnowEval to assess its performance on knowledge-rich and logic-intensive subjects across Physics, Biology, Chemistry, and Material Science.

SciKnowEval Results

SciKnowEval Results. PIPO enhances complex reasoning capability on SciKnowEval tasks, demonstrating that our closed-loop mechanism effectively enhances logical deduction in knowledge-dense environments.

Ablation Studies

We investigate the sensitivity of PIPO to its core design choices: the historical window size ($K$) and the rectification strength ($\lambda$).

Historical Window Size ($K$)

Ablation on Window Size

The window size $K$ controls how many past iterations enter the historical baseline. While smaller windows are susceptible to batch noise and larger ones introduce stale data, $K=8$ emerges as a robust equilibrium, effectively balancing stable estimation with rapid responsiveness.

Rectification Strategy ($\lambda$)

Ablation on Rectification

$\lambda$ determines the suppression strength applied to negative improvement signals. Hard suppression ($\lambda=1$) overly restricts exploration, whereas no suppression ($\lambda=0$) fails to filter noise. A soft suppression ($\lambda=0.1$) achieves the optimal balance, gently dampening harmful updates without limiting the exploration space.

Wall-Clock Efficiency

Does retrospective verification slow down training? While PIPO introduces a ~45% time overhead, this cost is offset by its superior sample efficiency. By filtering detrimental updates, PIPO prevents wasted exploration and converges to a significantly higher accuracy plateau that standard methods cannot reach, regardless of how long they train.

Training Efficiency

Training Efficiency Analysis. PIPO achieves target accuracies in significantly less total wall-clock time compared to standard GRPO, demonstrating its practical efficiency.

Conclusion

In this work, we address a limitation of existing RLVR: optimization relies on step-wise advantage signals without explicitly verifying whether updates yield genuine policy improvement over time. This objective mismatch causes gradient sensitivity explosion and training instability in sparse-reward reasoning tasks.

To address this, we introduce PIRL (Policy Improvement Reinforcement Learning), which reframes post-training as the maximization of expected inter-iteration performance gains. We further propose PIPO (Policy Improvement Policy Optimization), a closed-loop framework that operationalizes PIRL through a retrospective verification mechanism.

Key Contributions:

  • Novel Problem: We identify the absence of policy improvement feedback in group-based RLVR methods as a root cause of instability in sparse-reward regimes, and introduce Policy Improvement Reinforcement Learning (PIRL), which directly optimizes expected inter-iteration policy improvement, realigning optimization with the ideal RLVR objective.
  • Proposed Algorithm: We propose Policy Improvement Policy Optimization (PIPO), a practical algorithm that operationalizes PIRL through retrospective verification and a parameter-free Policy Improvement Reward, reinforcing genuine improvements while suppressing detrimental updates.
  • Empirical Effectiveness: Experiments on mathematical reasoning benchmarks demonstrate that PIPO consistently outperforms GRPO and its variants, exhibiting smoother training dynamics and improved robustness against training instability.

PIPO provides a highly stable and effective alignment strategy, demonstrating that explicitly verifying policy improvement can fundamentally resolve the instability of standard group-based RL algorithms and robustly improve reasoning performance.

BibTeX


@article{wang2026policyimprovement,
  title={Policy Improvement Reinforcement Learning},
  author={Wang, Huaiyang and Li, Xiaojie and Wang, Deqing and Zhou, Haoyi and Huang, Zixuan and Yang, Yaodong and Li, Jianxin and Ban, Yikun},
  journal={arXiv preprint arXiv:2604.00860},
  year={2026},
}