Policy Improvement Reinforcement Learning

Huaiyang Wang*,1, Xiaojie Li*,1, Deqing Wang1, Haoyi Zhou1,
Zixuan Huang1, Yaodong Yang2, Jianxin Li1, Yikun Ban†,1
1Beihang University  2Peking University  *Equal Contribution  †Corresponding Author

TL;DR: Stabilizing RLVR with a closed-loop algorithm that retrospectively verifies and maximizes policy improvement.

PIRL Framework

Overview of Policy Improvement Reinforcement Learning (PIRL) framework. Left: Traditional RLVR methods follow an open-loop paradigm, updating policies from instantaneous rewards without verifying actual improvement. Middle: PIRL introduces a verification stage, forming a closed-loop optimization driven by policy improvement signals. Right: During verification, positive signals are amplified, while negative signals trigger rectification to stabilize training.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design — updating in isolation at each step, guided only by within-group (batch) reward signals — means optimization can drift or collapse with no mechanism to detect and correct these failures.

We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance.

Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones — transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.

The Vulnerability of Group-Based Policy Optimization

RLVR methods relying on group-relative advantages (like GRPO) share a critical blind spot: they optimize blindly based on intra-batch statistics. In sparse, binary reward environments, this open-loop design harbors a severe structural vulnerability.

The Empirical Symptom: Late-Stage Collapse

Practitioners often observe that standard GRPO training becomes unstable as the model's capabilities improve. Common failure modes include:

  • Inexplicable gradient norm explosions.
  • Sudden, severe collapses in Pass@1 accuracy.
  • Irrecoverable training instability requiring aggressive early stopping.
GRPO Training Crash and PIPO Stabilization

Theoretical distortion and empirical instability of GRPO. Left: As predicted by Corollary 1, standard GRPO exhibits a severe sensitivity explosion at the boundaries ($p_t \to 0, 1$). Right: Standard GRPO suffers from drastic gradient norm spikes and severe Pass@1 collapse. Incorporating PIPO effectively stabilizes training.

Crucially, this instability is not an implementation bug—it is a mathematical inevitability. We formally prove that the expected gradient of GRPO is systematically distorted by the policy's current success rate.

Theorem 1: Gradient Distortion

Let $p(q;\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid q)}[R(q,y)]$ denote the policy success rate. Under standard non-degenerate conditions, the expected GRPO update direction is merely a scaled version of the ideal RLVR gradient $g_{\text{ideal}}$, where the scaling factor $\eta(p)$ is intrinsically tied to the success rate $p$:

$$g_{\text{GRPO}} = \underbrace{\frac{\sum_{k=1}^{G-1}\sqrt{k(G-k)}\binom{G}{k}p^k (1-p)^{G-k}}{G\,p(1-p)\left(1 - p^G - (1-p)^G\right)}}_{\eta(p_t)} \cdot g_{\text{ideal}}$$

This scaling factor exposes a fatal flaw. When a reasoning prompt is either extremely difficult ($p \to 0$) or extremely easy ($p \to 1$), the intra-group advantage normalization triggers a mathematical singularity.

Corollary 1: Sensitivity Explosion

At the probability boundaries, the scaling factor $\eta(p)$ exhibits a symmetric singularity and diverges to infinity:

$$ \eta(p) \sim \frac{\sqrt{G-1}}{G \cdot p(1 - p)} \longrightarrow \infty. $$

The Takeaway: As the model masters a task ($p \to 1$) or struggles with a hard one ($p \to 0$), the update magnitude uncontrollably diverges. This violently amplifies stochastic noise over genuine learning signals, inevitably driving the policy into mode collapse.
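A few lines of Python make Corollary 1 concrete. The function below reproduces the scaling factor $\eta(p)$ from Theorem 1 exactly as written, and compares it against the boundary asymptote $\sqrt{G-1}/(G\,p(1-p))$; the group size $G=8$ is our choice for illustration:

```python
from math import comb, sqrt

def eta(p, G):
    """Scaling factor from Theorem 1: the ratio between the expected
    GRPO update direction and the ideal RLVR gradient."""
    num = sum(sqrt(k * (G - k)) * comb(G, k) * p**k * (1 - p)**(G - k)
              for k in range(1, G))
    den = G * p * (1 - p) * (1 - p**G - (1 - p)**G)
    return num / den

G = 8
for p in (0.5, 0.1, 0.01, 0.001):
    # eta blows up like sqrt(G-1) / (G * p * (1-p)) near the boundaries
    print(f"p={p:<6} eta={eta(p, G):10.2f}  "
          f"asymptote={sqrt(G - 1) / (G * p * (1 - p)):10.2f}")
```

Running this shows $\eta$ growing by roughly an order of magnitude for each tenfold step toward the boundary, and matching the Corollary 1 asymptote closely once $p$ is small, exactly the uncontrolled amplification described above.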

Core Insight: Policy Improvement Reinforcement Learning (PIRL)

As established in our analysis, standard group-relative methods are inherently misaligned with the ideal RLVR objective. By relying solely on instantaneous, intra-batch statistics, this myopic, open-loop design leaves the gradient estimator highly vulnerable to noise and boundary singularities. To break this limitation, we propose Policy Improvement Reinforcement Learning (PIRL), a framework that introduces policy improvement as explicit, closed-loop feedback. This drives a fundamental shift in how we optimize:

Paradigm Shift:

Instead of maximizing proxy rewards within an isolated batch, we directly optimize inter-iteration policy improvement.

Definition 1: Policy Improvement

We formally define the policy improvement at step $t$ as the difference in the ideal RLVR objective between the current policy $\theta_t$ and the previous policy $\theta_{t-1}$:

$$ \Delta J_t := J_{\mathrm{RLVR}}(\theta_t) - J_{\mathrm{RLVR}}(\theta_{t-1}) $$

By focusing on $\Delta J_t$, PIRL anchors the optimization to a historical baseline, transforming the problem into actively verifying whether the latest update actually yielded a better policy. This closed-loop shift fundamentally stabilizes the optimization trajectory without altering the ultimate goal.

Theorem 2: Objective Alignment

Maximizing the cumulative expected policy improvements over $T$ steps is mathematically equivalent to maximizing the final RLVR objective:

$$ \arg\max_{\{\theta_t\}_{t=1}^T} \sum_{t=1}^T \mathbb{E}\!\left[ \Delta J_t \right] \;=\; \arg\max_{\theta_T} J_{\mathrm{RLVR}}(\theta_T) $$
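The equivalence rests on the standard telescoping argument: summing the per-step improvements from Definition 1 collapses to the endpoint gap,

$$ \sum_{t=1}^{T} \Delta J_t = \sum_{t=1}^{T} \big( J_{\mathrm{RLVR}}(\theta_t) - J_{\mathrm{RLVR}}(\theta_{t-1}) \big) = J_{\mathrm{RLVR}}(\theta_T) - J_{\mathrm{RLVR}}(\theta_0), $$

and since the initial policy $\theta_0$ is fixed, maximizing the cumulative improvement is the same as maximizing $J_{\mathrm{RLVR}}(\theta_T)$.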

Significance: This theorem ensures that optimizing the relative step-by-step improvement $\Delta J_t$ does not alter the global optimal policy. We do not change the global optimum—we fundamentally stabilize the optimization trajectory to reach it safely.

PIPO: A Closed-Loop Realization of PIRL

To practically instantiate the PIRL paradigm, we introduce Policy Improvement Policy Optimization (PIPO). At its core is the Policy Improvement Reward ($\hat{r}^{\mathrm{PI}}$), which translates the temporal performance gain into a direct optimization signal.

We maintain a sliding-window memory of the past $K$ iterations, from which we construct a robust historical baseline that approximates the policy improvement objective in a stable manner:

$$\mu_{\mathrm{his}} = \frac{1}{K} \sum_{k=1}^K \mu_{t-k}, \qquad \sigma_{\mathrm{his}} = \sqrt{ \frac{1}{K-1} \sum_{k=1}^K (\mu_{t-k} - \mu_{\mathrm{his}})^2 }$$

The difference between the current batch mean and this historical baseline ($\mu_t - \mu_{\mathrm{his}}$) directly measures empirical Policy Improvement $\Delta J$. Normalizing it yields a standardized improvement signal:

$$ \xi_t := \frac{\mu_t - \mu_{\mathrm{his}}}{\sigma_{\mathrm{his}}} $$

This signal acts as a clear verdict on the previous update: $\xi_t > 0$ indicates a genuine performance gain that should be reinforced, while $\xi_t \le 0$ exposes a regression, likely driven by noise, that triggers immediate rectification. By fusing this global verification signal with local sample attribution, we construct the core reward mechanism of PIPO:
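As a minimal sketch of the baseline construction (the names and the deque mechanics are ours, not the paper's reference implementation), the historical statistics and the verification signal $\xi_t$ can be computed from a rolling window of past batch-mean rewards:

```python
from collections import deque
from statistics import mean, stdev

K = 8  # historical window size; the ablation study favors K = 8
history = deque(maxlen=K)  # batch-mean rewards of the past K iterations

def verification_signal(mu_t, history, eps=1e-8):
    """Standardized improvement signal xi_t = (mu_t - mu_his) / sigma_his."""
    mu_his = mean(history)
    # stdev uses the (K-1) denominator, matching sigma_his above
    sigma_his = stdev(history) if len(history) > 1 else eps
    return (mu_t - mu_his) / max(sigma_his, eps)

# Toy trace: batch means trend upward, so the latest batch verifies as a gain.
for mu in (0.20, 0.22, 0.25, 0.24, 0.28, 0.30, 0.31, 0.33):
    history.append(mu)
xi = verification_signal(0.40, history)
print(f"xi_t = {xi:.2f}")  # xi_t > 0: the previous update is reinforced
```

A new batch mean of 0.40 sits well above the window's mean, so $\xi_t$ comes out strongly positive; feeding in a mean below $\mu_{\mathrm{his}}$ would flip the verdict and trigger rectification.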

Definition 2: Policy Improvement Reward

For each historical sample $y_{t-1,i}$ generated during the previous iteration $t-1$, the Policy Improvement (PI) Reward evaluated at iteration $t$ is constructed as:

$$ \hat{r}^{\mathrm{PI}}_{t,i} := \underbrace{ \frac{G \cdot A(y_{t-1,i})} {\sum_{j=1}^G \lvert A(y_{t-1,j}) \rvert} }_{\text{Local Attribution}} \;\cdot\; \underbrace{\phi \left(\frac{\mu_t - \mu_{\mathrm{his}}}{\sigma_{\mathrm{his}}} \right)}_{\text{Global Verification}} $$

Dual-Stage Optimization Process

To implement the temporal feedback loop, PIPO weaves exploration and verification across consecutive iterations. We illustrate this closed-loop mechanism by tracing the chronological lifecycle of an optimization step spanning iteration $t$ and $t+1$:

Phase 1: Forward Exploration
(at Iteration $t$)

Given the current verified policy $\theta_t$, we generate a fresh batch $\mathcal{B}_t$ and perform a standard exploratory update to obtain the next policy $\theta_{t+1}$:

$$ \theta_{t+1} \leftarrow \theta_t + \alpha_{\mathrm{std}} \cdot \nabla_\theta \mathcal{J}_{\mathrm{group}}(\theta_t;\mathcal{B}_t) $$

Here, $\mathcal{J}_{\mathrm{group}}$ denotes a standard spatial optimization objective. Since PIPO serves as a model-agnostic meta-framework, $\mathcal{J}_{\mathrm{group}}$ can be flexibly instantiated with standard GRPO, or seamlessly integrated with variants like GSPO and DAPO.

Phase 2: Retrospective Verification
(at Iteration $t+1$)

The new policy $\theta_{t+1}$ is evaluated on a new batch $\mathcal{B}_{t+1}$; its batch-mean reward $\mu_{t+1}$ serves as empirical evidence of the previous update's effect. Using the PI reward derived from $\mu_{t+1}$, we retrospectively verify the historical batch $\mathcal{B}_t$:

$$ \begin{aligned} \mathcal{J}_{\mathrm{PI}} &= \frac{1}{G} \sum_{i=1}^G \min\!\Big( \nu_i \hat{r}^{\mathrm{PI}}_{t+1,i},\; \mathrm{clip}(\dots) \Big) \\ \theta'_{t+1} &\leftarrow \theta_{t+1} + \alpha_{\mathrm{PI}} \cdot \nabla_\theta \mathcal{J}_{\mathrm{PI}}(\theta_{t+1};\mathcal{B}_{t}) \end{aligned} $$

This acts as a corrective gate: updates from step $t$ verified to improve performance ($\xi_{t+1}>0$) are reinforced, while regressions ($\xi_{t+1}<0$) are actively rectified.
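The interleaving of the two phases can be traced with a toy numerical loop. Everything below is a stand-in of ours, not the paper's implementation: `J` is a quadratic surrogate for $J_{\mathrm{RLVR}}$, `grad_group` is a noisy exploratory gradient oracle, and `grad_pi` gates its update with a unidirectional $\phi$. Only the control flow (explore, then retrospectively verify) mirrors PIPO:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)
alpha_std, alpha_pi = 0.1, 0.05

def J(theta):                      # stand-in objective, peak at theta = (1, 1)
    return -np.sum((theta - 1.0) ** 2)

def grad_group(theta, noise=0.5):  # Phase 1: noisy exploratory gradient
    return -2.0 * (theta - 1.0) + noise * rng.standard_normal(2)

def grad_pi(theta, xi):            # Phase 2: verification-gated gradient
    phi = min(max(xi, 0.0), 0.5)   # unidirectional rectification
    return phi * -2.0 * (theta - 1.0)

history = [J(theta)]               # sliding record of past objective values
for t in range(200):
    theta = theta + alpha_std * grad_group(theta)   # forward exploration
    xi = (J(theta) - np.mean(history[-8:])) / (np.std(history[-8:]) + 1e-8)
    theta = theta + alpha_pi * grad_pi(theta, xi)   # retrospective verification
    history.append(J(theta))

print(f"final J = {J(theta):.3f}")
```

Despite the injected gradient noise, the verified updates keep the iterate near the optimum; removing the Phase 2 correction widens the steady-state fluctuation, which is the qualitative behavior the dual-mode regulation property predicts.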

Why PIPO Works: Theoretical Guarantees

This forms a self-sustaining loop that explicitly verifies policy improvement. We theoretically prove this dual-stage mechanism guarantees three critical properties:

  • Objective Alignment: The retrospective PI-update performs first-order ascent on the ideal expected policy improvement, ensuring optimization is driven by genuine capability gains rather than batch noise.
  • Geometric Rectification: By gating updates with the global verification signal $\phi$, PIPO effectively neutralizes the gradient sensitivity explosion at extreme probability boundaries.
  • Dual-Mode Regulation: The combined update dynamically adapts. It acts as a momentum booster ($\Delta J > 0$) to accelerate beneficial directions, and a braking mechanism ($\Delta J < 0$) to actively rectify detrimental ones.
PIPO Framework Algorithm

PIPO Framework Algorithm. The continuous interleaving of retrospective verification and forward exploration forms a self-sustaining loop that effectively balances aggressive discovery with verifiably stable improvement.

Experiments

Main Results

PIPO is a plug-and-play framework. We integrate it atop standard group-based RL baselines (GRPO, GSPO, and DAPO) across Qwen3-4B-Base and Qwen3-8B-Base. Our math benchmarks include MATH500, AIME2025, AMC23, and MINERVA.

Main Results on Mathematical Reasoning

Main Results on Mathematical Reasoning. PIPO consistently improves the performance of group-based RL algorithms across all benchmarks and model scales.

Training Dynamics

Training Dynamics. PIPO outperforms the baselines on both accuracy and response length across training steps.

Ability Beyond Math

We evaluate PIPO with Qwen3-4B-Base on SciKnowEval to assess its performance on knowledge-rich and logic-intensive subjects across Physics, Biology, Chemistry, and Material Science.

SciKnowEval Results

SciKnowEval Results. PIPO enhances complex reasoning capability beyond math datasets, demonstrating that improvement verification is a general alignment principle — not task-specific tuning.

Ablation Studies

We investigate the sensitivity of PIPO to its core design choices: the historical window size ($K$) and the rectification strategy ($\phi$).

Historical Window Size ($K$)

Ablation on Window Size

Smaller windows yield higher peak performance but high variance, whereas larger windows underestimate rapid improvement. $K=8$ achieves the optimal balance between rapid adaptation and stable estimation.

Rectification Strategy ($\phi$)

Ablation on Rectification

The Unidirectional setting ([0, 0.5]) yields the best performance. Explicitly penalizing negative improvement signals (bidirectional) can introduce additional instability in sparse-reward tasks.

Computational Efficiency

PIPO introduces a retrospective pass, adding modest computational overhead per step. However, owing to highly stabilized gradients and avoidance of destructive updates, PIPO achieves significantly higher sample efficiency and faster wall-clock convergence.

Training Efficiency

Training Efficiency Analysis. PIPO adds 12.0% to 19.3% computational overhead per step. Despite this, PIPO achieves target accuracies in significantly less total wall-clock time compared to standard GRPO, proving its practical efficiency.

Conclusion

In this work, we address a limitation of existing RLVR methods: optimization relies on step-wise advantage signals without explicitly verifying whether updates yield genuine policy improvement over time. This objective mismatch causes gradient sensitivity explosion and mode collapse in sparse-reward reasoning tasks.

To address this, we introduce PIRL (Policy Improvement Reinforcement Learning), which reframes post-training as the maximization of expected inter-iteration performance gains. We further propose PIPO (Policy Improvement Policy Optimization), a closed-loop framework that operationalizes PIRL through a retrospective verification mechanism.

Key Contributions:

  • Novel Problem: We identify the absence of policy improvement feedback in group-based RLVR methods as a root cause of instability in sparse-reward regimes, and introduce Policy Improvement Reinforcement Learning (PIRL), which directly optimizes expected inter-iteration policy improvement, realigning optimization with the ideal RLVR objective.
  • Proposed Algorithm: We propose Policy Improvement Policy Optimization (PIPO), a practical algorithm that operationalizes PIRL through retrospective verification and a parameter-free Policy Improvement Reward, reinforcing genuine improvements while suppressing detrimental updates.
  • Empirical Effectiveness: Experiments on mathematical reasoning benchmarks demonstrate that PIPO consistently outperforms GRPO and its variants, exhibiting smoother training dynamics and improved robustness against mode collapse.

PIPO provides a highly stable and effective alignment strategy, demonstrating that explicitly verifying policy improvement can fundamentally resolve the instability of standard group-based RL algorithms and robustly improve reasoning performance.

BibTeX


@article{wang2026policyimprovement,
  title={Policy Improvement Reinforcement Learning},
  author={Wang, Huaiyang and Li, Xiaojie and Wang, Deqing and Zhou, Haoyi and Huang, Zixuan and Yang, Yaodong and Li, Jianxin and Ban, Yikun},
  journal={arXiv preprint arXiv:2604.00860},
  year={2026}
}