To practically instantiate the PIRL paradigm, we introduce Policy Improvement Policy Optimization (PIPO). At its core is the Policy Improvement Reward ($\hat{r}^{\mathrm{PI}}$), which translates the temporal performance gain into a direct optimization signal.
We maintain a sliding-window memory of the batch-mean rewards from the past $K$ iterations, constructing a robust historical baseline that approximates the policy improvement objective in a stable manner:
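One natural estimator for this baseline, assuming a simple average over the window (the exact estimator is an assumption, not reproduced from the definition), is:

$$
\mu_{\mathrm{his}} = \frac{1}{K} \sum_{k=1}^{K} \mu_{t-k}
$$

where $\mu_{t-k}$ denotes the mean reward of the batch generated at iteration $t-k$.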
The difference between the current batch mean and this historical baseline ($\mu_t - \mu_{\mathrm{his}}$) directly measures empirical Policy Improvement $\Delta J$. Normalizing it yields a standardized improvement signal:
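Assuming standardization by the window's own statistics (with $\sigma_{\mathrm{his}}$ the standard deviation of the windowed batch means and $\epsilon$ a small constant for numerical stability; the exact normalizer is an assumption), the signal can be written as:

$$
\xi_t = \frac{\mu_t - \mu_{\mathrm{his}}}{\sigma_{\mathrm{his}} + \epsilon}
$$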
This signal acts as a clear verdict on the previous update: $\xi_t > 0$ indicates a genuine performance gain that should be reinforced, while $\xi_t \le 0$ exposes a performance drop, potentially driven by batch noise, and triggers immediate rectification. By fusing this global verification signal with local sample attribution, we construct the core reward mechanism of PIPO:
Definition 2: Policy Improvement Reward
For each historical sample $y_{t-1,i}$ generated during the previous iteration $t-1$, the Policy Improvement (PI) Reward evaluated at iteration $t$ is constructed as:
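A minimal NumPy sketch of this construction, assuming the window-average baseline and standardized signal described above, and assuming a multiplicative fusion of the global signal with group-normalized per-sample advantages (both are illustrative assumptions, not the paper's exact formula):

```python
import numpy as np

def pi_reward(prev_rewards, window_means, mu_t, eps=1e-8):
    """Sketch of the Policy Improvement reward for the historical batch B_{t-1}.

    prev_rewards : per-sample rewards of the previous batch (samples y_{t-1,i})
    window_means : batch-mean rewards of the past K iterations (sliding window)
    mu_t         : mean reward of the fresh batch under the updated policy
    Assumed (not the paper's exact form): mu_his is the window average, xi is
    standardized by the window std, and the global signal scales each sample's
    group-normalized advantage multiplicatively.
    """
    prev_rewards = np.asarray(prev_rewards, dtype=float)
    window_means = np.asarray(window_means, dtype=float)
    mu_his = window_means.mean()                         # historical baseline
    xi = (mu_t - mu_his) / (window_means.std() + eps)    # standardized improvement signal
    # Local attribution: group-normalized advantage of each historical sample
    adv = (prev_rewards - prev_rewards.mean()) / (prev_rewards.std() + eps)
    return xi * adv                                      # fused per-sample PI reward
```

Under this sketch, $\xi_t > 0$ preserves the sign of each historical advantage (reinforcement), while $\xi_t < 0$ flips it (rectification).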
Dual-Stage Optimization Process
To implement the temporal feedback loop, PIPO weaves exploration and verification across consecutive iterations. We illustrate this closed-loop mechanism by tracing the chronological lifecycle of an optimization step spanning iteration $t$ and $t+1$:
Phase 1: Forward Exploration (at Iteration $t$)
Given the current verified policy $\theta_t$, we generate a fresh batch $\mathcal{B}_t$ and perform a standard exploratory update to obtain the next policy $\theta_{t+1}$:
$$
\theta_{t+1} \leftarrow \theta_t + \alpha_{\mathrm{std}} \cdot \nabla_\theta \mathcal{J}_{\mathrm{group}}(\theta_t;\mathcal{B}_t)
$$
Here, $\mathcal{J}_{\mathrm{group}}$ denotes a standard spatial optimization objective. Since PIPO serves as a model-agnostic meta-framework, $\mathcal{J}_{\mathrm{group}}$ can be flexibly instantiated with standard GRPO, or seamlessly integrated with variants like GSPO and DAPO.
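As a concrete reference, the group-relative advantage underlying GRPO-style objectives (a standard construction, sketched here with illustrative names) scores each of the $G$ responses to a prompt against the group's own statistics:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: normalize each sample's reward
    by the mean and standard deviation of its group of G responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Any objective built on such advantages can slot in as $\mathcal{J}_{\mathrm{group}}$ without changing the surrounding PIPO loop.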
Phase 2: Retrospective Verification (at Iteration $t+1$)
The new policy $\theta_{t+1}$ generates and is evaluated on a fresh batch $\mathcal{B}_{t+1}$; its mean reward $\mu_{t+1}$ serves as empirical evidence of the previous update's effect. Using the PI reward derived from $\mu_{t+1}$, we retrospectively verify the historical batch $\mathcal{B}_t$:
$$
\begin{aligned}
\mathcal{J}_{\mathrm{PI}} &= \frac{1}{G} \sum_{i=1}^G \min\!\Big( \nu_i \hat{r}^{\mathrm{PI}}_{t+1,i},\; \mathrm{clip}(\dots) \Big) \\
\theta'_{t+1} &\leftarrow \theta_{t+1} + \alpha_{\mathrm{PI}} \cdot \nabla_\theta \mathcal{J}_{\mathrm{PI}}(\theta_{t+1};\mathcal{B}_{t})
\end{aligned}
$$
This acts as a corrective gate: updates from step $t$ verified to improve performance ($\xi_{t+1}>0$) are reinforced, while regressions ($\xi_{t+1}<0$) are actively rectified.
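Putting the two phases together, the closed loop can be sketched as follows; the callable signatures and the unnormalized signal are illustrative assumptions, not the paper's API:

```python
from collections import deque

def pipo_loop(theta0, sample_batch, grad_group, grad_pi,
              alpha_std=0.1, alpha_pi=0.05, K=8, iters=50):
    """Sketch of PIPO's dual-stage loop (hypothetical signatures).

    sample_batch(theta)       -> list of per-sample rewards for a fresh batch
    grad_group(theta, batch)  -> gradient of the spatial objective J_group
    grad_pi(theta, batch, xi) -> gradient of the retrospective PI objective
    """
    theta = theta0
    window = deque(maxlen=K)      # sliding window of past batch means
    prev_batch = None
    for _ in range(iters):
        batch = sample_batch(theta)        # fresh rollouts B_t under theta_t
        mu_t = sum(batch) / len(batch)
        # Phase 2 (retrospective): verify the previous update against mu_t
        if prev_batch is not None:
            mu_his = sum(window) / len(window)
            xi = mu_t - mu_his             # improvement signal (unnormalized here)
            theta = theta + alpha_pi * grad_pi(theta, prev_batch, xi)
        # Phase 1 (forward): standard exploratory update on the fresh batch
        theta = theta + alpha_std * grad_group(theta, batch)
        window.append(mu_t)
        prev_batch = batch
    return theta
```

In a real instantiation `theta` would be model parameters and the two gradient callables would backpropagate through $\mathcal{J}_{\mathrm{group}}$ and $\mathcal{J}_{\mathrm{PI}}$; the scalar version above only illustrates the control flow.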
Why PIPO Works: Theoretical Guarantees
This forms a self-sustaining loop that explicitly verifies policy improvement. We prove theoretically that this dual-stage mechanism guarantees three critical properties:
- Objective Alignment: The retrospective PI-update performs first-order ascent on the ideal expected policy improvement, ensuring optimization is driven by genuine capability gains rather than batch noise.
- Geometric Rectification: By gating updates with the global verification signal $\phi$, PIPO effectively neutralizes the gradient sensitivity explosion at extreme probability boundaries.
- Dual-Mode Regulation: The combined update dynamically adapts: it acts as a momentum booster ($\Delta J > 0$) to accelerate beneficial directions, and as a braking mechanism ($\Delta J < 0$) to actively rectify detrimental ones.