Policy Improvement Reinforcement Learning

Overview of Policy Improvement Reinforcement Learning, contrasting open-loop RL with the closed-loop PIRL framework and showing PIPO's reinforcement and suppression cases. Traditional RL post-training optimizes local signals without checking whether the policy transition improved the model. PIRL introduces explicit inter-iteration verification; PIPO uses this feedback to amplify progress and damp regressions.

Abstract

Reinforcement learning has become a central post-training paradigm for improving LLM and agent capabilities. Yet existing RL post-training methods share a common blind spot: they construct local learning signals from sampled trajectories, rewards, or feedback-conditioned targets, then update the policy without explicitly verifying whether the resulting policy outperforms its predecessor. Optimizing these local signals does not necessarily produce a better policy, and finite sampling, generation stochasticity, and feedback noise can widen this gap further.

We argue that the missing ingredient is policy improvement feedback: the ability to measure progress across policy iterations. We introduce Policy Improvement Reinforcement Learning (PIRL), which formulates inter-iteration performance gain as an explicit objective structurally aligned with final task performance.

Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), a plug-in closed-loop framework that verifies the previous update against a sliding-window historical performance anchor. PIPO uses this improvement feedback to modulate the local learning signal of the base policy optimization algorithm, reinforcing updates associated with measured progress and suppressing those associated with performance drops.

We provide theoretical evidence that PIPO locally aligns policy updates with the PIRL improvement objective. Experiments on mathematical reasoning, code, tool-use, and self-distillation settings show that PIPO yields consistent gains across PPO, group-relative, and self-distillation policy optimization families.

Math Reasoning

+2.8 AVG

With PIPO, DAPO on Qwen3-4B-Base improves from 46.4 to 49.2 average Pass@1.

Model Scale

4B + 8B

Consistent gains across both Qwen3-4B-Base and Qwen3-8B-Base.

Task Families

4 Domains

Math, code, tool-use, and self-distillation evaluations.

Best SDPO AVG

73.8

PIPO improves SDPO from 69.9 to 73.8 on SciKnowEval.

PIRL: From Open Loop to Closed Loop

Existing RL post-training algorithms differ in how they assign local credit, but most optimize each update in isolation. PIRL changes the question from "which samples look good in the current batch?" to "did the latest policy transition produce measurable progress?"

This is useful because the local learning signal and the actual effect of a policy update are not the same object. A batch may contain noisy verifier outcomes, stochastic generations, or difficulty shifts; after the optimizer follows that signal, the updated policy may or may not perform better. PIRL treats this missing inter-iteration evidence as a first-class feedback signal rather than an after-the-fact diagnostic.

Formally, PIRL measures the improvement produced by a transition from $\theta_t$ to $\theta_{t+1}$ and optimizes the sum of these gains over the training trajectory. Because this sum telescopes to the difference between final and initial performance, the temporal objective keeps the same destination as standard RL while giving the optimizer a way to reason about whether each step was actually helpful.

Policy Improvement

The step-wise gain is defined as $\Delta J_t := J_{\mathrm{RL}}(\theta_{t+1}) - J_{\mathrm{RL}}(\theta_t)$.

PIRL Objective

PIRL maximizes cumulative expected improvement, $\sum_{t=0}^{T-1}\mathbb{E}[\Delta J_t]$, over a sequence of policies starting from a fixed initialization $\theta_0$.

Objective Alignment

For fixed initialization, maximizing cumulative policy improvement exactly maximizes final policy performance.

$$ \arg\max_{\{\theta_t\}_{t=1}^{T}} \sum_{t=0}^{T-1}\mathbb{E}[\Delta J_t] = \arg\max_{\theta_T} J_{\mathrm{RL}}(\theta_T) $$
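The alignment can be checked with a short telescoping argument. The derivation below is a sketch that assumes transitions are indexed from a fixed initialization $\theta_0$ (an indexing convention chosen here for illustration) and that the expectation is taken over training randomness:

$$ \sum_{t=0}^{T-1}\mathbb{E}[\Delta J_t] = \sum_{t=0}^{T-1}\mathbb{E}\big[J_{\mathrm{RL}}(\theta_{t+1}) - J_{\mathrm{RL}}(\theta_t)\big] = \mathbb{E}\big[J_{\mathrm{RL}}(\theta_T)\big] - J_{\mathrm{RL}}(\theta_0) $$

Every intermediate term cancels with its neighbor, and $J_{\mathrm{RL}}(\theta_0)$ is a constant under fixed initialization, so the cumulative improvement and the final performance differ only by a constant offset and share the same maximizers.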

Key shift: PIRL does not replace task rewards or local attribution. It adds policy-improvement feedback revealed by subsequent batches, closing the loop around whether an update actually helped.

PIPO: A Plug-In Realization

PIPO keeps the local credit assignment of a base RL algorithm, then adds a global verification signal that measures whether the previous policy transition improved empirical performance. It can be attached to PPO, group-relative methods such as GRPO, GSPO, and DAPO, and self-distillation methods such as SDPO.

At iteration $t$, PIPO first samples a fresh batch from the current policy and estimates its empirical task performance with the mean verifier reward $\mu_t$. This value is not used as a standalone reward for the current batch; instead, it serves as evidence about the effect of the previous policy update. PIPO compares $\mu_t$ with a short sliding-window anchor formed from the past $K$ iterations, which makes the comparison less sensitive to a single noisy batch.

The resulting standardized signal $\xi_t$ is a compact verdict on the previous transition. If $\xi_t > 0$, the update appears to have improved the policy and the previous local attributions are reinforced. If $\xi_t < 0$, the update appears harmful and PIPO softly suppresses it, scaling the negative feedback by the rectification coefficient $\lambda$. In this way, PIPO does not need to redesign PPO, GRPO, GSPO, DAPO, or SDPO; it modulates their own attribution signals with a policy-improvement check.

Policy Improvement Reward

$$ \mu_t := \frac{1}{|\mathcal{B}_t|}\sum_{(q,y)\in\mathcal{B}_t} R(q,y), \qquad y \sim \pi_{\theta_t}(\cdot \mid q) $$

$$ \mu_{\mathrm{his}} = \frac{1}{K} \sum_{k=1}^{K}\mu_{t-k}, \qquad \xi_t := \frac{\mu_t - \mu_{\mathrm{his}}}{\sigma_{\mathrm{his}}} $$

$$ \hat{r}^{\mathrm{PI}}_{t,i} := a_{t-1,i} \cdot \phi_{\lambda}(\xi_t), \qquad \phi_{\lambda}(x)= \begin{cases} x & x \ge 0,\\ \lambda x & x < 0. \end{cases} $$
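The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `ImprovementTracker` and `phi` are hypothetical, $\sigma_{\mathrm{his}}$ is assumed to be the population standard deviation of the windowed batch means, and the `eps` guard and the zero-signal warm-up are assumptions added so the sketch runs on short histories.

```python
from collections import deque
from statistics import mean, pstdev


def phi(x, lam):
    """Asymmetric rectifier: keep positive feedback, scale negative feedback by lam."""
    return x if x >= 0 else lam * x


class ImprovementTracker:
    """Scores each new batch mean against a sliding-window historical anchor."""

    def __init__(self, window=8, lam=0.1, eps=1e-6):
        self.history = deque(maxlen=window)  # most recent K batch means
        self.lam = lam
        self.eps = eps  # assumed guard against a zero spread in the window

    def feedback(self, rewards):
        """Return phi_lambda(xi_t) for the fresh batch, then record its mean."""
        mu_t = mean(rewards)
        if len(self.history) < 2:
            signal = 0.0  # assumed warm-up: no verdict until some history exists
        else:
            mu_his = mean(self.history)
            sigma_his = pstdev(self.history)  # assumed definition of sigma_his
            xi_t = (mu_t - mu_his) / (sigma_his + self.eps)
            signal = phi(xi_t, self.lam)
        self.history.append(mu_t)
        return signal


# Toy verifier rewards (1 = correct, 0 = incorrect) for four successive batches.
tracker = ImprovementTracker(window=4, lam=0.1)
batches = ([0, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0])
signals = [tracker.feedback(batch) for batch in batches]
print(signals)
```

In this toy run the third batch beats the anchor and yields a positive signal, while the fourth falls below it and yields a negative signal whose magnitude is shrunk by `lam`.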

Phase 1: Exploration

The base algorithm samples the current batch and performs its usual local update from rewards, advantages, or distillation targets.

Phase 2: Verification

The next batch estimates whether that update improved the policy. Positive feedback reinforces the previous update; negative feedback triggers rectification.

The local attribution $a_{t-1,i}$ decides where to assign credit. For PPO and SDPO, PIPO can use the base method's original advantage or token-level attribution; for group-relative methods, it uses normalized group-relative attribution. The scalar feedback $\phi_{\lambda}(\xi_t)$ decides whether the previous direction is supported by measured policy improvement.
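To make the division of labor concrete, the sketch below combines a group-relative attribution with the scalar feedback. It assumes the common form of group normalization (reward minus group mean, divided by group standard deviation); the function names and the `eps` term are hypothetical and not taken from the source.

```python
from statistics import mean, pstdev


def group_relative_attribution(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def policy_improvement_rewards(prev_attributions, xi_t, lam=0.1):
    """Scale each previous attribution a_{t-1,i} by phi_lambda(xi_t)."""
    scale = xi_t if xi_t >= 0 else lam * xi_t
    return [a * scale for a in prev_attributions]


# One group of four responses from the previous iteration; only the first was correct.
a_prev = group_relative_attribution([1.0, 0.0, 0.0, 0.0])
reinforced = policy_improvement_rewards(a_prev, xi_t=1.5)
suppressed = policy_improvement_rewards(a_prev, xi_t=-1.5)
print(reinforced[0], suppressed[0])
```

With a positive $\xi_t$ the previously rewarded response keeps its positive credit, scaled up by the measured improvement. With a negative $\xi_t$ the sign of that credit flips, but its magnitude is reduced by $\lambda$, which is the soft suppression described above.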

Theoretical Lens

The paper provides a general approximate PIRL-ascent argument and uses group-relative optimization as an illustrative sensitivity case. In sparse-reward regimes, group-relative normalization can apply a state-dependent scaling to the ideal gradient and become sensitive near probability boundaries.

Boundary sensitivity and empirical instability of group-relative optimization. Group-relative optimization can amplify rare non-degenerate signals near $p_t \to 0$ or $p_t \to 1$. PIPO checks the realized policy effect and modulates updates with policy-improvement feedback.
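A small numeric illustration of this sensitivity, under simplifying assumptions that are not the paper's analysis: rewards are binary, `p` stands in for the success probability written as $p_t$ above (an assumed interpretation), and the group is large enough that its mean and standard deviation match the Bernoulli values $p$ and $\sqrt{p(1-p)}$.

```python
import math


def normalized_advantages(p):
    """Group-normalized advantages of a success and a failure for Bernoulli(p) rewards."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    sigma = math.sqrt(p * (1.0 - p))
    return (1.0 - p) / sigma, (0.0 - p) / sigma


for p in (0.5, 0.1, 0.01, 0.001):
    success, failure = normalized_advantages(p)
    print(f"p={p:<6} success={success:8.2f} failure={failure:8.2f}")
```

Under these assumptions the advantage of a rare success equals $\sqrt{(1-p)/p}$, which grows without bound as $p \to 0$; by symmetry, the advantage of a rare failure grows in magnitude as $p \to 1$.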

Experiments

We evaluate PIPO as a plug-and-play module across two Qwen3 model scales and four task families. All comparisons use the same evaluator within each setting; improvements are reported over the corresponding open-loop baseline.

The experiments are organized around three questions: whether policy-improvement feedback consistently improves different policy optimization families, whether the gains transfer beyond mathematical reasoning, and whether the closed-loop mechanism remains robust under multi-sample decoding, random data seeds, and different design choices.

Mathematical Reasoning: Qwen3-4B-Base

We first evaluate on five mathematical reasoning benchmarks after training on MATH. On Qwen3-4B-Base, PIPO improves the average Pass@1 score for PPO, GRPO, GSPO, and DAPO. The largest gain appears on DAPO, where closed-loop verification raises the average from 46.4 to 49.2.

Method	MATH500	AIME25	AMC23	Minerva	Olympiad	AVG
Base Model	58.2	7.4	45.0	14.0	28.6	30.6
PPO	80.3	11.1	62.5	25.0	41.2	44.0
+ PIPO	80.1	18.5	65.0	25.7	39.7	45.8
GRPO	79.3	18.5	60.0	21.0	40.5	43.9
+ PIPO	80.9	18.5	60.0	26.8	44.4	46.1
GSPO	78.5	18.5	62.5	23.2	39.8	44.5
+ PIPO	80.3	22.2	60.0	24.6	41.8	45.8
DAPO	81.3	22.2	65.0	21.7	41.8	46.4
+ PIPO	82.7	22.2	70.0	25.0	46.3	49.2
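As a quick arithmetic check, the AVG column and the per-optimizer gains can be recomputed from the five benchmark scores in the table. The snippet is only a verification aid; the dictionary layout and helper name are illustrative.

```python
# Per-benchmark Pass@1 scores copied from the Qwen3-4B-Base table:
# MATH500, AIME25, AMC23, Minerva, Olympiad.
scores = {
    "PPO": ([80.3, 11.1, 62.5, 25.0, 41.2], [80.1, 18.5, 65.0, 25.7, 39.7]),
    "GRPO": ([79.3, 18.5, 60.0, 21.0, 40.5], [80.9, 18.5, 60.0, 26.8, 44.4]),
    "GSPO": ([78.5, 18.5, 62.5, 23.2, 39.8], [80.3, 22.2, 60.0, 24.6, 41.8]),
    "DAPO": ([81.3, 22.2, 65.0, 21.7, 41.8], [82.7, 22.2, 70.0, 25.0, 46.3]),
}


def average(values):
    """Unweighted mean over benchmarks, rounded to one decimal as in the table."""
    return round(sum(values) / len(values), 1)


gains = {}
for method, (baseline, with_pipo) in scores.items():
    base_avg, pipo_avg = average(baseline), average(with_pipo)
    gains[method] = round(pipo_avg - base_avg, 1)
    print(f"{method}: {base_avg} -> {pipo_avg} (gain {gains[method]})")

best = max(gains, key=gains.get)
print("largest gain:", best)
```

The recomputed averages match the AVG column, and DAPO shows the largest gain, consistent with the text.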

Mathematical Reasoning: Qwen3-8B-Base

The same trend holds at the larger 8B scale. PIPO improves every base optimizer on average, with particularly clear gains for GSPO and DAPO. This suggests that the policy-improvement signal is not tied to one model size or one specific group-relative variant.

Method	MATH500	AIME25	AMC23	Minerva	Olympiad	AVG
Base Model	65.0	11.1	45.0	17.3	31.1	33.9
PPO	81.3	18.5	67.5	25.4	43.0	47.1
+ PIPO	83.7	18.5	70.0	28.3	43.3	48.8
GRPO	80.1	22.2	67.5	27.6	42.4	48.0
+ PIPO	82.3	22.2	67.5	29.0	43.6	48.9
GSPO	81.7	22.2	67.5	26.8	45.5	48.7
+ PIPO	83.5	25.9	72.5	28.3	47.5	51.5
DAPO	85.3	25.9	75.0	27.9	48.7	52.6
+ PIPO	86.3	29.6	75.0	30.9	52.2	54.8

Training Dynamics

Final accuracy is only part of the story. We also track training trajectories on Qwen3-4B-Base. Compared with their open-loop counterparts, PIPO-enhanced methods generally climb to stronger accuracy and reward plateaus, while maintaining longer responses that indicate sustained reasoning exploration.

Training dynamics. PIPO-enhanced methods generally reach higher and more stable average Pass@1 and reward trajectories, while supporting longer reasoning responses.

Code and Tool-Use RL Tasks

To test whether PIPO is specific to math verifiers, we train and evaluate on task-specific code and tool-use settings. These tasks use executable tests or tool feedback rather than only mathematical answer extraction, giving a different source of verifiable supervision.

Method	TACO	LCBv6	HumanEval	MBPP	Code AVG	RLLA	BFCL	APIBank	Tool AVG
PPO	23.4	31.2	85.4	41.4	45.4	87.1	66.6	90.4	81.4
+ PIPO	22.1	34.9	86.0	45.0	47.0	89.0	71.2	90.4	83.5
GRPO	23.8	31.5	83.5	44.2	45.8	85.0	58.2	90.6	77.9
+ PIPO	25.4	32.5	86.0	43.6	46.9	85.6	61.7	91.7	79.7
GSPO	23.3	32.6	81.7	45.0	45.7	85.6	59.7	89.6	78.3
+ PIPO	23.4	32.1	84.2	45.8	46.4	84.0	62.8	90.6	79.1
DAPO	25.4	33.5	86.0	43.8	47.2	87.4	62.2	90.2	79.9
+ PIPO	27.3	36.4	87.2	44.4	48.8	87.7	67.7	90.4	81.9

PIPO improves the average score for every base algorithm in both code and tool-use evaluations, supporting the claim that policy-improvement feedback transfers beyond mathematical reasoning.

Self-Distillation on SciKnowEval

We further evaluate whether PIPO remains useful when the base method already has richer local supervision. In the SciKnowEval self-distillation setting, PIPO improves the average score of both GRPO (57.9 to 60.4) and SDPO (69.9 to 73.8), indicating that inter-iteration verification complements dense teacher-style attribution rather than replacing it.

Method	Biology	Chemistry	Material	Physics	AVG
Qwen3-8B	4.0	13.3	39.4	30.0	21.7
GRPO	40.0	72.9	71.3	47.5	57.9
+ PIPO	54.0	60.5	73.4	53.8	60.4
SDPO	52.0	76.7	73.4	77.5	69.9
+ PIPO	60.0	79.1	79.8	76.3	73.8

Robustness and Design Choices

Beyond headline accuracy, we study whether PIPO changes the reliability of the training process. These experiments look at multi-sample reasoning, sensitivity to random data seeds, the rectification coefficient, the historical window size, and wall-clock efficiency.

Pass@8 Multi-Sample Reasoning

Pass@8 measures whether the trained model can produce a stronger set of candidate solutions under multi-sample decoding. PIPO improves the average Pass@8 score for every evaluated optimizer, with the clearest gains for GRPO (57.0 to 59.3) and GSPO (57.9 to 59.3).
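The source does not state how Pass@8 is estimated. One widely used choice in the code-generation evaluation literature is the unbiased estimator computed from `n` sampled solutions of which `c` are correct; the sketch below shows that estimator as an assumption about the metric, not as the authors' evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Chance that at least one of k samples, drawn without replacement from n, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 16 samples and a single correct one, half of all 8-subsets contain it.
print(pass_at_k(n=16, c=1, k=8))
```

A benchmark-level score would then be the mean of `pass_at_k` over problems.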

Method	MATH500	AIME25	AMC23	Minerva	Olympiad	AVG
PPO	87.2	20.4	79.0	32.9	54.0	54.7
+ PIPO	88.0	28.1	72.6	33.0	53.9	55.1
GRPO	88.3	27.6	80.2	34.7	54.1	57.0
+ PIPO	89.4	30.3	84.0	35.9	57.1	59.3
GSPO	87.2	29.1	81.3	34.8	57.2	57.9
+ PIPO	88.9	31.2	82.7	35.9	57.6	59.3
DAPO	88.2	30.7	84.9	35.9	59.4	59.8
+ PIPO	89.2	31.4	86.2	35.8	60.0	60.5

Random Seed Robustness

Sparse-reward RL can be sensitive to data order and sampling stochasticity. Across three data seeds, PIPO-enhanced methods converge to higher and more stable plateaus, suggesting that retrospective verification filters out update directions that are only accidentally favored by a particular seed.

Random seed robustness. Across three data seeds, PIPO-enhanced methods converge to stronger performance plateaus and are less affected by seed-specific sampling noise.

Ablations

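PIPO has two main control knobs. The rectification coefficient $\lambda$ controls how strongly negative improvement feedback suppresses the previous update, while the window size $K$ controls how much recent history is used to build the performance anchor. Moderate settings provide the best balance between responsiveness and noise reduction: $\lambda = 0.1$ and $K = 8$ give the best average (47.9), while the largest values tested ($\lambda = 1$, $K = 32$) give the smallest gains over the GRPO baseline row (45.0).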

Rectification Coefficient $\lambda$

$\lambda$	MATH500	AIME25	AMC23	Minerva	Olympiad	AVG
GRPO	79.3	18.5	60.0	26.1	41.2	45.0
0	80.5	22.2	62.5	26.5	42.1	46.8
0.05	80.1	25.9	62.5	27.6	42.6	47.7
0.1	81.3	22.2	65.0	28.3	42.7	47.9
0.2	81.9	18.5	62.5	26.8	43.1	46.6
0.5	81.1	22.2	62.5	25.4	42.0	46.6
1	79.3	18.5	65.0	25.7	41.4	46.0

Window Size $K$

$K$	MATH500	AIME25	AMC23	Minerva	Olympiad	AVG
GRPO	79.3	18.5	60.0	26.1	41.2	45.0
2	80.1	22.2	65.0	26.8	41.8	47.2
4	80.3	22.2	65.0	26.8	43.5	47.6
8	81.3	22.2	65.0	28.3	42.7	47.9
16	81.4	22.2	65.0	26.8	43.5	47.8
32	79.7	18.5	65.0	27.2	41.3	46.3
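One way to read the window-size trend is as a trade-off between anchor noise and anchor staleness. The sketch below uses an idealized model that is an assumption for illustration only: batch means carry independent noise with a common standard deviation, so the noise of a $K$-batch average shrinks as $1/\sqrt{K}$, while the average age of the entries in the window grows as $(K+1)/2$ iterations. Both helper names are hypothetical.

```python
import math


def anchor_noise(sigma_batch, window):
    """Standard error of a K-batch mean when batch noise is i.i.d. with std sigma_batch."""
    return sigma_batch / math.sqrt(window)


def anchor_lag(window):
    """Average age, in iterations, of the batch means inside the window (k = 1..K)."""
    return (window + 1) / 2


for k in (2, 4, 8, 16, 32):
    print(f"K={k:<3} noise={anchor_noise(1.0, k):.3f} lag={anchor_lag(k):.1f}")
```

Under this model a small window gives a noisy anchor, while a large window compares the current policy against performance from many iterations ago, which is consistent with moderate values of $K$ performing best in the table.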

Wall-Clock Efficiency

Retrospective verification adds some algorithm-dependent computation, but it can reduce wasted optimization by correcting harmful steps. In wall-clock terms, PIPO reaches stronger accuracy under comparable or moderately increased training time.

Wall-clock efficiency. PIPO introduces algorithm-dependent overhead, but reaches stronger accuracy under comparable or moderately increased training time.

Conclusion

PIRL identifies policy-improvement feedback as a missing ingredient in current RL post-training: local rewards, advantages, and distillation targets are useful, but they do not verify whether an update actually made the policy better.

PIPO operationalizes this idea as a plug-in closed-loop framework. It compares current empirical performance with a sliding-window historical anchor, then uses the resulting improvement signal to modulate the base algorithm's local learning signal.

Across PPO, GRPO, GSPO, DAPO, and SDPO settings, PIPO yields consistent gains on mathematical reasoning, code, tool-use, and self-distillation evaluations, while improving training stability and robustness.

BibTeX

@article{wang2026policyimprovement,
  title={Policy Improvement Reinforcement Learning},
  author={Wang, Huaiyang and Li, Xiaojie and Wang, Xiaohan and Zhang, Zhixia and Lu, Xiaodong and Huang, Zixuan and Chai, Jiajun and Yin, Guojun and Wang, Deqing and Zhou, Haoyi and Yang, Yaodong and Li, Jianxin and Ban, Yikun},
  journal={arXiv preprint arXiv:2604.00860},
  year={2026}
}