Training Qwen2.5-1.5B-Instruct on GSM8K math word problems with Group Relative Policy Optimization (GRPO), the RL technique behind DeepSeek-R1. A LoRA adapter, 8 A4000s split 4/4 between a vLLM generation server and DDP training, 2 epochs: strict accuracy climbs from 1.7% to 48.7%.

| Model | Strict | Lenient |
|---|---|---|
| Qwen2.5-1.5B-Instruct (baseline) | 1.7% | 38.3% |
| + GRPO (ours) | 48.7% | 55.0% |
| Δ | +47.0 pts | +16.7 pts |

Strict counts answers inside <answer> tags; lenient falls back to the last number in the output. The baseline mostly knows the math but won’t use the required format, so strict is near zero. After training the model reliably emits the XML structure and its math improves on top of that.
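A minimal sketch of how the two metrics could be computed. It assumes strict takes the content of the last <answer> block, and lenient uses that block when present and otherwise the last number in the text. The function names, the regexes, and the comma/dollar normalization are illustrative assumptions, not the project's actual evaluation code.

```python
import re

# Assumed patterns: the write-up only says "inside <answer> tags" and "last number".
ANSWER_TAG = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(text):
    """Drop thousands separators and dollar signs so '$1,080' equals '1080'."""
    return text.replace(",", "").replace("$", "").strip()


def strict_answer(completion):
    """Content of the last <answer>...</answer> block, or None if there is none."""
    matches = ANSWER_TAG.findall(completion)
    return normalize(matches[-1]) if matches else None


def lenient_answer(completion):
    """Tagged answer if present, otherwise the last number anywhere in the text."""
    tagged = strict_answer(completion)
    if tagged is not None:
        return tagged
    numbers = NUMBER.findall(completion)
    return normalize(numbers[-1]) if numbers else None


def score(completion, truth):
    """Compare both extractions against the ground-truth string."""
    return {
        "strict": strict_answer(completion) == truth,
        "lenient": lenient_answer(completion) == truth,
    }


baseline = "Step 2: 180 × 3 = 540 meters/week.\nThe final answer is 540 meters."
trained = "<reasoning>\n9 × 60 = 540 meters in a week.\n</reasoning>\n<answer>540</answer>"
print(score(baseline, "540"))  # {'strict': False, 'lenient': True}
print(score(trained, "540"))   # {'strict': True, 'lenient': True}
```

Under these assumptions an untagged but correct answer scores 0 on strict and 1 on lenient, which is the gap the baseline row in the table shows.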

Before and after

Two GSM8K test questions, same greedy-decode settings, run through the baseline model and the GRPO-trained model.

Q1. James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week? Ground truth: 540

Baseline (no format tag)

To determine how many total meters James runs in a week, we need to follow these steps:

1. Calculate the number of meters James runs per session.
2. Multiply that by the number of sessions he has.

Step 1: ... 3 × 60 meters = 180 meters.
Step 2: 180 meters/session × 3 sessions/week = 540 meters/week.

The final answer is 540 meters.

Right answer, no <answer> tag → 0 under strict, 1 under lenient.

Trained

<reasoning>
James runs 3 sprints per day and runs 3 days a week, so he runs a
total of 3 × 3 = 9 sprints in a week.
Each sprint is 60 meters long, so James runs a total of
9 × 60 = 540 meters in a week.
</reasoning>
<answer>540</answer>

Right answer, right format → credit under both.

Q2. Eliza's rate per hour for the first 40 hours is $10. Overtime pay is 1.2× her regular rate. She worked 45 hours this week. What are her earnings? Ground truth: 460

Baseline (arithmetic slip)

Regular pay = Regular rate × Hours worked
Regular pay = $10 × 40

Overtime pay = 1.2 × (10 × 40)

Let's perform these calculations:
- Regular pay: $10 × 40 = $40
- Overtime pay: 1.2 × ($10 × 40) = 1.2 × $40 = $48
- Total earnings: $40 + $48 = $88

Therefore, Eliza's total earnings for the week are $88.

Splits pay into regular and overtime, then asserts $10 × 40 = $40 and carries the error downstream. It also applies the 1.2× multiplier to all 40 hours instead of separating out the 5 overtime hours.

Trained

<reasoning>
Regular earnings = $10/hour × 40 hours = $400

Overtime hours = 45 − 40 = 5 hours
Overtime pay rate = 1.2 × $10/hour = $12/hour
Overtime earnings = $12/hour × 5 hours = $60

Total earnings = $400 + $60 = $460
</reasoning>

<answer>460</answer>

Correctly separates regular vs overtime hours, keeps its bookkeeping clean, tags the answer.

Q1 is the easy win: the baseline knew the answer and just wouldn’t tag it. In Q2, the baseline splits pay into regular and overtime, asserts $10 × 40 = $40 in the first step, and never recovers; the trained model keeps each step explicit and answers $460. Most of the +16.7 lenient points look like Q2: GRPO pressuring the model to keep its arithmetic straight from one step to the next.

About this project

Reinforcement learning is an important pillar of ML because almost any problem can be formulated as an RL problem. Give an agent a way to act, a way to observe, and a scalar reward at the end, and the same family of algorithms can teach it to balance a cart-pole, beat Lee Sedol at Go, stack cups on a table, or solve competition math. The main things that change are the policy (an MLP for cart-pole, a 1.5B transformer here) and the reward. This is why marrying RL with neural networks (which are basically giant learnable circuits that can, in principle, approximate a very wide class of functions) has proven to be so powerful.

So much generality comes at the cost of underspecification and a tendency toward policy collapse during training. To train models that are useful beyond toy tasks, you have to deal with exploration cost and reward variance. PPO (Schulman et al., 2017) has been the default policy-gradient algorithm for most of the years since. You sample trajectories under the current policy, compute an advantage against a baseline, and take a gradient step clipped so the new policy can’t stray too far from the old one on any single action. The clip is what made policy gradients stable enough to run for millions of iterations without collapse. PPO is also the algorithm inside the original ChatGPT RLHF pipeline.

PPO’s baseline comes from a critic: a learned value network almost as big as the policy. GRPO (Shao et al., 2024) removes the critic. For each prompt, it samples a group of $G$ completions, scores all of them, and uses the group’s own mean and std as the baseline, $A_i = (R_i - \text{mean}(R)) / \text{std}(R)$. Above-average completions get pushed up, below-average ones get pushed down, and no second network can drift from the true reward. The PPO clipped ratio and KL-to-base remain:

\[J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim D,\, \{o_i\} \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\left(r_i(\theta)\, A_i,\; \text{clip}\!\left(r_i(\theta), 1\pm\epsilon\right) A_i\right)\right] - \beta\, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\]

with $r_i(\theta) = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$.
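The pieces of the objective can be sketched in plain Python. This follows the sequence-level form written above, with one log-probability per completion. The function names, the population standard deviation, the small `eps` guard for groups where every reward is identical, and the clip range of 0.2 are all assumptions for illustration; the KL term is passed in as a precomputed number rather than estimated here.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """A_i = (R_i - mean(R)) / (std(R) + eps) within one prompt's group.

    eps is an assumed guard so a group with identical rewards yields zeros
    instead of a division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single completion."""
    ratio = math.exp(logp_new - logp_old)  # r_i(theta)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def grpo_objective(logp_new, logp_old, rewards, kl, beta=0.01, clip_eps=0.2):
    """Group-averaged clipped surrogate minus beta times a given KL value."""
    advantages = group_advantages(rewards)
    terms = [
        clipped_term(n, o, a, clip_eps)
        for n, o, a in zip(logp_new, logp_old, advantages)
    ]
    return mean(terms) - beta * kl


# Two completions scored above the group mean, two below.
print(group_advantages([1.0, 1.0, 0.0, 0.0]))
```

When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient from the clipped term.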

Pair GRPO with a verifiable reward and the rest of the old RLHF stack goes away as well: with no learned reward model, there is no preference data to collect. All you need to do is sample completions, check them with a verifier, and update. The verifier can be anything deterministic, e.g. a math checker, a unit-test runner, a theorem prover, or a compiler. DeepSeek-R1 (DeepSeek-AI, 2025) scaled this loop into a reasoning model competitive with the strongest frontier models in early 2025. This project is the miniature version I built to wrap my head around the loop.

Many real tasks don’t have fully deterministic rewards but are still specific enough that humans can clearly tell right from wrong. That makes me think future AI work will involve building complex yet robust reward models, alongside RL algorithms designed to learn well from them.

Setup

| Component | Details |
|---|---|
| Model | Qwen2.5-1.5B-Instruct, LoRA (rank 16, alpha 32) on q/k/v/o projections |
| Algorithm | GRPO via TRL’s GRPOTrainer |
| Dataset | GSM8K train split (7,473 examples, 2 epochs) |
| Rewards | Correctness (weight 2.0): full credit for match inside <answer>, 0.5 for match as last number. Format (0.5): valid <reasoning>/<answer> structure. Int (0.5): extracted answer parses as a number. |
| KL penalty | $\beta = 0.01$ |
| Optimizer | PagedAdamW 8-bit, cosine LR, 10% warmup, weight decay 0.1 |
| Effective batch | 2 × 8 generations × 4 accumulation = 64 |
| Hardware | 8× RTX A4000 (16GB): 4 for vLLM generation, 4 for DDP training |
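The three reward components in the table could look roughly like this. It is a sketch under stated assumptions: the 0.5 for a last-number match is treated as an absolute reward rather than a fraction of the 2.0 weight, "Int" is read as "the tagged answer parses as a number", and the regexes and function names are hypothetical rather than taken from the project's code.

```python
import re

# Assumed notion of "valid structure": one reasoning block followed by one answer block.
FORMAT = re.compile(
    r"^\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$", re.DOTALL
)
ANSWER_TAG = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
NUMERIC = re.compile(r"-?\d+(?:\.\d+)?")


def tagged_answer(completion):
    """Content of the last <answer> block with commas removed, or None."""
    matches = ANSWER_TAG.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def correctness_reward(completion, truth, full=2.0, partial=0.5):
    """Full credit for a tagged match, partial credit for a last-number match."""
    if tagged_answer(completion) == truth:
        return full
    numbers = NUMBER.findall(completion)
    if numbers and numbers[-1].replace(",", "") == truth:
        return partial
    return 0.0


def format_reward(completion, weight=0.5):
    """Reward the <reasoning>...</reasoning><answer>...</answer> layout."""
    return weight if FORMAT.match(completion) else 0.0


def int_reward(completion, weight=0.5):
    """Reward a tagged answer that is a bare number."""
    tagged = tagged_answer(completion)
    return weight if tagged is not None and NUMERIC.fullmatch(tagged) else 0.0


def total_reward(completion, truth):
    """Sum of the three components; 3.0 is the maximum under these assumptions."""
    return (
        correctness_reward(completion, truth)
        + format_reward(completion)
        + int_reward(completion)
    )


trained = "<reasoning>\nTotal earnings = $400 + $60 = $460\n</reasoning>\n\n<answer>460</answer>"
print(total_reward(trained, "460"))  # 3.0
print(total_reward("The final answer is 540 meters.", "540"))  # 0.5
```

These are per-sample functions. TRL's GRPOTrainer works with reward functions that receive a batch of completions and return a list of floats, so in practice each one would need a thin batch wrapper.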

We tell ourselves we understand something just by reading about it, but there’s truly no substitute for implementing it. One simple thing I learned was to use vLLM as a separate generation server. When doing online RL on LLMs, sampling is the true bottleneck, not the gradient step. Every optimizer step needs $G = 8$ completions per prompt, and each completion takes hundreds of sequential decoding steps, which dwarfs the time spent on the gradient update. TRL supports running vLLM on dedicated GPUs as an HTTP sampling server that receives the updated policy weights during training; here 4 A4000s serve generations and 4 run DDP.

Training dynamics

Training curves pulled from the final checkpoint’s trainer_state.json (one point every 10 steps, 3,738 steps total over 2 epochs).


As expected, format reward saturates within a few hundred steps, since wrapping the answer in <reasoning>/<answer> is the easiest signal to pick up. Correctness creeps up over the full run and is still climbing at the end of epoch 2. KL to the base model stays small (with $\beta = 0.01$), which suggests the instruct model’s general capabilities are largely preserved. Total reward rises from ~0.7 to ~2.5.
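For anyone who wants to pull the same curves, here is a small sketch of reading a metric series out of trainer_state.json. The Hugging Face Trainer stores its logs under a `log_history` list; the metric key names used below (`reward`, `kl`) are assumptions that depend on the TRL version, and the numbers in the example dict are made-up placeholders, not values from this run.

```python
import json


def load_state(path):
    """Read a trainer_state.json file into a dict."""
    with open(path) as f:
        return json.load(f)


def load_series(state, key):
    """Return (steps, values) for every log entry that recorded `key`."""
    steps, values = [], []
    for entry in state.get("log_history", []):
        if key in entry and "step" in entry:
            steps.append(entry["step"])
            values.append(entry[key])
    return steps, values


# Placeholder state with the same shape as the real file (values are invented).
example_state = {
    "log_history": [
        {"step": 10, "reward": 0.5, "kl": 0.0},
        {"step": 20, "reward": 1.0, "kl": 0.25},
        {"step": 20, "eval_loss": 1.0},  # entries without the key are skipped
    ]
}
print(load_series(example_state, "reward"))  # ([10, 20], [0.5, 1.0])
```

With a real checkpoint, `load_series(load_state(path), key)` would give the x and y arrays for one curve; inspecting the keys of a single `log_history` entry shows which metric names that TRL version actually logs.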

Takeaways

Most of the 47-point strict delta is format compliance. The +16.7 lenient delta is the honest “did math reasoning improve” number, and it is still substantial for 1.5B parameters on two epochs of 7.5k examples. Bigger gains would probably need a bigger base model and longer generations for multi-step scratch work, but as a proof that GRPO runs on commodity GPUs with no reward model and no preference data, this is pretty encouraging. It’s also just cool that I can teach an AI how to do math in English, like wtf we are literally living in the future.

Resources

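Papers

  • Schulman et al., 2017. Proximal Policy Optimization Algorithms.
  • Shao et al., 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
  • DeepSeek-AI, 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
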
Libraries

  • TRL. GRPOTrainer does the heavy lifting here.
  • PEFT. LoRA adapters.
  • vLLM. Fast generation server for the sampling-heavy inner loop.
  • bitsandbytes. 8-bit optimizer.