Hands-on GRPO (Group Relative Policy Optimization) implementation from scratch in PyTorch. Understand GRPO by training a 135M-parameter model on a simple arithmetic task with a T4 GPU in Colab.
GRPO was used by DeepSeekMath to achieve state-of-the-art mathematical reasoning performance.
GRPO replaces PPO's value function with group statistics. Instead of learning a separate critic, it:
- Generates multiple responses per prompt (the "group")
- Uses the group mean/std as a baseline for advantage estimation (see the sketch after this list)
- Applies PPO-style clipping with KL regularization
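A minimal sketch of that group baseline, with made-up rewards for a group of four responses (variable names are illustrative, not necessarily the notebook's):

```python
import torch

# Hypothetical rewards for a group of 4 sampled responses
# (1.0 = correct answer, 0.0 = wrong).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Group-relative advantage: standardize rewards within the group.
# The group mean/std baseline replaces PPO's learned critic.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# tensor([ 0.8660, -0.8660,  0.8660, -0.8660])
```

Because the baseline comes from the group itself, no value network needs to be trained or stored.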
GRPO objective function to be maximised:
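For reference, a standard statement of this objective, following the DeepSeekMath paper ($G$ responses $o_i$ per question $q$, clip range $\varepsilon$, KL weight $\beta$):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q\sim P(Q),\;\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(O\mid q)}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;\mathrm{clip}\big(r_{i,t}(\theta),1-\varepsilon,1+\varepsilon\big)\,\hat{A}_{i,t}\big)-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right]
$$

where the importance ratio is $r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})}$ and the group-relative advantage is $\hat{A}_{i,t}=\frac{r_i-\mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}$, with the outcome reward $r_i$ shared across all tokens of response $i$.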
Paper-to-Code Bridge:
- Mathematical GRPO formula → direct PyTorch implementation
- Intentionally simplified to show a clear mapping between theory and code
- Hyperparameter playground: experiment with `clip_eps`, `beta`, `group_size`, and `temperature`
- See how each component (PPO clipping, KL regularization, group advantages) affects training (a loss sketch follows this list)
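To make that mapping concrete, here is a minimal, self-contained sketch of the clipped-plus-KL loss. Function and argument names, as well as the default `clip_eps` and `beta` values, are illustrative; the notebook's actual code may differ:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    """Illustrative GRPO loss for one group of equal-length responses.

    logp_new / logp_old / logp_ref: per-token log-probs of the sampled
    tokens under the current, behavior, and frozen reference policies,
    each of shape (group_size, seq_len).
    advantages: one standardized advantage per response, shape (group_size,).
    """
    adv = advantages.unsqueeze(-1)            # broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)    # importance ratio r_{i,t}

    # PPO-style clipped surrogate, taken elementwise per token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_term = torch.min(unclipped, clipped)

    # Unbiased per-token KL estimator used in the DeepSeekMath paper:
    # KL ≈ exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # The objective is maximized, so the loss is its negation; .mean()
    # assumes equal-length responses (the paper averages per response).
    return -(policy_term - beta * kl).mean()
```

Each line maps onto one piece of the formula above: the `min`/`clamp` pair is the PPO clipping, `kl` is the $\beta$-weighted regularizer, and `advantages` carries the group statistics.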
- Click the Colab badge above
- Runtime → Change runtime type → T4 GPU
  - IMPORTANT: pick the 2026.01 runtime version for stable results
- Run all cells (takes ~10 minutes)
No external RL libraries - just `transformers` and `torch`
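For a taste of the group-sampling step using only those two libraries, here is a hypothetical sketch; the checkpoint name is an assumption, and the notebook may use a different 135M model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint for illustration; any small causal LM works.
model_name = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "12 + 7 ="
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a group of responses for one prompt (the "group" in GRPO).
group_size = 8
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,
        max_new_tokens=16,
        num_return_sequences=group_size,
        pad_token_id=tokenizer.eos_token_id,
    )
prompt_len = inputs["input_ids"].shape[1]
completions = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
```

Each completion is then scored by the reward function, standardized within the group, and fed into the loss above.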
