Hands-on GRPO (Group Relative Policy Optimization) implementation from scratch in PyTorch. Understand GRPO by training a 135M-parameter model on a simple arithmetic task with a T4 GPU in Colab.
GRPO was used by DeepSeekMath to achieve state-of-the-art mathematical reasoning performance.
GRPO replaces PPO's value function with group statistics. Instead of learning a separate critic, it:
- Generates multiple responses per prompt (the "group")
- Uses the group mean/std as a baseline for advantage estimation (see the sketch after this list)
- Applies PPO-style clipping with KL regularization
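A minimal sketch of that group baseline, with made-up rewards for a group of four responses (variable names are illustrative, not necessarily the notebook's):

```python
import torch

# Hypothetical rewards for a group of 4 sampled responses
# (1.0 = correct answer, 0.0 = wrong).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Group-relative advantage: standardize rewards within the group.
# The group mean/std baseline replaces PPO's learned critic.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# tensor([ 0.8660, -0.8660,  0.8660, -0.8660])
```

Because the baseline comes from the group itself, no value network needs to be trained or stored.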
GRPO objective function to be maximised:
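For reference, a standard statement of this objective, following the DeepSeekMath paper ($G$ responses $o_i$ per question $q$, clip range $\varepsilon$, KL weight $\beta$):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q\sim P(Q),\;\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(O\mid q)}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;\mathrm{clip}\big(r_{i,t}(\theta),1-\varepsilon,1+\varepsilon\big)\,\hat{A}_{i,t}\big)-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right]
$$

where the importance ratio is $r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})}$ and the group-relative advantage is $\hat{A}_{i,t}=\frac{r_i-\mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}$, with the outcome reward $r_i$ shared across all tokens of response $i$.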
Paper-to-Code Bridge:
- Mathematical GRPO formula → direct PyTorch implementation
- Intentionally simplified to show a clear mapping between theory and code
- Hyperparameter playground: experiment with `clip_eps`, `beta`, `group_size`, and `temperature`
- See how each component (PPO clipping, KL regularization, group advantages) affects training (a loss sketch follows this list)
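To make that mapping concrete, here is a minimal, self-contained sketch of the clipped-plus-KL loss. Function and argument names, as well as the default `clip_eps` and `beta` values, are illustrative; the notebook's actual code may differ:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    """Illustrative GRPO loss for one group of equal-length responses.

    logp_new / logp_old / logp_ref: per-token log-probs of the sampled
    tokens under the current, behavior, and frozen reference policies,
    each of shape (group_size, seq_len).
    advantages: one standardized advantage per response, shape (group_size,).
    """
    adv = advantages.unsqueeze(-1)            # broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)    # importance ratio r_{i,t}

    # PPO-style clipped surrogate, taken elementwise per token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_term = torch.min(unclipped, clipped)

    # Unbiased per-token KL estimator used in the DeepSeekMath paper:
    # KL ≈ exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # The objective is maximized, so the loss is its negation; .mean()
    # assumes equal-length responses (the paper averages per response).
    return -(policy_term - beta * kl).mean()
```

Each line maps onto one piece of the formula above: the `min`/`clamp` pair is the PPO clipping, `kl` is the $\beta$-weighted regularizer, and `advantages` carries the group statistics.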
- Click the Colab badge above
- Runtime → Change runtime type → T4 GPU
  - IMPORTANT: pick the 2026.01 runtime version for stable results
- Run all cells (takes ~10 minutes)
No external RL libraries - just `transformers` and `torch`
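For a taste of the group-sampling step using only those two libraries, here is a hypothetical sketch; the checkpoint name is an assumption, and the notebook may use a different 135M model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint for illustration; any small causal LM works.
model_name = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "12 + 7 ="
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a group of responses for one prompt (the "group" in GRPO).
group_size = 8
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,
        max_new_tokens=16,
        num_return_sequences=group_size,
        pad_token_id=tokenizer.eos_token_id,
    )
prompt_len = inputs["input_ids"].shape[1]
completions = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
```

Each completion is then scored by the reward function, standardized within the group, and fed into the loss above.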
