GRPO Toy Implementation

Open In Colab

Hands-on GRPO (Group Relative Policy Optimization) implementation from scratch in PyTorch. Understand GRPO by training a 135M-parameter model on a simple arithmetic task on a T4 GPU in Colab.

What's GRPO?

GRPO was used by DeepSeek-Math to achieve state-of-the-art mathematical reasoning performance.

GRPO replaces PPO's learned value function with group statistics. Instead of training a separate critic, it:

  1. Generates multiple responses per prompt (a group)
  2. Uses the group mean/std as the baseline for advantage estimation
  3. Applies PPO-style clipping with KL regularization
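The steps above can be sketched in PyTorch. This is a minimal illustration of group-normalized advantages and the clipped surrogate, not the notebook's exact code; the function names and tensor shapes are assumptions for the sketch.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within each group: (r - group mean) / group std.

    rewards: (num_prompts, group_size), one scalar reward per sampled response.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # epsilon guards against zero std

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate over per-token log-probs.

    logp_new, logp_old: (batch, seq_len) log-probs of the sampled tokens.
    advantages: (batch,) one advantage per response, broadcast over tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                    # importance ratio
    adv = advantages.unsqueeze(-1)                            # broadcast to tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()              # maximize -> negate
```

Note that the standard-deviation normalization is what makes the method "group relative": a response is only rewarded for being better than its siblings from the same prompt.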

GRPO objective function to be maximised:

GRPO Formula
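For reference, the objective from the DeepSeek-Math paper (the formula pictured above) can be written as follows; notation may differ slightly from the notebook:

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}\!\left[
    \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
      \left(
        \min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;
                   \operatorname{clip}\!\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\Big)
        - \beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]
      \right)
  \right]
```

where G is the group size, r_{i,t}(θ) = π_θ(o_{i,t} | q, o_{i,<t}) / π_{θ_old}(o_{i,t} | q, o_{i,<t}) is the per-token importance ratio, and the advantage \hat{A}_{i,t} = (R_i - mean(R)) / std(R) is the response's reward normalized against its group.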

What's in the Notebook

Paper-to-Code Bridge:

  • Mathematical GRPO formula → Direct PyTorch implementation
  • Simplified on purpose to show clear mapping between theory and code
  • Hyperparameter playground: experiment with clip_eps, beta, group_size, temperature
  • See how each component (PPO clipping, KL regularization, group advantages) affects training
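The KL term controlled by beta is typically computed with the nonnegative per-token estimator used in the GRPO paper (k3), rather than the naive log-ratio. A minimal sketch, with illustrative hyperparameter values that are assumptions rather than the notebook's defaults:

```python
import torch

def kl_penalty(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-token KL estimator k3 = exp(log_ratio) - log_ratio - 1.

    Always >= 0, and exactly 0 when the policies agree on the sampled token.
    """
    log_ratio = logp_ref - logp_theta
    return torch.exp(log_ratio) - log_ratio - 1.0

# Illustrative hyperparameters to experiment with (not the notebook's defaults):
config = dict(clip_eps=0.2, beta=0.04, group_size=8, temperature=1.0)
```

Raising beta keeps the policy closer to the reference model; raising temperature and group_size increases the diversity of each group, which sharpens the group baseline at the cost of more generation time.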

Usage

  1. Click the Colab badge above
  2. Runtime → Change runtime type → T4 GPU. IMPORTANT: pick the 2026.01 runtime version for stable results
  3. Run all cells (takes ~10 minutes)

No external RL libraries are required: just transformers and torch.
