A ready-to-run setup of michaelnny/deep_rl_zoo for experimenting with deep RL algorithms on Atari games (PPO, DQN, Rainbow, IQN, R2D2, NGU, Agent57, and more).
Attribution. All code under deep_rl_zoo/ and unit_tests/ is the original work of Michael Hu (michaelnny), released under Apache License 2.0. This repository is a sandbox derivative — it adds a uv-based setup script, a relaxed requirements pin set that builds on modern macOS (Apple Silicon), a FastAPI sidecar with a CRT-aesthetic research console, a CI workflow, and a small set of corrections (one bundled checkpoint with an obsolete architecture removed). The upstream README is preserved as UPSTREAM_README.md. See the LICENSE file for details.
Two ways to use this repo:
- Run the live console: `./start.sh`, open the browser, click ▶ to watch real agents play, ▶ TRAIN 5K to spawn training, ▶ COMPARE to run parallel evals, ▶ REPLAY for self-play MP4s. See §2.
- Drive `deep_rl_zoo` from the CLI: `python -m deep_rl_zoo.<algo>.run_atari` for training, `python -m deep_rl_zoo.<algo>.eval_agent` for evals. See §3 and §4.

deep_rl_zoo is research / educational code. Upstream ships 5 pre-trained checkpoints; in this sandbox we kept 4 (Pong × 3, MontezumaRevenge × 1) — the bundled PPO_Breakout_0.ckpt was removed because its network architecture predates upstream commit cd860e8 ("major update with breaking changes", June 2023) and no longer loads on current code (Missing key(s) in state_dict: policy_head.0.weight, ...).
The upstream author explicitly notes agents were "only tested on Atari Pong or Breakout, and we stop training once the agent has made some progress." That shows in the bundled weights: see the eval scores in §3. There is no community model zoo for deep_rl_zoo that gives you trained Agent57/R2D2/NGU agents on all 57 games. To get more checkpoints you train them here yourself. See §7 for external sources of pre-trained Atari agents (different ecosystem, different setup).

    ./setup.sh

What this does:

- Installs `snappy` and `ffmpeg` via Homebrew if missing.
- Creates a Python 3.10 venv in `.venv/` with `uv`, native arm64 on Apple Silicon (`uv` downloads `cpython-3.10-macos-aarch64` rather than reusing the brew x86_64 build, which would run under Rosetta and be ~2× slower).
- Installs PyTorch 2.2+, gym 0.25.2 (with the `[box2d]` extra so LunarLander works), ALE-py 0.7.5, AutoROM, python-snappy, FastAPI + uvicorn + tbparse (for the research console), and pytest + httpx (for tests).
- Downloads the 109 Atari 2600 ROMs via AutoROM (license auto-accepted).
- Runs a smoke test that creates a Pong env and prints the obs shape.
Verified eval throughput on M1 (arm64): ~780–1300 steps/sec on CPU depending on algorithm. The same evals on x86_64 Python under Rosetta were 320–700 steps/sec.
After it finishes:

    source .venv/bin/activate

To run the same steps by hand instead of through `setup.sh`:

    uv venv --python 3.10 .venv
    SNAPPY_PREFIX=$(brew --prefix snappy)
    CPPFLAGS="-I$SNAPPY_PREFIX/include" LDFLAGS="-L$SNAPPY_PREFIX/lib" \
    VIRTUAL_ENV=.venv uv pip install -r requirements-relaxed.txt
    .venv/bin/AutoROM --accept-license

Why a relaxed requirements file? The upstream `requirements.txt` pins old versions (torch 2.0.1, mujoco 2.2.2) that fail to install on modern macOS / Python 3.11+. `requirements-relaxed.txt` keeps `gym==0.25.2` and `ale-py==0.7.5` (mandatory, because the codebase uses the old gym API and the new ale-py 0.10+ doesn't register old-style env names with old gym), loosens torch / opencv / numpy, drops mujoco (not needed for Atari or any classic-control example here), and adds the `[box2d]` extra so LunarLander and the upstream `gym_env_test` work.
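
Since the two pins are described as mandatory, a quick check of the active environment can catch an accidental upgrade. The sketch below is only an illustration: `REQUIRED_PINS`, `check_pins`, and `installed_versions` are hypothetical names that do not exist in this repo, and only the package names and versions are taken from the paragraph above.

```python
from importlib import metadata

# Pins this README calls mandatory; everything else is deliberately loose.
REQUIRED_PINS = {"gym": "0.25.2", "ale-py": "0.7.5"}


def check_pins(installed):
    """Return human-readable problems; an empty list means the pins hold.

    `installed` maps distribution name -> version string (or None if absent).
    """
    problems = []
    for name, wanted in REQUIRED_PINS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name} is not installed (expected {wanted})")
        elif found != wanted:
            problems.append(f"{name}=={found} found, expected {wanted}")
    return problems


def installed_versions(names):
    """Look up versions in the active environment via importlib.metadata."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for problem in check_pins(installed_versions(REQUIRED_PINS)):
        print("WARNING:", problem)
```

Run inside the activated venv, the script prints nothing when both pins match and one warning line per mismatch otherwise.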

    ./start.sh

That's it. The script:

- Verifies the venv is set up.
- Frees the configured port (kills anything currently on `127.0.0.1:8000`).
- Cleans stale training-job logs from previous sessions.
- Spawns a background poller that opens the browser as soon as the server reports healthy.
- Runs `uvicorn frontend.server:app --reload` in the foreground, so edits to `frontend/*.py` hot-reload and Ctrl+C tears everything down.

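
The "poll until healthy, then open the browser" step can be sketched in Python as follows. This is not how `start.sh` implements it (that script is shell). The function names are hypothetical, and because the actual health endpoint is not documented here, the example URL simply assumes the server's root path on the default host and port answers once it is up.

```python
import time
import urllib.error
import urllib.request
import webbrowser


def http_probe(url, timeout=1.0):
    """Return True if `url` answers with a 2xx status, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe, attempts=30, interval=0.5, sleep=time.sleep):
    """Call `probe()` up to `attempts` times; return True as soon as it succeeds."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval)  # back off before the next try
    return False


# Example wiring (assumed URL; left commented out so importing has no side effects):
# url = "http://127.0.0.1:8000/"
# if wait_until_healthy(lambda: http_probe(url)):
#     webbrowser.open(url)
```

Passing the probe and the sleep function as arguments keeps the retry loop testable without a running server.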
Override defaults via env vars:

    PORT=8123 ./start.sh        # change port
    HOST=0.0.0.0 ./start.sh     # listen on all interfaces (e.g. for a remote dev box)
    NO_BROWSER=1 ./start.sh     # don't auto-open the browser (CI / headless)

| Button (transport bar) | What it does | Requires |
|---|---|---|
| ▶ Play | Spawns a real deep_rl_zoo greedy actor against the bundled checkpoint that matches your (algo, game) selection. Streams ALE frames into the canvas + action distribution. | A bundled checkpoint for the pair (today: IQN/Pong, PER-DQN/Pong, Rainbow/Pong, PPO-RND/MontezumaRevenge). |
| ▶ TRAIN 5K | Spawns `python -m deep_rl_zoo.<algo>.run_atari` with `--num_train_steps=5000 --num_eval_steps=500`. Logged to the event log; chart auto-refreshes when the run finishes. | Any (algo, game) pair. |
| ▶ COMPARE | Runs 5000-step parallel evals against every bundled checkpoint and renders the real mean returns in the agent comparison panel, sorted descending. ~10s on M1 CPU. | The 4 bundled checkpoints. |
| ▶ REPLAY | Plays the most recent `recordings/.../*.mp4` in an overlay above the canvas. | At least one prior eval that wrote an MP4 (the CLI `eval_agent` writes one; the WebSocket play path doesn't). |

The right-side hyperparameter panel is read-only today. To train with custom values, two paths:

    # (a) drive deep_rl_zoo directly from the CLI
    PYTHONPATH=. python -m deep_rl_zoo.dqn.run_atari \
        --environment_name=Breakout --learning_rate=0.0001 --discount=0.99 \
        --num_iterations=2 --num_train_steps=50000

    # (b) pass extra absl flags through the API (whitelisted: anything starting with --)
    curl -X POST http://127.0.0.1:8000/api/training/start -H 'Content-Type: application/json' -d '{
      "algo": "rainbow",
      "game": "Breakout",
      "num_train_steps": 10000,
      "extra_args": ["--learning_rate=0.0005", "--discount=0.99"]
    }'

The chart picks up new TensorBoard scalars automatically the next time you re-render it (selecting a different game, or after a TRAIN 5K run completes).
See frontend/README.md for the full API reference (read-only / WebSocket / training / comparison endpoints) and the architecture diagram.
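
For scripted sweeps, the same request as the `curl` call above can be built from Python with the standard library. This is a minimal sketch: the endpoint path and JSON fields are copied from that example, while `build_training_request` is a hypothetical helper and the client-side `--` check merely mirrors the whitelist rule described there. The response format is not covered here, so the sending step is left commented out.

```python
import json
import urllib.request


def build_training_request(algo, game, num_train_steps, extra_args=(),
                           base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST to /api/training/start."""
    for arg in extra_args:
        # Mirror the whitelist described above: only flags starting with "--".
        if not arg.startswith("--"):
            raise ValueError(f"extra arg must start with '--': {arg!r}")
    payload = {
        "algo": algo,
        "game": game,
        "num_train_steps": num_train_steps,
        "extra_args": list(extra_args),
    }
    return urllib.request.Request(
        url=f"{base_url}/api/training/start",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_training_request(
    "rainbow", "Breakout", 10000,
    extra_args=["--learning_rate=0.0005", "--discount=0.99"],
)
# To actually start the run (requires the console server to be up):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```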
Four pre-trained checkpoints are kept (see §1 for why we removed the fifth). Quickest test:

    # IQN agent on Pong, 2000 eval steps, no tensorboard
    python -m deep_rl_zoo.iqn.eval_agent \
        --environment_name=Pong \
        --load_checkpoint_file=./checkpoints/IQN_Pong_2.ckpt \
        --num_iterations=1 \
        --num_eval_steps=2000 \
        --nouse_tensorboard

This will:

- Run the agent in greedy (deterministic) mode.
- Print `episode_return` per iteration.
- Record an MP4 of self-play under `recordings/`.

Available bundled checkpoints (and what to expect over 10k steps on M1 arm64):

| File | Algorithm | Game | Eval module | Observed episode_return (10k steps) |
|---|---|---|---|---|
| IQN_Pong_2.ckpt | IQN | Pong | deep_rl_zoo.iqn.eval_agent | −2.50 (loses; undertrained) |
| PER-DQN_Pong_4.ckpt | Prioritized DQN | Pong | deep_rl_zoo.prioritized_dqn.eval_agent | +14.0 (wins) |
| Rainbow_Pong_2.ckpt | Rainbow | Pong | deep_rl_zoo.rainbow.eval_agent | +10.3 (wins) |
| PPO-RND_MontezumaRevenge_2.ckpt | PPO + RND | MontezumaRevenge | deep_rl_zoo.ppo_rnd.eval_agent | 0.0 (sparse-reward game; doesn't reach a key in 10k steps) |

Pong is scored from −21 to +21. The PER-DQN and Rainbow checkpoints win comfortably; IQN is a partial-training snapshot that loses. The PPO-RND/Montezuma checkpoint is also early-training — Montezuma needs millions of steps to see meaningful exploration.
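
To re-run all four evals without retyping the flags, the table can be turned into command lines. The sketch below assumes the checkpoints live under `./checkpoints/` as in the quick test above. `BUNDLED` and `eval_command` are hypothetical names, and the flags are the ones already shown in that quick test.

```python
import shlex

# checkpoint file -> (eval module, game), copied from the table above.
BUNDLED = {
    "IQN_Pong_2.ckpt": ("deep_rl_zoo.iqn.eval_agent", "Pong"),
    "PER-DQN_Pong_4.ckpt": ("deep_rl_zoo.prioritized_dqn.eval_agent", "Pong"),
    "Rainbow_Pong_2.ckpt": ("deep_rl_zoo.rainbow.eval_agent", "Pong"),
    "PPO-RND_MontezumaRevenge_2.ckpt": (
        "deep_rl_zoo.ppo_rnd.eval_agent", "MontezumaRevenge"),
}


def eval_command(ckpt_name, num_eval_steps=10000, ckpt_dir="./checkpoints"):
    """Return the argv list for evaluating one bundled checkpoint."""
    module, game = BUNDLED[ckpt_name]
    return [
        "python", "-m", module,
        f"--environment_name={game}",
        f"--load_checkpoint_file={ckpt_dir}/{ckpt_name}",
        "--num_iterations=1",
        f"--num_eval_steps={num_eval_steps}",
        "--nouse_tensorboard",
    ]


# Print copy-pasteable shell commands, one per checkpoint.
for name in BUNDLED:
    print(shlex.join(eval_command(name)))
```

The returned list can also be passed directly to `subprocess.run` from inside the activated venv.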
The upstream defaults are aggressive: dqn.run_atari is num_iterations=100 × num_train_steps=500_000 = 50M frames per run. On a CPU-only Mac that's days. Always pass smaller numbers for a quick smoke run, then scale up once you know the pipeline works.

    # Quick sanity run (~2 min on M1 CPU): 1 iteration, 5k train steps, 1k eval steps
    python -m deep_rl_zoo.dqn.run_atari --environment_name=Pong \
        --num_iterations=1 --num_train_steps=5000 --num_eval_steps=1000 \
        --replay_capacity=10000 --min_replay_size=2000

    # Distributed PPO with 8 actors on Breakout (long run; default ≈50M frames)
    python -m deep_rl_zoo.ppo.run_atari --environment_name=Breakout --num_actors=8

    # Agent57 on a hard exploration game (very long; tune iterations down for a smoke test)
    python -m deep_rl_zoo.agent57.run_atari --environment_name=MontezumaRevenge --num_actors=8

Each run writes:

- TensorBoard logs to `runs/`
- Checkpoints to `checkpoints/` every N iterations

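
To put the default budget quoted above into numbers, the arithmetic can be worked through directly. The throughput figures are the M1 eval rates from the setup notes, used here only as an optimistic bound: training also performs gradient updates and will be slower, which is consistent with the "days" estimate. The helper names are illustrative.

```python
def total_steps(num_iterations, num_train_steps):
    """Training steps in a run: iterations times steps per iteration."""
    return num_iterations * num_train_steps


def hours_at(steps, steps_per_sec):
    """Wall-clock hours needed to process `steps` at a constant rate."""
    return steps / steps_per_sec / 3600


default_budget = total_steps(100, 500_000)  # upstream dqn.run_atari defaults
smoke_budget = total_steps(1, 5_000)        # the quick sanity run above

# Eval-only throughput on M1 (780 to 1300 steps/sec), an optimistic bound.
best_case = hours_at(default_budget, 1300)
worst_case = hours_at(default_budget, 780)

print(f"default run: {default_budget:,} steps")
print(f"smoke run is {default_budget // smoke_budget:,}x smaller")
print(f"at eval speed alone: {best_case:.1f} to {worst_case:.1f} hours")
```

Even at inference-only speed the default run takes roughly 11 to 18 hours, so the smoke-run settings are the sensible starting point on CPU.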
Watch progress:

    tensorboard --logdir=./runs

Classic-control environments use the `run_classic` entry point instead:

    python -m deep_rl_zoo.dqn.run_classic --environment_name=CartPole-v1
    python -m deep_rl_zoo.ppo.run_classic --environment_name=LunarLander-v2 --num_actors=4

- Policy-based: reinforce, reinforce_baseline, actor_critic, a2c, sac, ppo, ppo_icm, ppo_rnd, impala
- Value-based: dqn, double_dqn, prioritized_dqn, drqn, r2d2, ngu, agent57
- Distributional: c51_dqn, rainbow, qr_dqn, iqn

Each algorithm has the same three entry points: run_classic, run_atari, eval_agent.
The upstream ./run_unit_tests.sh and ./run_e2e_tests.sh call python3 directly — they assume .venv is activated. Without activation you'll get ModuleNotFoundError: No module named 'absl'. Run source .venv/bin/activate first.
Three suites:

    # Frontend API tests: 18 tests, ~0.5s. Pure pytest, no subprocess spawning.
    PYTHONPATH=. pytest frontend/test_server.py -v

    # Upstream unit tests: 130 tests, ~5s. Pure-function tests for losses, replay,
    # env wrappers, checkpoint serialization, etc.
    ./run_unit_tests.sh

    # Upstream end-to-end tests: ~60 tests, 30-90s each. Actually launches every
    # algorithm's run_classic / run_atari / eval_agent for a few hundred steps.
    # Total runtime ~1 hour. Catches regressions when you modify upstream code.
    ./run_e2e_tests.sh

CI (`.github/workflows/test.yml`) runs the frontend API tests, the upstream unit tests, and an IQN smoke eval on every push.

| You want to... | Edit |
|---|---|
| Tweak network architectures | deep_rl_zoo/networks/{policy,value,curiosity}.py |
| Add an Atari preprocessing wrapper | deep_rl_zoo/gym_env.py |
| Change a loss function | deep_rl_zoo/{policy_gradient,value_learning,nonlinear_bellman}.py |
| Modify the experience replay (PER, R2D2 sequence buffer, NGU episodic memory) | deep_rl_zoo/replay.py |
| Change distributed actor/learner orchestration | deep_rl_zoo/main_loop.py, deep_rl_zoo/distributed.py |
| Add a new algorithm | Copy any existing folder (e.g. deep_rl_zoo/dqn/) — it has agent.py, run_classic.py, run_atari.py, eval_agent.py |
| Tune hyperparameters (CLI) | Top of each run_atari.py (absl-py FLAGS) |
| Add a new API endpoint | frontend/server.py (FastAPI app) — re-export tests in frontend/test_server.py |
| Wire a UI panel to real data | frontend/app.js — fetch from /api/..., render into existing markup in frontend/index.html |
| Support more checkpoint types in the WebSocket play path | frontend/stream_eval.py — add a factory in ALGO_FACTORIES |
The hyperparameters in the upstream code are not fine-tuned (author's own caveat). Any HP sweep is a useful experiment.
If you want trained agents on more games right now, the largest public source is the Stable-Baselines3 Zoo on HuggingFace: https://huggingface.co/sb3. Coverage: DQN, PPO, QR-DQN, A2C × ~25 popular Atari games.
These are not integrable with this venv — SB3 needs gymnasium + a recent ale-py, while deep_rl_zoo is locked to gym 0.25.2 + ale-py 0.7.5. The two stacks are mutually incompatible because ale-py 0.7.5 only auto-registers env names with old gym, and ale-py 0.10+ only auto-registers with gymnasium.
If you want to use them, set up a separate venv in a sibling directory (NOT inside this Atari57/ folder):

    mkdir -p ../Atari57_sb3 && cd ../Atari57_sb3
    uv venv --python 3.11 .venv
    VIRTUAL_ENV=.venv uv pip install "stable-baselines3[extra]" sb3-contrib huggingface-sb3 "ale-py>=0.10" "gymnasium[atari]"

Then load any agent via `huggingface_sb3.load_from_hub(repo_id="sb3/dqn-PongNoFrameskip-v4", filename="dqn-PongNoFrameskip-v4.zip")` and `stable_baselines3.DQN.load(...)`. Note that this sibling-venv recipe is unverified: it's the standard SB3 quick-start, but I haven't actually built it on this machine. It exists here as a starting point, not a guarantee.
No module named 'snappy' — python-snappy needs brew install snappy first; setup.sh handles this.
Could not initialize NNPACK (CPU warning) — harmless on M1.
Deprecation warnings about old gym step API and np.bool8 — expected. gym==0.25.2 is pinned because deep_rl_zoo predates the gym → gymnasium migration. Don't bump it.
render_mode warnings during eval — gym 0.25.2 quirk; the MP4 still renders correctly.
Slow training on CPU — Atari training is GPU-friendly. The upstream code only switches between CUDA and CPU (every run_*.py has the same line). On Mac it falls back to CPU even when MPS is available.
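
If you want to experiment with MPS, the device choice could be extended along the lines below. This is a sketch, not the upstream line and not something verified against deep_rl_zoo: `pick_device` is a hypothetical helper, and some PyTorch operations may be unsupported or slower on MPS, so treat any speedup as something to measure. The selection logic is kept free of imports so it can be read and tested without PyTorch installed.

```python
def pick_device(cuda_available, mps_available):
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# With PyTorch installed, the two flags would come from the library itself:
#   import torch
#   name = pick_device(torch.cuda.is_available(),
#                      torch.backends.mps.is_available())
#   device = torch.device(name)

print(pick_device(cuda_available=False, mps_available=True))  # → mps
```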
Canonical reads for the headline algorithms in this repo:
- Agent57 (Badia et al., DeepMind, 2020) — first agent above human on all 57 Atari games. https://arxiv.org/abs/2003.13350
- R2D2 (Kapturowski et al., DeepMind, 2019) — distributed recurrent replay foundation. https://openreview.net/pdf?id=r1lyTjAqYX
- NGU (Badia et al., 2020) — exploration curriculum that Agent57 builds on. https://arxiv.org/abs/2002.06038
- IQN (Dabney et al., 2018) — implicit quantile distributional RL. https://arxiv.org/abs/1806.06923
- Rainbow (Hessel et al., 2017) — combined DQN improvements. https://arxiv.org/abs/1710.02298
These map to deep_rl_zoo/{agent57,r2d2,ngu,iqn,rainbow}/.
Jose Lopez — AI engineer in Madrid, working on the intersection of biological and artificial intelligence.
- GitHub: @aifriend
- LinkedIn: jafdl
- Website: auto-latam.com
