aifriend/atari57-sandbox

Atari57 Research Sandbox

A ready-to-run setup of michaelnny/deep_rl_zoo for experimenting with deep RL algorithms on Atari games (PPO, DQN, Rainbow, IQN, R2D2, NGU, Agent57, and more).

Attribution. All code under deep_rl_zoo/ and unit_tests/ is the original work of Michael Hu (michaelnny), released under Apache License 2.0. This repository is a sandbox derivative — it adds a uv-based setup script, a relaxed requirements pin set that builds on modern macOS (Apple Silicon), a FastAPI sidecar with a CRT-aesthetic research console, a CI workflow, and a small set of corrections (one bundled checkpoint with an obsolete architecture removed). The upstream README is preserved as UPSTREAM_README.md. See the LICENSE file for details.

Two ways to use this repo:

  • Run the live console: ./start.sh, open the browser, click ▶ to watch real agents play, ▶ TRAIN 5K to spawn training, ▶ COMPARE to run parallel evals, ▶ REPLAY for self-play MP4s. See §2.
  • Drive deep_rl_zoo from the CLI: python -m deep_rl_zoo.<algo>.run_atari for training, python -m deep_rl_zoo.<algo>.eval_agent for evals. See §3 and §4.

⚠️ About "play all 57 Atari games out of the box"

deep_rl_zoo is research / educational code. Upstream ships 5 pre-trained checkpoints; in this sandbox we kept 4 (Pong × 3, MontezumaRevenge × 1) — the bundled PPO_Breakout_0.ckpt was removed because its network architecture predates upstream commit cd860e8 ("major update with breaking changes", June 2023) and no longer loads on current code (Missing key(s) in state_dict: policy_head.0.weight, ...).

The upstream author explicitly notes agents were "only tested on Atari Pong or Breakout, and we stop training once the agent has made some progress." That shows in the bundled weights: see the eval scores in §3. There is no community model zoo for deep_rl_zoo that gives you trained Agent57/R2D2/NGU agents on all 57 games. To get more checkpoints you train them here yourself. See §7 for external sources of pre-trained Atari agents (different ecosystem, different setup).

1. Setup (one command)

./setup.sh

What this does:

  • Installs snappy and ffmpeg via Homebrew if missing.
  • Creates a Python 3.10 venv in .venv/ with uv — native arm64 on Apple Silicon (uv downloads cpython-3.10-macos-aarch64 rather than reusing the brew x86_64 build, which would run under Rosetta and be ~2× slower).
  • Installs PyTorch 2.2+, gym 0.25.2 (with [box2d] extra so LunarLander works), ALE-py 0.7.5, AutoROM, python-snappy, FastAPI + uvicorn + tbparse (for the research console), and pytest + httpx (for tests).
  • Downloads the 109 Atari 2600 ROMs via AutoROM (license auto-accepted).
  • Runs a smoke test that creates a Pong env and prints the obs shape.

Verified eval throughput on M1 (arm64): ~780–1300 steps/sec on CPU depending on algorithm. The same evals on x86_64 Python under Rosetta were 320–700 steps/sec.

After it finishes:

source .venv/bin/activate

Manual setup (if you skip setup.sh)

uv venv --python 3.10 .venv
SNAPPY_PREFIX=$(brew --prefix snappy)
CPPFLAGS="-I$SNAPPY_PREFIX/include" LDFLAGS="-L$SNAPPY_PREFIX/lib" \
  VIRTUAL_ENV=.venv uv pip install -r requirements-relaxed.txt
.venv/bin/AutoROM --accept-license

Why a relaxed requirements file? The upstream requirements.txt pins old versions (torch 2.0.1, mujoco 2.2.2) that fail to install on modern macOS / Python 3.11+. requirements-relaxed.txt keeps gym==0.25.2 and ale-py==0.7.5 (mandatory — the codebase uses the old gym API and the new ale-py 0.10+ doesn't register old-style env names with old gym), loosens torch / opencv / numpy, drops mujoco (not needed for Atari or any classic-control example here), and adds the [box2d] extra so LunarLander and the upstream gym_env_test work.

2. Run the research console (recommended)

./start.sh

That's it. The script:

  • Verifies the venv is set up.
  • Frees the configured port (kills anything currently on 127.0.0.1:8000).
  • Cleans stale training-job logs from previous sessions.
  • Spawns a background poller that opens the browser as soon as the server reports healthy.
  • Runs uvicorn frontend.server:app --reload in the foreground, so edits to frontend/*.py hot-reload and Ctrl+C tears everything down.
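The background poller in the list above could look roughly like the following Python sketch. It is not the code in start.sh: the function names and the /api/health path used in the usage note are assumptions for illustration, and the probe is injectable so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if `url` answers with HTTP 200 within one second."""
    try:
        with urllib.request.urlopen(url, timeout=1.0) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not healthy yet.
        return False


def wait_until_healthy(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 30,
    delay: float = 0.5,
) -> bool:
    """Poll `url` up to `attempts` times; return True on the first success."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

A caller would pass the server's health URL (hypothetically http://127.0.0.1:8000/api/health; check frontend/README.md for the real path) and call webbrowser.open on the console URL once wait_until_healthy returns True.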

Override defaults via env vars:

PORT=8123 ./start.sh       # change port
HOST=0.0.0.0 ./start.sh    # listen on all interfaces (e.g. for a remote dev box)
NO_BROWSER=1 ./start.sh    # don't auto-open the browser (CI / headless)

What you can do once the page loads

| Button (transport bar) | What it does | Requires |
| --- | --- | --- |
| ▶ Play | Spawns a real deep_rl_zoo greedy actor against the bundled checkpoint that matches your (algo, game) selection. Streams ALE frames into the canvas + action distribution. | A bundled checkpoint for the pair (today: IQN/Pong, PER-DQN/Pong, Rainbow/Pong, PPO-RND/MontezumaRevenge). |
| ▶ TRAIN 5K | Spawns python -m deep_rl_zoo.<algo>.run_atari with --num_train_steps=5000 --num_eval_steps=500. Logged to the event log; chart auto-refreshes when the run finishes. | Any (algo, game) pair. |
| ▶ COMPARE | Runs 5000-step parallel evals against every bundled checkpoint and renders the real mean returns in the agent comparison panel, sorted descending. ~10s on M1 CPU. | The 4 bundled checkpoints. |
| ▶ REPLAY | Plays the most recent recordings/.../*.mp4 in an overlay above the canvas. | At least one prior eval that wrote an MP4 (the CLI eval_agent writes one; the WebSocket play path doesn't). |

Custom hyperparameters

The right-side hyperparameter panel is read-only today. To train with custom values, two paths:

# (a) drive deep_rl_zoo directly from the CLI
PYTHONPATH=. python -m deep_rl_zoo.dqn.run_atari \
  --environment_name=Breakout --learning_rate=0.0001 --discount=0.99 \
  --num_iterations=2 --num_train_steps=50000

# (b) pass extra absl flags through the API (whitelisted: anything starting with --)
curl -X POST http://127.0.0.1:8000/api/training/start -H 'Content-Type: application/json' -d '{
  "algo": "rainbow",
  "game": "Breakout",
  "num_train_steps": 10000,
  "extra_args": ["--learning_rate=0.0005", "--discount=0.99"]
}'

The chart picks up new tensorboard scalars automatically the next time you re-render it (selecting a different game, or after a TRAIN 5K run completes).
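For scripted sweeps, the curl call in (b) can be mirrored from Python. The sketch below only builds the request; the endpoint and JSON fields come from the curl example, while the helper name build_training_request and the client-side check on extra_args are assumptions (the server's own validation may differ).

```python
import json
import urllib.request
from typing import List, Optional


def build_training_request(
    algo: str,
    game: str,
    num_train_steps: int,
    extra_args: Optional[List[str]] = None,
    base_url: str = "http://127.0.0.1:8000",
) -> urllib.request.Request:
    """Build (but do not send) the POST request for /api/training/start."""
    extra_args = list(extra_args or [])
    # Mirror the documented whitelist: every extra flag must start with "--".
    bad = [arg for arg in extra_args if not arg.startswith("--")]
    if bad:
        raise ValueError(f"extra_args must start with '--': {bad}")
    payload = {
        "algo": algo,
        "game": game,
        "num_train_steps": num_train_steps,
        "extra_args": extra_args,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/training/start",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is a matter of urllib.request.urlopen(req) while the console is running; the response format is described in frontend/README.md and is not assumed here.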

See frontend/README.md for the full API reference (read-only / WebSocket / training / comparison endpoints) and the architecture diagram.

3. Try a bundled checkpoint (CLI)

Four pre-trained checkpoints are kept (see the note before §1 for why the fifth was removed). Quickest test:

# IQN agent on Pong, 2000 eval steps, no tensorboard
python -m deep_rl_zoo.iqn.eval_agent \
  --environment_name=Pong \
  --load_checkpoint_file=./checkpoints/IQN_Pong_2.ckpt \
  --num_iterations=1 \
  --num_eval_steps=2000 \
  --nouse_tensorboard

This will:

  • Run the agent in greedy (deterministic) mode.
  • Print episode_return per iteration.
  • Record an MP4 of self-play under recordings/.

Available bundled checkpoints (and what to expect over 10k steps on M1 arm64):

| File | Algorithm | Game | eval module | Observed episode_return (10k steps) |
| --- | --- | --- | --- | --- |
| IQN_Pong_2.ckpt | IQN | Pong | deep_rl_zoo.iqn.eval_agent | −2.50 (loses; undertrained) |
| PER-DQN_Pong_4.ckpt | Prioritized DQN | Pong | deep_rl_zoo.prioritized_dqn.eval_agent | +14.0 (wins) |
| Rainbow_Pong_2.ckpt | Rainbow | Pong | deep_rl_zoo.rainbow.eval_agent | +10.3 (wins) |
| PPO-RND_MontezumaRevenge_2.ckpt | PPO + RND | MontezumaRevenge | deep_rl_zoo.ppo_rnd.eval_agent | 0.0 (sparse-reward game; doesn't reach a key in 10k steps) |

Pong is scored from −21 to +21. The PER-DQN and Rainbow checkpoints win comfortably; IQN is a partial-training snapshot that loses. The PPO-RND/Montezuma checkpoint is also early-training — Montezuma needs millions of steps to see meaningful exploration.
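To evaluate all four bundled checkpoints in a loop, the eval command can be derived from the file name. This is a minimal sketch: the prefix-to-package mapping and the flags are taken from the table and the IQN example above, while the helper name eval_command and the assumption that every bundled file is named <Algo>_<Game>_<suffix>.ckpt are mine.

```python
from pathlib import Path
from typing import List

# File-name prefix -> deep_rl_zoo package, as listed in the checkpoint table.
EVAL_PACKAGES = {
    "IQN": "iqn",
    "PER-DQN": "prioritized_dqn",
    "Rainbow": "rainbow",
    "PPO-RND": "ppo_rnd",
}


def eval_command(ckpt_path: str, num_eval_steps: int = 10_000) -> List[str]:
    """Derive the eval_agent command line from a bundled checkpoint path."""
    # Assumed naming scheme: <Algo>_<Game>_<suffix>.ckpt, e.g. IQN_Pong_2.ckpt.
    algo, game, _suffix = Path(ckpt_path).stem.split("_")
    package = EVAL_PACKAGES[algo]
    return [
        "python", "-m", f"deep_rl_zoo.{package}.eval_agent",
        f"--environment_name={game}",
        f"--load_checkpoint_file={ckpt_path}",
        "--num_iterations=1",
        f"--num_eval_steps={num_eval_steps}",
        "--nouse_tensorboard",
    ]
```

The returned list can be handed to subprocess.run from an activated venv; an unknown prefix raises KeyError rather than guessing a module.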

4. Train your own (CLI)

On Atari

The upstream defaults are aggressive: dqn.run_atari is num_iterations=100 × num_train_steps=500_000 = 50M frames per run. On a CPU-only Mac that's days. Always pass smaller numbers for a quick smoke run, then scale up once you know the pipeline works.
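The budget arithmetic can be checked with a few lines of Python. The step counts are the defaults quoted above; the training rate is an assumption derived from the roughly 2-minute smoke run in this section (5k train + 1k eval steps), which includes startup overhead, so treat the result as an order-of-magnitude estimate only.

```python
SECONDS_PER_DAY = 86_400


def total_train_steps(num_iterations: int, num_train_steps: int) -> int:
    """Training steps in one run: iterations times steps per iteration."""
    return num_iterations * num_train_steps


def days_at(steps: int, steps_per_sec: float) -> float:
    """Wall-clock days needed for `steps` at a constant `steps_per_sec`."""
    return steps / steps_per_sec / SECONDS_PER_DAY


# Upstream dqn.run_atari defaults quoted above.
default_steps = total_train_steps(100, 500_000)        # 50_000_000

# Assumed rate: 6,000 steps in about 120 seconds is 50 steps/sec.
assumed_rate = (5_000 + 1_000) / 120.0                 # 50.0

estimate_days = days_at(default_steps, assumed_rate)   # about 11.6
```

Under that assumed rate a default run lands in the range of a week or more on CPU, which is why the smoke-run flags are worth passing first.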

# Quick sanity run (~2 min on M1 CPU): 1 iteration, 5k train steps, 1k eval steps
python -m deep_rl_zoo.dqn.run_atari --environment_name=Pong \
  --num_iterations=1 --num_train_steps=5000 --num_eval_steps=1000 \
  --replay_capacity=10000 --min_replay_size=2000

# Distributed PPO with 8 actors on Breakout (long run; default ≈50M frames)
python -m deep_rl_zoo.ppo.run_atari --environment_name=Breakout --num_actors=8

# Agent57 on a hard exploration game (very long; tune iterations down for a smoke test)
python -m deep_rl_zoo.agent57.run_atari --environment_name=MontezumaRevenge --num_actors=8

Each run writes:

  • TensorBoard logs to runs/
  • Checkpoints to checkpoints/ every N iterations

Watch progress:

tensorboard --logdir=./runs

On classic control (CartPole, LunarLander) — minutes, useful for sanity checks

python -m deep_rl_zoo.dqn.run_classic --environment_name=CartPole-v1
python -m deep_rl_zoo.ppo.run_classic --environment_name=LunarLander-v2 --num_actors=4

Algorithms available

  • Policy-based: reinforce, reinforce_baseline, actor_critic, a2c, sac, ppo, ppo_icm, ppo_rnd, impala
  • Value-based: dqn, double_dqn, prioritized_dqn, drqn, r2d2, ngu, agent57
  • Distributional: c51_dqn, rainbow, qr_dqn, iqn

Each algorithm has the same three entry points: run_classic, run_atari, eval_agent.
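Because the layout is uniform, module paths can be generated rather than typed. The sketch below encodes the algorithm families listed above; the names ALGORITHMS, module_path, and all_module_paths are illustrative helpers, not part of deep_rl_zoo.

```python
from typing import Dict, List, Tuple

ALGORITHMS: Dict[str, List[str]] = {
    "policy": ["reinforce", "reinforce_baseline", "actor_critic", "a2c",
               "sac", "ppo", "ppo_icm", "ppo_rnd", "impala"],
    "value": ["dqn", "double_dqn", "prioritized_dqn", "drqn",
              "r2d2", "ngu", "agent57"],
    "distributional": ["c51_dqn", "rainbow", "qr_dqn", "iqn"],
}
ENTRY_POINTS: Tuple[str, ...] = ("run_classic", "run_atari", "eval_agent")


def module_path(algo: str, entry_point: str) -> str:
    """Return the `python -m` target for an algorithm and entry point."""
    known = {name for names in ALGORITHMS.values() for name in names}
    if algo not in known:
        raise ValueError(f"unknown algorithm: {algo}")
    if entry_point not in ENTRY_POINTS:
        raise ValueError(f"unknown entry point: {entry_point}")
    return f"deep_rl_zoo.{algo}.{entry_point}"


def all_module_paths() -> List[str]:
    """Every algorithm crossed with every entry point."""
    return [
        module_path(algo, entry)
        for names in ALGORITHMS.values()
        for algo in names
        for entry in ENTRY_POINTS
    ]
```

For example, module_path("iqn", "eval_agent") yields deep_rl_zoo.iqn.eval_agent, the module used in §3.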

5. Run the test suites

The upstream ./run_unit_tests.sh and ./run_e2e_tests.sh call python3 directly — they assume .venv is activated. Without activation you'll get ModuleNotFoundError: No module named 'absl'. Run source .venv/bin/activate first.
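A small preflight check can catch the unactivated-venv case before the test scripts fail. This is a hedged sketch using only the standard library; the helper names are hypothetical, and absl is used as the canary module only because it is the one named in the error above.

```python
import importlib.util
import sys
from typing import Iterable, List


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from base)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(required: Iterable[str] = ("absl",)) -> List[str]:
    """Collect human-readable problems; an empty list means ready to run."""
    problems = []
    if not in_virtualenv():
        problems.append("no virtualenv active: run `source .venv/bin/activate`")
    for name in missing_modules(required):
        problems.append(f"module not importable: {name}")
    return problems
```

Printing the result of preflight() before invoking ./run_unit_tests.sh gives a clearer message than the ModuleNotFoundError traceback.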

Three suites:

# Frontend API tests — 18 tests, ~0.5s. Pure pytest, no subprocess spawning.
PYTHONPATH=. pytest frontend/test_server.py -v

# Upstream unit tests — 130 tests, ~5s. Pure-function tests for losses, replay,
# env wrappers, checkpoint serialization, etc.
./run_unit_tests.sh

# Upstream end-to-end tests — ~60 tests, 30-90s each. Actually launches every
# algorithm's run_classic / run_atari / eval_agent for a few hundred steps.
# Total runtime ~1 hour. Catches regressions when you modify upstream code.
./run_e2e_tests.sh

CI (.github/workflows/test.yml) runs the frontend-api tests + upstream unit tests + an IQN smoke eval on every push.

6. Where to make changes (quick map)

| You want to... | Edit |
| --- | --- |
| Tweak network architectures | deep_rl_zoo/networks/{policy,value,curiosity}.py |
| Add an Atari preprocessing wrapper | deep_rl_zoo/gym_env.py |
| Change a loss function | deep_rl_zoo/{policy_gradient,value_learning,nonlinear_bellman}.py |
| Modify the experience replay (PER, R2D2 sequence buffer, NGU episodic memory) | deep_rl_zoo/replay.py |
| Change distributed actor/learner orchestration | deep_rl_zoo/main_loop.py, deep_rl_zoo/distributed.py |
| Add a new algorithm | Copy any existing folder (e.g. deep_rl_zoo/dqn/); it has agent.py, run_classic.py, run_atari.py, eval_agent.py |
| Tune hyperparameters (CLI) | Top of each run_atari.py (absl-py FLAGS) |
| Add a new API endpoint | frontend/server.py (FastAPI app); re-export tests in frontend/test_server.py |
| Wire a UI panel to real data | frontend/app.js: fetch from /api/..., render into existing markup in frontend/index.html |
| Support more checkpoint types in the WebSocket play path | frontend/stream_eval.py: add a factory in ALGO_FACTORIES |

The hyperparameters in the upstream code are not fine-tuned (author's own caveat). Any HP sweep is a useful experiment.

7. External pre-trained Atari agents (different ecosystem)

If you want trained agents on more games right now, the largest public source is the Stable-Baselines3 Zoo on HuggingFace: https://huggingface.co/sb3. Coverage: DQN, PPO, QR-DQN, A2C × ~25 popular Atari games.

These are not integrable with this venv — SB3 needs gymnasium + a recent ale-py, while deep_rl_zoo is locked to gym 0.25.2 + ale-py 0.7.5. The two stacks are mutually incompatible because ale-py 0.7.5 only auto-registers env names with old gym, and ale-py 0.10+ only auto-registers with gymnasium.

If you want to use them, set up a separate venv in a sibling directory (NOT inside this Atari57/ folder):

mkdir -p ../Atari57_sb3 && cd ../Atari57_sb3
uv venv --python 3.11 .venv
VIRTUAL_ENV=.venv uv pip install "stable-baselines3[extra]" sb3-contrib huggingface-sb3 "ale-py>=0.10" "gymnasium[atari]"

Then load any agent via huggingface_sb3.load_from_hub(repo_id="sb3/dqn-PongNoFrameskip-v4", filename="dqn-PongNoFrameskip-v4.zip") and stable_baselines3.DQN.load(...). Note that this sibling-venv recipe is unverified — it's the standard SB3 quick-start, but I haven't actually built it on this machine. It exists here as a starting point, not a guarantee.

9. Common gotchas

No module named 'snappy': python-snappy needs brew install snappy first; setup.sh handles this.

Could not initialize NNPACK (CPU warning) — harmless on M1.

Deprecation warnings about old gym step API and np.bool8 — expected. gym==0.25.2 is pinned because deep_rl_zoo predates the gym → gymnasium migration. Don't bump it.

render_mode warnings during eval — gym 0.25.2 quirk; the MP4 still renders correctly.

Slow training on CPU — Atari training is GPU-friendly. The upstream code only switches between CUDA and CPU (every run_*.py has the same line). On Mac it falls back to CPU even when MPS is available.
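One way to let a run pick up Apple's MPS backend is a three-way device choice. The sketch below is not the upstream code: pick_device and detect_device are hypothetical helpers, and whether every deep_rl_zoo operation runs correctly (or faster) on MPS has not been tested here.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends exist; fall back to "cpu" without torch."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on older PyTorch builds, so look it up safely.
    mps = getattr(torch.backends, "mps", None)
    return pick_device(
        torch.cuda.is_available(),
        mps is not None and mps.is_available(),
    )
```

The resulting string can be passed to torch.device(...) in place of the CUDA-or-CPU choice in a run_*.py file you are experimenting with.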

10. Bibliography

Canonical reads for the headline algorithms in this repo:

These map to deep_rl_zoo/{agent57,r2d2,ngu,iqn,rainbow}/.

Author

Jose Lopez — AI engineer in Madrid, working on the intersection of biological and artificial intelligence.

About

Ready-to-run sandbox of michaelnny/deep_rl_zoo (PyTorch deep RL: DQN, PPO, R2D2, Agent57, more) — uv venv, working bundled checkpoints, M1 arm64 native, CI.
