Pokemon LLM Harness

A local harness for running LLM agents against Pokemon Red/Blue, with live gameplay, structured traces, save states, replay, and turn-by-turn observability.

Use it to watch what an agent saw, what it decided, which button it pressed, and how a run can be inspected or replayed afterwards.

Features

Live browser UI with gameplay on the left and agent/environment traces on the right.
Turn-based trace cards for observations, decisions, actions, LLM calls, screenshots, and raw payloads.
Save states and checkpoints for pausing, rewinding, branching, and replaying runs.
Provider-neutral harness API: write an agent in Python or call the HTTP API from another language.
Example LLM agent using an OpenAI-compatible provider.

Requirements

Python 3.12+
uv
Node.js and npm
RGBDS, used to build local Pokemon Red/Blue-compatible ROMs from source

The documented setup is expected to work on macOS, Linux, and WSL. Native Windows is not currently verified; WSL is the recommended Windows path.

Setup

Install RGBDS:

# macOS
brew install rgbds

On Linux or WSL, install RGBDS through your package manager or from the RGBDS project instructions.

Run the project setup:

scripts/setup.sh

This installs Python/UI dependencies, builds local ROM-compatible binaries, starts a temporary backend, and creates the default bedroom save state.

Launch

scripts/dev.sh

This starts the backend, UI, and all agents listed in agents.yaml. Open http://localhost:5173, select an agent from the harness dropdown, and click Play.

Agent stdout/stderr is written to logs/<agent>.log.

To add or remove agents, edit agents.yaml:

agents:
  - name: my_agent
    module: harness.examples.my_agent

Build Your Own Agent

Create a subclass of PokemonAgent, set a name, and implement run():

from harness import PokemonAgent

class FirstAgent(PokemonAgent):
    name = "First Agent"
    model = "qwen/qwen3.6-flash"

    def run(self) -> None:
        while not self.should_stop():
            state = self.state()

            self.emit("observation", {"pokemon": state["pokemon"]})
            self.emit("decision", {"action": "RIGHT", "reasoning": "Moving toward the exit."})
            self.press("RIGHT")

if __name__ == "__main__":
    FirstAgent().serve()

A fuller working template is in harness/examples/first_agent.py.

Inside run(), the main helpers are:

Method	Description
`screenshot_bytes()`	Current game screen as PNG bytes
`screenshot(path)`	Save the current game screen to a file
`state()`	Current game state: map, position, party, screen hash
`press(button)`	Press A / B / UP / DOWN / LEFT / RIGHT / START / SELECT
`sequence(steps)`	Run button/wait steps as one atomic sequence
`save_state(name)`	Save a run-local checkpoint
`load_state(name)`	Load a run-local or shared checkpoint
`emit(type, payload)`	Add a structured event to the trace UI
`should_stop()`	Check whether the UI asked the agent to stop

Override serialize_history() and restore_history(data) if your agent has message history, memory, or planning state that should rewind with a checkpoint.

Use turn() to group one logical agent step:

with self.turn(goal="leave the bedroom"):
    state = self.state()
    self.emit("observation", {"pokemon": state["pokemon"]})
    self.emit("decision", {"action": "RIGHT", "reasoning": "Moving toward the exit."})
    self.press("RIGHT")

LLM Providers

harness.llm.LLMClient wraps provider calls with retry/backoff and normalized response/error payloads. The example agent uses OpenRouter by default:

from harness.llm import LLMClient, provider_from_env

llm = LLMClient(provider_from_env("openrouter"))
response = llm.chat(messages, model="qwen/qwen3.6-flash")

Built-in provider presets:

Preset	Env var	Notes
`openrouter`	`OPENROUTER_API_KEY`	Default example path
`openai`	`OPENAI_API_KEY`	Uses the OpenAI SDK default base URL
`gemini`	`GEMINI_API_KEY`	Uses Gemini's OpenAI-compatible endpoint

For another OpenAI-compatible provider, pass your own LLMProviderConfig. For a different API shape, implement the small LLMProvider protocol.

Useful Commands

Run backend tests and the frontend production build:

scripts/verify.sh

Replay button actions from a recorded environment trace:

uv run python -m harness.replay runs/<run-id>/env.jsonl

See docs/architecture.md for the repository layout and API shape.

Legal Boundary

This project does not download or distribute commercial ROM files. The setup script can build local ROM-compatible binaries from pret/pokered source for personal development, but you are responsible for making sure your use complies with applicable law.

Generated ROMs, save states, traces, and cloned upstream source are local-only artifacts and are ignored by git.

License

This project is released under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
docs		docs
env		env
harness		harness
roms		roms
runs		runs
scripts		scripts
states		states
tests		tests
third_party		third_party
ui		ui
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LESSONS.md		LESSONS.md
LICENSE		LICENSE
README.md		README.md
TODO.md		TODO.md
agents.yaml		agents.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock
verify.md		verify.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pokemon LLM Harness

Features

Requirements

Setup

Launch

Build Your Own Agent

LLM Providers

Useful Commands

Legal Boundary

License

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Pokemon LLM Harness

Features

Requirements

Setup

Launch

Build Your Own Agent

LLM Providers

Useful Commands

Legal Boundary

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages