diff --git a/.agents/skills/README.md b/.agents/skills/README.md index cbba333219..d04f48e548 100644 --- a/.agents/skills/README.md +++ b/.agents/skills/README.md @@ -25,6 +25,14 @@ Creates tabbed code examples (TCEs) across 12 client languages (Python, Node.js **Assets**: Contains reference templates and `*_TEST_PATTERNS.md` files for each language in the `assets/` subdirectory. +### `redis-use-case-ports` + +Orchestrates a full Redis use-case implementation across all 9 supported client libraries (`redis-py`, `node-redis`, `go-redis`, Jedis, Lettuce, StackExchange.Redis, Predis, `redis-rb`, `redis-rs`) using a parallel-build + synthesise + audit workflow. + +**Use when**: A new use case (cache-aside, session store, rate limiter, leaderboard, etc.) needs to be ported to all 9 clients with consistent helper APIs, demo behaviour, and prose structure — and you want to use parallel sub-agents rather than implementing serially. + +**Assets**: Contains `brief-template.md` (for parallel build agents), `report-template.md` (structured agent output), `audit-checklist.md` (known bug classes — a living document), `cross-diff-checklist.md` (consistency rules across clients), `redis-conventions.md` (repo-specific layout and Hugo conventions), and `html-template.html` (shared demo UI). + ## Setup The `generate-tce-examples` agent skill requires a very specific setup that includes (1) a clone of the `redis/docs` repo and diff --git a/.agents/skills/redis-use-case-ports/SKILL.md b/.agents/skills/redis-use-case-ports/SKILL.md new file mode 100644 index 0000000000..1c1513fe79 --- /dev/null +++ b/.agents/skills/redis-use-case-ports/SKILL.md @@ -0,0 +1,152 @@ +--- +name: redis-use-case-ports +description: Port a Redis use-case example (cache-aside, session store, rate limiter, leaderboard, etc.) to all 9 supported client libraries in parallel, with cross-client synthesis and audit +--- + +# Redis Use-Case Client Ports + +This skill describes how to implement a single Redis use case across all 9 client libraries (`redis-py`, `node-redis`, `go-redis`, Jedis, Lettuce, StackExchange.Redis, Predis, `redis-rb`, `redis-rs`) — fast, consistently, and with a meaningful end-to-end review. + +It is intended for any agent that can spawn parallel sub-agents and synthesise their outputs. It assumes the host repo is [`redis/docs`](https://github.com/redis/docs) and that each use case lives under `content/develop/use-cases//`. + +## When to use this skill + +Use this skill when: + +- A new Redis use case has been scoped (intro page draft, helper API surface, demo behaviour). +- The same example needs to be implemented across all 9 client libraries with consistent helper API, demo behaviour, and prose structure. +- You want to avoid the consistency drift, missed audits, and per-client divergence that come with implementing 9 ports serially. + +Do NOT use this skill for: + +- Single-language code samples (use [`generate-tce-examples`](../generate-tce-examples/SKILL.md) for tabbed multi-language examples within a single doc page). +- Bug fixes on an existing use case (one targeted edit doesn't need the fan-out). +- Cross-cutting refactors that touch many use cases at once (handle per-use-case, in sequence). + +## Workflow at a glance + +The workflow has six phases. Phase 1 is sequential and human-reviewed; phases 2–6 are mechanical and benefit from parallelism + automated review. + +1. **Reference implementation** (sequential, human-in-the-loop) — One language (default: `redis-py`). Establishes conventions and acts as the spec for the other 8. +2. **Parallel build** — Spawn 8 sub-agents in one message, one per remaining client. Each follows [`assets/brief-template.md`](assets/brief-template.md) and returns a report per [`assets/report-template.md`](assets/report-template.md). +3. **Synthesis** — Read all 8 reports. Identify cross-cutting issues, spec ambiguities, and retrofit candidates (including for the reference implementation). +4. **Targeted audit** — Spawn code-reviewer sub-agents per bug class from [`assets/audit-checklist.md`](assets/audit-checklist.md), each scanning all 9 implementations for that specific class. +5. **Retrofit** — Apply fixes. Parallel fan-out if mechanical; sequential (in the main thread) if judgement-heavy or needing user input. +6. **Cross-client diff** — Final consistency pass against [`assets/cross-diff-checklist.md`](assets/cross-diff-checklist.md). + +Each phase has a specific output and gate; do not skip phases. + +## Phase 1 — Reference implementation + +Goal: produce one working client implementation that establishes every convention the other 8 will follow. + +Steps: + +1. Draft the use-case landing page (`content/develop/use-cases//_index.md`) following the section skeleton in [`assets/redis-conventions.md`](assets/redis-conventions.md#use-case-landing-page). +2. Implement the reference client. Default to `redis-py` because Python's syntax surfaces design decisions most clearly. +3. End-to-end smoke-test against a local Redis. Verify the demo runs in a browser and the headline behaviours work. +4. **Pause for user review** before continuing to Phase 2. The reference implementation's conventions propagate to 8 parallel sub-agents — getting it right matters. + +Outputs of Phase 1: + +- `content/develop/use-cases//_index.md` (landing page) +- `content/develop/use-cases//redis-py/` (full implementation: guide + helper + primary + demo server) +- A captured helper API surface (method names, return shapes) that all other clients must match + +## Phase 2 — Parallel build + +Goal: produce 8 client implementations in one wall-clock unit instead of eight. + +Spawn 8 `general-purpose` sub-agents in a **single message** (multiple Agent tool calls in one turn so they run concurrently). Each gets: + +- The full brief from [`assets/brief-template.md`](assets/brief-template.md), filled in with the use case specifics. +- A pointer to the reference implementation (path + helper API surface). +- The HTML template at [`assets/html-template.html`](assets/html-template.html), to be inlined in the target language's preferred string-literal style. +- The target client (one of: `nodejs`, `go`, `java-jedis`, `java-lettuce`, `dotnet`, `php`, `ruby`, `rust`). +- A **required** report template (see [`assets/report-template.md`](assets/report-template.md)) the agent must fill in and return. + +Sub-agents must: + +- Smoke-test their implementation against Redis on a unique port (avoid 8080 / 8769 / and other ports commonly in use). +- Run `redis-cli FLUSHDB` before their tests. +- Report stampede-test results (concurrency, primary reads, elapsed ms). +- NOT assume cross-client coordination — each agent works in isolation. + +Outputs of Phase 2: + +- Eight client implementations, each with `_index.md`, helper file, primary file, demo server, and any language-specific config (`go.mod`, `Cargo.toml`, `*.csproj`). +- Eight structured reports collected in the calling agent's context. + +## Phase 3 — Synthesis + +Goal: extract cross-cutting findings that no single sub-agent could see. + +Read all 8 reports plus your memory of the reference implementation. Look for: + +- **Repeated spec ambiguities.** If 3+ agents asked the same question, the spec is unclear. Fix it. +- **Pattern divergence.** Agents that diverged in the same way usually indicate a missing convention. Add it. +- **Cross-client asymmetries that matter.** *"Lettuce flagged connection-scoped transaction state; Jedis didn't mention it because Jedis uses per-call connections from a pool"* — this comparison is only visible at synthesis time. +- **Retrofit candidates for the reference implementation.** Sub-agents often find better idioms; if one applies cleanly to the reference, propose it. + +Decide which findings should: + +- Apply to all clients (retrofit pass in Phase 5). +- Apply to a subset (document why). +- Stay as per-client deviations (document why in the relevant guide's "Production usage" section). + +Surface judgement-heavy decisions to the user before Phase 5. Do not silently apply cross-cutting changes that affect the user-facing API or prose. + +## Phase 4 — Targeted audit + +Goal: catch known bug classes before users do. + +For each row in [`assets/audit-checklist.md`](assets/audit-checklist.md), spawn an `Explore` or code-reviewer sub-agent that scans all 9 implementations for **that specific class**. Audits are sharper when scoped to one bug pattern at a time — broad "look for issues" passes miss the specific stuff. + +Examples of audit prompts: + +- *"Audit every WATCH/MULTI/EXEC site in the 9 client implementations at `content/develop/use-cases//` for whether they are covered by serialisation when the connection is shared. Flag any uncovered site."* +- *"Audit every deadline / timeout calculation in the 9 client implementations for integer overflow. The .NET case uses Environment.TickCount; check that all clients use a clock that doesn't wrap within the relevant window."* + +Append any new bug classes discovered during this audit (or by external review like Cursor's bugbot) to [`assets/audit-checklist.md`](assets/audit-checklist.md). This file is a living document — every future project benefits from the rows added in this one. + +## Phase 5 — Retrofit + +Goal: apply the fixes from synthesis + audit. + +Decision tree: + +- **Mechanical change in same place across all clients** (e.g., rename a method) → spawn N parallel agents, each given the same instruction. +- **Mechanical change with per-client variations** (e.g., apply the equivalent of `TickCount64` for each language's clock) → sequential in the main thread, lower risk of subtle per-client bugs. +- **Judgement-heavy change** (e.g., "add not-found sentinel caching") → discuss with user, then sequential. + +After retrofitting, run smoke-tests for the affected clients again. A retrofit that breaks the smoke test is worse than the original bug. + +## Phase 6 — Cross-client diff + +Goal: confirm the 9 clients are consistent on the things that should be consistent. + +Run a final pass against [`assets/cross-diff-checklist.md`](assets/cross-diff-checklist.md). This can be done by a sub-agent with read-only access. Report any divergences and either justify them (per-language idiom) or fix them. + +## Anti-patterns to avoid + +- **Spawning Phase 2 before user-reviewed Phase 1.** Conventions established in Phase 1 propagate to 8 agents. Getting Phase 1 wrong wastes 8 implementations of work. +- **Treating sub-agent reports as just code.** If you don't ask for structured insight in the brief, you don't get it. Phase 3 synthesis depends on Phase 2 reports being rich. +- **Skipping Phase 4 because everything compiled.** Compilation passes are not bug audits. The Lettuce `txLock`-missing-in-`loadWithSingleFlight` bug and the .NET `Environment.TickCount` 24.9-day wraparound both compiled cleanly and passed smoke tests. Phase 4 exists specifically to catch this kind of issue. +- **Letting `audit-checklist.md` stagnate.** The checklist's value compounds across projects. Every external bug found should be added. +- **Doing Phase 5 retrofit purely sequentially "to be safe."** That erases the time savings of Phase 2. Use the decision tree. + +## Maintenance + +This skill grows over time: + +- New bug classes go into [`assets/audit-checklist.md`](assets/audit-checklist.md). +- New consistency rules go into [`assets/cross-diff-checklist.md`](assets/cross-diff-checklist.md). +- HTML/UI improvements go into [`assets/html-template.html`](assets/html-template.html) once and propagate via the next use case. +- Redis-docs-specific conventions go into [`assets/redis-conventions.md`](assets/redis-conventions.md). + +Keep `SKILL.md` itself focused on the workflow. The concrete artefacts live in `assets/`. + +## Reference projects + +- [`content/develop/use-cases/cache-aside/`](../../../content/develop/use-cases/cache-aside/) — the worked example that produced this skill. Includes the conventions, helper API shape, and prose structure that the templates encode. +- [`content/develop/use-cases/session-store/`](../../../content/develop/use-cases/session-store/) — earlier worked example. Same shape, different helper API. diff --git a/.agents/skills/redis-use-case-ports/assets/audit-checklist.md b/.agents/skills/redis-use-case-ports/assets/audit-checklist.md new file mode 100644 index 0000000000..16e56d0bb1 --- /dev/null +++ b/.agents/skills/redis-use-case-ports/assets/audit-checklist.md @@ -0,0 +1,215 @@ +# Audit checklist — known bug classes + +Phase 4 of the [`redis-use-case-ports`](../SKILL.md) workflow runs one audit sub-agent per row in this checklist, each scanning all 9 client implementations for **that specific class**. + +This file is a **living document**. Every time external review (Cursor bugbot, human code review, a bug filed by a user) catches a class of bug in one of these implementations, add a row. Future projects benefit automatically. + +Each row has: + +- **Class** — short name. +- **What to scan for** — the specific pattern to grep / read for. +- **Pass criterion** — what the right answer looks like. +- **Sample audit prompt** — copy-paste-able prompt for the audit sub-agent. +- **Why this is on the list** — provenance. + +--- + +## 1. WATCH/MULTI/EXEC coverage when a connection is shared + +**What to scan for:** every site that issues `MULTI`, `WATCH`, transaction-style command blocks, or `pipeline()` with `.atomic()` semantics. + +**Pass criterion:** if the client uses a single shared connection across threads/tasks (e.g. Lettuce `StatefulRedisConnection`, redis-rs `ConnectionManager` without pooling), every transaction block must be serialised — either by an in-process lock, or by checking out a dedicated connection from a pool per transaction. If the client uses per-call connection acquisition (e.g. JedisPool, ConnectionMultiplexer), no extra lock is needed but the transaction must complete within a single acquired connection. + +**Sample audit prompt:** + +> Audit every WATCH/MULTI/EXEC and transactional pipeline site across all 9 client implementations under `content/develop/use-cases/{{USE_CASE_NAME}}/`. For each site, identify whether the connection is shared across concurrent callers. If shared, confirm the transaction is serialised. Flag any uncovered site with file path and line number. Read at least the helper file in each client directory. + +**Why on list:** Cursor bugbot caught a missing `txLock` around `loadWithSingleFlight`'s repopulate `MULTI`/`EXEC` in the cache-aside Lettuce port. The `updateField` site had the lock; the repopulate site didn't. ([PR #3291 comment](https://github.com/redis/docs/pull/3291#discussion_r3209490076)) + +--- + +## 2. Deadline arithmetic overflow + +**What to scan for:** any place a "deadline" or "until" value is computed by adding a duration to a monotonic clock. Especially `Environment.TickCount`, `os.clock()`, anything that returns 32-bit milliseconds. + +**Pass criterion:** the clock used has a wrap interval much greater than the deadline window. For 32-bit millisecond counters (~24.9-day wrap), the deadline calculation must use a 64-bit clock variant (`Environment.TickCount64`, `time.monotonic_ns`, `System.nanoTime`, `Instant`). + +**Sample audit prompt:** + +> Audit every deadline or timeout calculation in the 9 client implementations under `content/develop/use-cases/{{USE_CASE_NAME}}/`. For each, identify the clock source and its wrap interval. Flag any case where the clock could wrap between `start + delta` and the loop comparison. Particularly check .NET (`TickCount` vs `TickCount64`), Java (`currentTimeMillis` vs `nanoTime`), Rust (`Instant`), Python (`time.monotonic`), Go (`time.Now`). + +**Why on list:** Cursor bugbot caught `Environment.TickCount + _lockTtlMs` in the cache-aside .NET port — wraps to a negative value every 24.9 days, immediately exiting the poll loop and defeating stampede protection at the wraparound boundary. ([PR #3291 comment](https://github.com/redis/docs/pull/3291#discussion_r3209490088)) + +--- + +## 3. Lua single-flight lock release with caller token + +**What to scan for:** the release script for the single-flight lock. + +**Pass criterion:** the release script must check that the caller still owns the lock before deleting. The pattern is: + +```lua +if redis.call('GET', KEYS[1]) == ARGV[1] then + return redis.call('DEL', KEYS[1]) +end +return 0 +``` + +A naked `DEL` without the token check is a bug: if the lock expired and was re-acquired by another caller, the original caller's unconditional `DEL` would release someone else's lock. + +**Sample audit prompt:** + +> Audit the Lua lock release path in every client implementation under `content/develop/use-cases/{{USE_CASE_NAME}}/`. Confirm each one checks the unique caller token before deleting. Flag any naked `DEL` of the lock key. Also confirm the token is unique per caller (not shared via a constant). + +**Why on list:** Standard Redlock-style pattern. Easy to forget, especially when porting from a reference that has it. + +--- + +## 4. ThreadPool / runtime ramp-up under synchronous polling + +**What to scan for:** stampede polling loops that block on `Thread.Sleep`, `usleep`, `tokio::time::sleep`, etc., when the spawned tasks rely on an automatically-growing thread pool. + +**Pass criterion:** either (a) the polling is async/non-blocking, (b) the runtime is pre-warmed (e.g. `ThreadPool.SetMinThreads`), or (c) the spawn count is bounded to fit within typical runtime defaults. A synchronous poll loop in a runtime with slow worker ramp-up will time out and trigger fall-through to the loader, breaking stampede protection. + +**Sample audit prompt:** + +> Audit how each client's stampede test spawns concurrent callers and whether the polling/wait path uses synchronous blocking. For runtimes with elastic thread pools (.NET, Java with `Executors.newCachedThreadPool`, Python with default `ThreadPoolExecutor`), confirm the spawn count is matched by an appropriately sized pool, OR the runtime has been pre-warmed. Flag any synchronous blocking poll loop running in a starved pool. + +**Why on list:** .NET cache-aside port initially produced 2 primary reads instead of 1 for 25-concurrent stampede — the .NET ThreadPool was growing by ~2 threads/sec and the polling threads were starving. Fixed by `ThreadPool.SetMinThreads(64, 64)` at startup, documented in the production-usage section. + +--- + +## 5. Stateless-runtime state assumptions + +**What to scan for:** in-process state (counters, caches, primary record store) in any client whose runtime is request-per-process (PHP under `php -S`). + +**Pass criterion:** state that must persist across requests lives outside the process (typically in Redis under a `demo:*` keyspace) for stateless runtimes. + +**Sample audit prompt:** + +> Identify which client implementations run under a stateless request-per-process model. For those, audit any in-process state (counters, primary record store) and confirm it lives in Redis or some other shared store. Flag any in-process state that wouldn't survive a request boundary. + +**Why on list:** PHP cache-aside port — stats counters in process memory wouldn't survive across `php -S` requests; primary record updates wouldn't either. Resolved by storing everything in Redis under `demo:cache_stats` and `demo:primary:*`. + +--- + +## 6. Stampede test concurrency vs lock TTL + +**What to scan for:** the relationship between (a) the stampede test's concurrency limit, (b) the lock TTL, (c) the primary read latency. + +**Pass criterion:** lock TTL > (primary read latency × safety margin). Concurrency × (poll interval + Redis RTT) << lock TTL, otherwise polling callers fall through. Document the math somewhere in the production-usage section. + +**Sample audit prompt:** + +> For each client implementation, identify the lock TTL, the primary read latency, the polling interval, and the maximum stampede concurrency the demo allows. Confirm the inequality: lock TTL > primary read latency × 2, and (concurrency × Redis RTT) << lock TTL. Flag any client where these don't hold. + +**Why on list:** Multiple ports needed a higher lock TTL or smaller wait_poll_ms to keep all stampede callers within the polling window. Easy to break by changing one constant in isolation. + +--- + +## 7. Negative-result caching policy + +**What to scan for:** what the helper does when `loader()` returns null/None/nil for a non-existent key. + +**Pass criterion:** behaviour is consistent across clients and documented. Either (a) don't cache the negative result (current default — but every probe re-hits primary), or (b) cache a short-TTL sentinel with explicit invalidation on creation. Whichever is chosen, all 9 clients must agree. + +**Sample audit prompt:** + +> For each client, identify the behaviour when `loader` returns a not-found result (`null`, `None`, `nil`, `Option::None`). Confirm all 9 clients have the same policy. Flag any inconsistency. Confirm the `_index.md` "Production usage" section discusses negative caching. + +**Why on list:** Subtle inconsistency risk. Some ports might cache the empty hash, some might not — the reference's behaviour propagates by accident, not by spec. + +--- + +## 8. Hit-rate calculation precision and integer-division bugs + +**What to scan for:** the formula that produces `hit_rate_pct`. + +**Pass criterion:** when `total == 0`, returns 0.0 without dividing by zero. Otherwise, returns a value rounded to 1 decimal place using consistent rounding (banker's rounding, round-half-up, etc. — pick one and use it everywhere). Beware integer-division bugs: `(100 * hits) / total` in a statically-typed language can truncate if `hits`/`total` are ints. + +**Sample audit prompt:** + +> Compare the `hit_rate_pct` calculation across all 9 client implementations. Run mental tests: total=0, total=3 hits=1, total=7 hits=1, total=10 hits=7. Confirm all 9 produce consistent output (within rounding-mode noise). Flag any zero-division risk or integer-truncation bug. + +**Why on list:** Easy to get subtly wrong per language. The reference output is the test oracle. + +--- + +## 9. Demo HTTP port conflicts + +**What to scan for:** the default port used by the demo server. + +**Pass criterion:** different clients use *different* default ports (or the same default and the user is expected to override). All clients accept `--port` (or equivalent flag/env var). No client hardcodes a port commonly held by other dev tooling (e.g. `8080` is fine, but check `8769` is not on macOS Okta's reserved list). + +**Sample audit prompt:** + +> List the default port used by each of the 9 demo servers and the flags they accept to override. Flag any client that uses an unusual default or doesn't accept an override. + +**Why on list:** During parallel smoke-testing, a leftover demo server on a shared port can hijack subsequent test requests. Each agent should pick a unique port in their brief. + +--- + +## 10. Compiled-artefact cleanup + +**What to scan for:** working tree state after smoke-testing. + +**Pass criterion:** the client directory contains source files + config files only. No `*.class`, `bin/`, `obj/`, `target/`, `node_modules/`, `vendor/`, `Cargo.lock` (unless committed), etc. + +**Sample audit prompt:** + +> For each client implementation, list every file in the directory and flag any compiled artefacts, dependency directories, or lock files that should be in `.gitignore`. Refer to the standard list in `.gitignore` (or build one). + +**Why on list:** Easy to forget after a smoke-test pass. Multiple of the cache-aside ports left build output that had to be cleaned manually. + +--- + +## 11. Token-checked atomic state transitions (multi-step state machines) + +**What to scan for:** any helper that issues a state transition (e.g. `complete`, `fail`, `reclaim`) keyed on a per-claim or per-session token. The pattern involves: HGET the token, compare against the caller's token, then LREM/LPUSH/HSET/EXPIRE on multiple keys. + +**Pass criterion:** the token check **and** the dependent multi-key writes must run inside a **single Lua script**, not split between client-side code and a follow-up pipeline. A client-side `HGET → if matches → MULTI/EXEC` pattern has a TOCTOU window: between the HGET and the EXEC, the reclaimer can move the job and a new worker can re-claim it, at which point the old worker's `LREM` removes the new claimant's processing entry and the `LPUSH` puts a duplicate ID into pending/completed/failed. + +**Sample audit prompt:** + +> Audit complete/fail/reclaim (or equivalent) in all 9 client implementations of `content/develop/use-cases/{{USE_CASE_NAME}}/`. For each, verify (a) the token check (`HGET claim_token` vs caller-token) happens **inside the same Lua script** as the LREM and the LPUSH, not in client code; (b) the script returns a non-success value when the token doesn't match; (c) the reclaim path **clears the token and resets the claim timestamp** atomically when it moves a job back to pending, otherwise a worker finishing its hang right after reclaim could still pass the token check on the next call. Flag any uncovered site. + +**Why on list:** Job-queue use case, Phase 4 audit. Same shape as row 3 (single-flight lock with token) but generalised to multi-step state machines where the token guards a `LREM + HSET + LPUSH` sequence, not just a `DEL`. + +--- + +## 12. Crash-window fallback timer on claimed-but-unwritten state + +**What to scan for:** any helper where a worker `BLMOVE`s (or `BRPOPLPUSH`es) an ID from a pending list into a processing list, then **immediately afterward** writes `claimed_at_ms` / `claim_token` via a follow-up call. There is a small window (microseconds, but real) where the worker process can die between the move and the write, leaving the ID stranded in the processing list with no `claimed_at_ms` for the reclaim sweep to compare against. + +**Pass criterion:** the reclaim script must include a fallback path that recovers the stranded job. The standard shape is: a job is stale if **either** `claimed_at_ms > 0 AND (now - claimed_at_ms) > visibility_ms` **or** `claimed_at_ms == 0 AND (now - enqueued_at_ms) > 2 × visibility_ms`. The fallback timer uses `enqueued_at_ms` (which was written by `enqueue()`) and a longer threshold (typically 2×) so it doesn't fire spuriously against a worker that's only milliseconds behind on writing its metadata. + +**Sample audit prompt:** + +> For each of the 9 client implementations in `content/develop/use-cases/{{USE_CASE_NAME}}/`, locate the reclaim Lua script. Confirm it handles the case where `claimed_at_ms` is missing or zero by falling back to `enqueued_at_ms` plus a longer threshold (typically `2 × visibility_ms`). Also confirm `enqueue()` writes `enqueued_at_ms` (so the fallback has something to compare against) and the reclaim path resets `claimed_at_ms` to 0 and clears `claim_token` when moving a job back to pending. Flag any port where the fallback is missing — those ports can lose work to a single kill-9. + +**Why on list:** Job-queue use case, Phase 4 audit. The race window is tiny but real, and the fix is mechanical (a few extra Lua lines). Every job-queue-style use case ported in the future should preserve this pattern. + +--- + +## 13. Shared-keyspace collision during parallel smoke tests + +**What to scan for:** the helper's default key prefix (cache name, queue name, namespace). When Phase 2 fans out 8 sub-agents in parallel, each runs its own demo server against the same local Redis. If every demo uses the same default prefix (`cache:product:*`, `queue:jobs:*`, etc.), the agents stomp on each other's state and the smoke tests become unreliable. + +**Pass criterion:** the helper is parameterised on a name/prefix argument, and the demo server exposes a `--queue-name` (or `--key-prefix`, `--cache-name`, etc.) flag / env var. Each Phase 2 agent uses a port-specific suffix during its smoke tests (e.g. `jobs-nodejs`, `jobs-go`) but the **default** in the shipping code remains unsuffixed so end-user docs and `redis-cli` examples are unchanged. The helper's `purge()` must scope its `SCAN MATCH` to `:{name}:*` using the parameterised name, not a hard-coded constant. + +**Sample audit prompt:** + +> For each of the 9 client implementations of `content/develop/use-cases/{{USE_CASE_NAME}}/`, verify the helper accepts a name/prefix parameter and the demo server exposes a CLI flag to override the default. Confirm `purge()` (or equivalent reset path) uses `SCAN MATCH` against the parameterised prefix, not a hard-coded literal. Flag any port where the default is hard-coded all the way through, since that port can't run smoke tests safely alongside its siblings. + +**Why on list:** Unanimous finding across all 8 sub-agent reports in the job-queue use case — every agent had to add their own queue-name flag mid-port to avoid colliding with the other 7 demos running against the same Redis. The brief should call this out from the start so it doesn't reappear. + +--- + +## How to add a new row + +When a bug class is identified after this skill has been used: + +1. Pick the next number. +2. Fill in **Class**, **What to scan for**, **Pass criterion**, **Sample audit prompt**, **Why on list**. +3. The "Why on list" entry should cite the source (PR comment URL, issue number, etc.) so future maintainers can verify the rule is still relevant. +4. Test the audit prompt by running it against a known-good codebase — it should produce 0 findings. A prompt that produces false positives is worse than no prompt. diff --git a/.agents/skills/redis-use-case-ports/assets/brief-template.md b/.agents/skills/redis-use-case-ports/assets/brief-template.md new file mode 100644 index 0000000000..f07fc1213a --- /dev/null +++ b/.agents/skills/redis-use-case-ports/assets/brief-template.md @@ -0,0 +1,108 @@ +# Parallel build agent brief — template + +Copy this template, fill in the `{{PLACEHOLDERS}}`, and pass the result to one sub-agent per target client in Phase 2 of the [`redis-use-case-ports`](../SKILL.md) workflow. + +Each sub-agent runs in isolation and produces one client implementation. Send all sub-agent invocations in a **single message** so they run concurrently. + +--- + +## Brief + +You are implementing the **`{{CLIENT}}`** client port of the **`{{USE_CASE_NAME}}`** Redis use case for the [`redis/docs`](https://github.com/redis/docs) repo. + +### Context + +This use case is being implemented across 9 client libraries in parallel. The Python (`redis-py`) reference implementation already exists at: + +- `{{REFERENCE_PATH}}` (typically `content/develop/use-cases/{{USE_CASE_NAME}}/redis-py/`) +- Reference files: `_index.md`, helper file (e.g. `cache.py`), primary file (`primary.py`), demo server (`demo_server.py`) + +Your output goes under: + +- `content/develop/use-cases/{{USE_CASE_NAME}}/{{CLIENT}}/` + +### What the helper must do + +Match the reference implementation's API surface exactly (method names, return shapes, behaviour). The reference is the spec. + +For this use case the helper exposes: + +{{HELPER_API_SURFACE}} + +### File layout for `{{CLIENT}}` + +Follow the conventions in [`assets/redis-conventions.md`](redis-conventions.md#per-client-file-layout). At a minimum: + +- A `_index.md` Hugo guide page following the section skeleton in `redis-conventions.md`. +- A helper class/module file (e.g., `Cache.cs`, `cache.go`, `RedisCache.java`). +- A `MockPrimaryStore` (or equivalent) with the same product records as the reference. +- A demo server that exposes HTTP endpoints: `GET /`, `GET /products`, `GET /read?id=`, `GET /stats`, `POST /invalidate`, `POST /update`, `POST /stampede`, `POST /reset`. +- Any language-specific config (`go.mod`, `Cargo.toml`, `*.csproj`, etc.). + +### HTML template + +Inline the shared HTML page in your demo server in your language's preferred string-literal style (e.g. multi-line `+`-concatenated string in Java, raw `r##"..."##` in Rust, here-doc in Ruby, etc.). The template with `{{OPTIONS}}`, `{{PRIMARY_LATENCY}}`, and `{{CACHE_TTL}}` placeholders lives at: + +- `.agents/skills/redis-use-case-ports/assets/html-template.html` + +Update only the **pill text** at the top of `` to reflect your client library + HTTP framework (e.g., `node-redis + Node.js standard http module`). + +### Stampede protection + +The helper must use a Lua single-flight lock to funnel concurrent cache misses through a single primary read. Both scripts live in the reference implementation: + +```lua +-- Acquire (returns 1 if lock acquired, 0 otherwise) +if redis.call('SET', KEYS[1], ARGV[1], 'NX', 'PX', ARGV[2]) then + return 1 +end +return 0 + +-- Release (only if we still own the lock) +if redis.call('GET', KEYS[1]) == ARGV[1] then + return redis.call('DEL', KEYS[1]) +end +return 0 +``` + +Tokenise the lock with a per-caller random value (8+ bytes hex). The release script must check the token. + +### Smoke tests you MUST run before reporting + +1. Start Redis if not running. Run `redis-cli FLUSHDB` to clear state. +2. Start your demo server on a **unique port** (avoid 8080, 8769, ports 8770–8775 if other agents may be using them; pick something in 8780–8830). +3. `GET /read?id=p-001` → expect cache miss, latency ≥ primary latency, `hit: false`. +4. `GET /read?id=p-001` → expect cache hit, latency << primary latency, `hit: true`. +5. `POST /invalidate id=p-001`. +6. `POST /update id=p-002 field=stock value=99`. Then `GET /read?id=p-002` should return updated record with `hit: false` (the update invalidates). +7. `POST /stampede id=p-003 concurrency=20`. **Expect exactly 1 primary read for 20 concurrent callers.** Anything else is a stampede-protection failure that must be fixed before you report. +8. `POST /reset` and verify stats zero out. +9. Stop the demo server. Clean up any compiled artefacts (`*.class`, `bin/`, `obj/`, `target/`, `node_modules/`) so the working tree is clean. + +### Documentation conventions + +- **Source links use full GitHub blob URLs**, not relative paths. The `_index.md` is rendered to HTML and served from `redis.io/docs/...`, so a `[source](cache.py)` link 404s in production. Use `[source](https://github.com/redis/docs/blob/main/content/develop/use-cases/{{USE_CASE_NAME}}/{{CLIENT}}/cache.py)` instead, with the full path to *this* port's folder. +- **`## Running the demo` opens with a `### Get the source files` subsection** containing a `curl` block that pulls the port's files from GitHub raw URLs into a fresh local directory, followed by a `### Start the demo server` subsection with the run command. See the "Source links and 'Get the source files' block" section of [`assets/redis-conventions.md`](redis-conventions.md) for the exact shape. + +### Conventions to follow + +- Method names match the reference (`get`, `invalidate`, `update_field`/`updateField`, `stats`, `reset_stats`/`resetStats`, `ttl_remaining`/`ttlRemaining`). +- Stats response shape: `{ hits, misses, stampedes_suppressed, hit_rate_pct, primary_reads_total, primary_read_latency_ms }`. +- Read response shape: `{ id, record, hit, redis_latency_ms, total_latency_ms, ttl_remaining, stats }`. +- Stampede response shape: `{ id, concurrency, primary_reads, elapsed_ms, results, stats }`. +- Cache key prefix `cache:product:` (or whatever the reference uses for this use case). +- Lock key prefix `lock:cache:product:` (parallel to cache key). +- TTL in seconds; lock TTL in milliseconds. Configurable via constructor args. +- See [`assets/redis-conventions.md`](redis-conventions.md) for the full convention list. + +### Stop before you report if + +- You can't make stampede produce exactly 1 primary read for 20 concurrent callers. +- You can't get the demo to compile/run end-to-end. +- The reference implementation's API doesn't translate cleanly to your language and you need a deviation that you can't justify with a clear one-line rationale. + +In any of those cases, report what you tried and why it doesn't work. Do NOT silently ship a broken implementation. + +### Report format + +When you're done (or stuck), return a single response in the format described in [`assets/report-template.md`](report-template.md). Fill in every field. The report drives the synthesis phase — terse "it worked" reports are not useful. diff --git a/.agents/skills/redis-use-case-ports/assets/cross-diff-checklist.md b/.agents/skills/redis-use-case-ports/assets/cross-diff-checklist.md new file mode 100644 index 0000000000..83ce2447d1 --- /dev/null +++ b/.agents/skills/redis-use-case-ports/assets/cross-diff-checklist.md @@ -0,0 +1,224 @@ +# Cross-client diff checklist + +Phase 6 of the [`redis-use-case-ports`](../SKILL.md) workflow runs this checklist across all 9 client implementations to catch consistency drift. + +This is **not** a bug audit (that's `audit-checklist.md`). It's about things that should be **the same** across clients but may have drifted because 9 agents wrote them independently. + +A sub-agent can run this in read-only mode. For each row, produce a 9-column comparison (one column per client) and flag any column that disagrees with the majority. + +--- + +## Helper API surface + +| Concern | What to compare | +|---|---| +| Method names | `get`, `invalidate`, `update_field`, `stats`, `reset_stats`, `ttl_remaining` (camelCase or snake_case per language idiom). All 9 must expose the same set of operations. | +| Method signatures | Argument names + types should match the reference's semantics. Loader callbacks accept an entity ID and return either a record or null/None/nil. | +| Return shapes | The `get` return must include the record, a hit flag, and the measured Redis round-trip latency. The names may be language-idiomatic but the fields must align. | +| Stats shape | `{ hits, misses, stampedes_suppressed, hit_rate_pct, primary_reads_total, primary_read_latency_ms }`. JSON-serialised exactly like this for the demo's `/stats` endpoint. | + +## Redis command set + +| Concern | What to compare | +|---|---| +| Read path | `HGETALL` on every hit-check, no client should use individual `HGET`s. | +| Write path | `HSET` (with all fields) + `EXPIRE`, ideally pipelined or in a single `HSET ... EXPIRE` MULTI. | +| Invalidate | `DEL` (not `EXPIRE 0`, not `UNLINK`). | +| Field update | `HSET key field value` + `EXPIRE` inside a conditional transaction or `Condition.KeyExists`. | +| TTL inspection | `TTL` (not `PTTL`, not `OBJECT`). | +| Single-flight acquire | Lua script using `SET NX PX`. | +| Single-flight release | Lua script using `GET == token` check + `DEL`. | +| Counters (where stats are in Redis, e.g. PHP) | `HINCRBY`. | + +Any divergence (e.g. someone using `EXPIREAT` instead of `EXPIRE`, or `SET NX EX` instead of `PX`) should be flagged and either justified or fixed. + +## Default values + +| Concern | Standard value | +|---|---| +| Cache key prefix | `cache:product:` (or whatever the reference uses for the use case). | +| Lock key prefix | `lock:cache:product:` (mirror of cache key with `lock:` prefix). | +| Default TTL | 30 seconds. | +| Default lock TTL | 2000 milliseconds. | +| Default wait poll interval | 25 milliseconds. | +| Default primary latency | 150 ms. | +| Sample product IDs | `p-001`, `p-002`, `p-003`, `p-004`. | +| Sample fields | `id`, `name`, `price_cents`, `stock`. | + +Defaults should be overridable via constructor args (helper) or CLI flags / env vars (demo). + +## Demo server HTTP API + +| Endpoint | Method | Behaviour | +|---|---|---| +| `/` | GET | Serves the HTML demo page. | +| `/products` | GET | Returns `{ products: [...ids...] }`. | +| `/read?id=` | GET | Returns read result. 400 if `id` missing, 404 if unknown. | +| `/stats` | GET | Returns the stats object. | +| `/invalidate` | POST | Form `id=...`. Returns `{ id, deleted, stats }`. | +| `/update` | POST | Form `id=...&field=...&value=...`. Updates primary then invalidates cache. Returns `{ id, field, value, stats }`. | +| `/stampede` | POST | Form `id=...&concurrency=N`. Returns `{ id, concurrency, primary_reads, elapsed_ms, results, stats }`. | +| `/reset` | POST | Returns stats with all counters zeroed. | + +All endpoints return JSON (`application/json`) except `/`. Stats embedded in every mutation response so the UI stays current. + +## Response shapes + +`/read` response: + +```json +{ + "id": "p-001", + "record": { "id": "p-001", "name": "...", "price_cents": "650", "stock": "42" }, + "hit": false, + "redis_latency_ms": 1.23, + "total_latency_ms": 105.4, + "ttl_remaining": 30, + "stats": { ... } +} +``` + +`/stampede` response: + +```json +{ + "id": "p-002", + "concurrency": 20, + "primary_reads": 1, + "elapsed_ms": 124.5, + "results": [ + { "hit": false, "redis_latency_ms": 1.1, "found": true }, + ... + ], + "stats": { ... } +} +``` + +Field naming uses `snake_case` in JSON regardless of host language idiom, for consistency with the demo UI's JavaScript. + +## Frontmatter (each `_index.md`) + +Per-client guide page: + +```yaml +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Implement a Redis in with +linkTitle: example () +title: Redis with +weight: +--- +``` + +Landing page (`/_index.md`): + +```yaml +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: +hideListLinks: true +linkTitle: +title: Redis +weight: +--- +``` + +## Section structure of `_index.md` per client + +The guide page must follow this structure (with section titles adapted to the use case): + +1. Lead paragraph (1–2 sentences linking to the client docs). +2. ## Overview +3. ## How it works (numbered steps) +4. ## The cache-aside helper (or equivalent name) — code snippet showing the public API. +5. ### Data model +6. ## The relevant operations (one section per operation: stampede protection, invalidation, field updates, etc.) +7. ## Hit/miss accounting (or stats section, if applicable) +8. ## Prerequisites +9. ## Running the demo +10. ## The mock primary store +11. ## Production usage (with `###` subsections) +12. ## Learn more (links to commands and clients) + +The "Production usage" section should always discuss TTL choice, invalidate-don't-sync, missing-record handling, lock TTL tuning, and namespacing. Other items vary by language (connection pool, async API, etc.). + +## Prose consistency + +The same concept should use the same wording across all 9 guides. Drift examples to look for: + +| Concept | Preferred wording | +|---|---| +| Cache miss → primary lookup | "fall through to the primary" | +| Single-flight | "single-flight lock" (not "mutex" or "leader election") | +| Stampede | "cache stampede" | +| Invalidate | "delete the cache key" (not "expire", not "evict") | +| Not-found loader return | "missing record" | +| Primary | "primary database" or "primary store" | +| TTL bound | "bounded staleness" | + +## HTML demo page + +The visible HTML structure should be identical across clients. Each demo inlines [`assets/html-template.html`](html-template.html) with three placeholders: + +- `{{OPTIONS}}` — product `