Add experiment-setup skill#24

Open
elliotrfeinberg wants to merge 5 commits into main from experiment-setup-skill

Conversation

@elliotrfeinberg

Summary

Adds a new experiment-setup skill at plugins/mixpanel-mcp/skills/experiment-setup/, ported from the staging branch in mixpanel/analytics#95516.

What the skill does

A single skill covering setup-phase expertise for Mixpanel experiments — hypothesis framing, metric roles, statistical model, sizing, advanced features (CUPED / Winsorization / Bonferroni), XP-vs-FF routing, prior-experiment reuse, and pre-launch pitfalls. Replaces the per-capability tools listed in MULTI-581 (get_experiment_setup_guidance, check_statistical_config, recommend_advanced_features, parse_hypothesis, run_pre_launch_checks, explain_health_check).

Structured for progressive disclosure:

  • SKILL.md — skill entry point. Routes to references on demand.
  • references/hypothesis-framing.md — coaches the "Changing X will increase Y because Z" structure.
  • references/metric-selection.md — primary / guardrail / secondary roles, lagging-indicator trap.
  • references/sizing.md — MDE, power, baseline rate, traffic — Kohavi's inverted formula.
  • references/statistical-model.md — sequential vs fixed-horizon, confidence level.
  • references/advanced-features.md — CUPED, Winsorization, Bonferroni / Benjamini-Hochberg.
  • references/routing-xp-vs-ff.md — when the user wants an experiment vs a plain feature flag.
  • references/prior-experiments.md — how to fold prior results into a new design.
  • references/pitfalls.md — the deterministic check catalogue and the >5% guardrail hard-gate rationale.
  • evals/ — three fixtures (Pelando, Confetti, Polarsteps) seeded from PRD scenarios.
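For reviewers spot-checking references/sizing.md, here is a minimal sketch of the sizing arithmetic. It assumes "Kohavi's inverted formula" refers to the common rule of thumb n ≈ 16σ²/δ² per variant (about 80% power at a two-sided 5% significance level), solved for the detectable effect δ when traffic is fixed. The function names are hypothetical and are not taken from the skill or from the analytics repo.

```python
import math


def required_n_per_variant(baseline_rate: float, relative_mde: float) -> int:
    """Rule-of-thumb sample size per variant for a conversion metric.

    Assumes n = 16 * sigma^2 / delta^2, with sigma^2 = p * (1 - p)
    for a Bernoulli metric and delta the absolute effect to detect.
    """
    variance = baseline_rate * (1 - baseline_rate)
    delta = baseline_rate * relative_mde  # absolute lift
    return math.ceil(16 * variance / delta ** 2)


def detectable_relative_mde(baseline_rate: float, n_per_variant: int) -> float:
    """Inverted form: the relative MDE that a given sample size can detect.

    Solves the same rule of thumb for delta: delta = 4 * sigma / sqrt(n).
    """
    sigma = math.sqrt(baseline_rate * (1 - baseline_rate))
    delta = 4 * sigma / math.sqrt(n_per_variant)
    return delta / baseline_rate


if __name__ == "__main__":
    print(required_n_per_variant(0.10, 0.05))
    print(detectable_relative_mde(0.10, 57_600))
```

With a 10% baseline rate and a 5% relative MDE, this comes out to roughly 57,600 users per variant. The inverted form is the more useful one when traffic is the constraint, since it answers "what effect can we actually detect?" rather than "how many users do we need?".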

Content ports from ai/engine/tools/experiments/_guidance/setup.md and _shared/pitfall_prose.py in mixpanel/analytics.
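As context for the two multiple-testing corrections named in references/advanced-features.md, the sketch below shows the textbook Bonferroni and Benjamini-Hochberg procedures. It illustrates the general techniques only; it is not the platform's implementation, and the function names are hypothetical.

```python
def bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject each hypothesis whose p-value is at most alpha / m.

    Controls the family-wise error rate across m comparisons.
    """
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Step-up procedure that controls the false discovery rate.

    Finds the largest rank k with p_(k) <= (k / m) * alpha and rejects
    the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[index] = True
    return rejected


if __name__ == "__main__":
    metrics = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for four metrics
    print(bonferroni(metrics))
    print(benjamini_hochberg(metrics))
```

On the made-up p-values above, Bonferroni rejects only the two smallest, while Benjamini-Hochberg rejects all four. That gap is the trade-off between the two: Bonferroni is stricter, Benjamini-Hochberg keeps more power when many metrics are tested.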

Sync

mixpanel-mcp-eu and mixpanel-mcp-in are synced via make sync-skills FORCE=1. make check-skills-sync passes locally. The top-level README.md skills table includes the new row.
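For anyone who wants to verify the regional copies by hand, the following is a rough sketch of a byte-for-byte directory comparison. It is not what make check-skills-sync runs (that target's implementation is not shown in this PR), and the mirror paths in the usage lines are assumptions based on the plugin names above.

```python
import filecmp
from pathlib import Path


def diff_skill_dirs(source: Path, mirror: Path) -> list[str]:
    """Return relative paths that differ between a skill dir and its mirror."""
    problems: list[str] = []

    def walk(left: Path, right: Path, prefix: str) -> None:
        cmp = filecmp.dircmp(left, right)
        for name in cmp.left_only:
            problems.append(f"{prefix}{name} (missing in mirror)")
        for name in cmp.right_only:
            problems.append(f"{prefix}{name} (extra in mirror)")
        # shallow=False forces a content comparison instead of trusting
        # file size and modification time.
        _, mismatch, errors = filecmp.cmpfiles(
            left, right, cmp.common_files, shallow=False
        )
        for name in mismatch + errors:
            problems.append(f"{prefix}{name} (content differs)")
        for sub in cmp.common_dirs:
            walk(left / sub, right / sub, f"{prefix}{sub}/")

    walk(source, mirror, "")
    return sorted(problems)


if __name__ == "__main__":
    # Hypothetical layout: the regional plugin directories are assumed to
    # mirror the path of the primary plugin.
    source = Path("plugins/mixpanel-mcp/skills/experiment-setup")
    for region in ("mixpanel-mcp-eu", "mixpanel-mcp-in"):
        mirror = Path("plugins") / region / "skills" / "experiment-setup"
        print(region, diff_skill_dirs(source, mirror) or "in sync")
```

An empty list means the two trees match exactly; any entry names the file and how it diverged.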

Open follow-ups (from the staging README)

  • Eval fixtures use placeholder customer quotes. The originating bot did not have PRD access; the fixture shape and expected_behavior checklists are real, but the verbatim Pelando / Confetti / Polarsteps quotes need a human pass before merge.
  • Tool-name normalisation. SKILL.md and references/prior-experiments.md reference tool names from MULTI-588 / MULTI-589 (search_prior_experiments, validate_experiment). If those names change before merge, update both.

Test plan

  • Read SKILL.md end-to-end; confirm trigger phrasings don't over-trigger on analyze-experiment requests.
  • Spot-check each reference vs canonical ai/engine/tools/experiments/_guidance/setup.md.
  • Confirm evals/ README and three fixture files are usable as a starting scaffold.
  • Verify the >5% guardrail hard-gate language in references/pitfalls.md matches product's customer-facing copy.

🤖 Generated with Claude Code

Assisted by Claude

elliotrfeinberg and others added 2 commits June 4, 2026 23:43
Authors a single skill covering setup-phase expertise for Mixpanel
experiments: hypothesis framing, metric roles, statistical model,
sizing, advanced features (CUPED / Winsorization / Bonferroni),
XP-vs-FF routing, prior-experiment reuse, and pre-launch pitfalls.
Replaces several per-capability tools per MULTI-581.

Structured for progressive disclosure: the spine lives in SKILL.md,
with deep-dive references for each topic and eval fixtures scaffolded
from PRD customer scenarios (Pelando, Confetti, Polarsteps — real
quotes still need to be filled in).

Synced to mixpanel-mcp-eu and mixpanel-mcp-in via make sync-skills.

Linear: https://linear.app/mixpanel/issue/MULTI-581

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace explicit tool name references (search_prior_experiments,
  validate_experiment, run_pre_launch_checks, Run-Query) with
  agent-agnostic phrasing per the convention from #22. Skills describe
  capabilities ("query the metric", "search prior experiments",
  "the platform's pre-launch validation") rather than specific tool
  calls.
- Drop evals/ directory — this repo doesn't run evals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elliotrfeinberg elliotrfeinberg marked this pull request as ready for review June 5, 2026 01:20
The spine is always loaded; references are lazy. Move spine content
that duplicated reference material out:
- Drop hypothesis examples + 5-item commitment list (in hypothesis-framing.md)
- Drop sizing worked example arithmetic (in sizing.md)
- Collapse Step 4 confidence-level + multiple-testing detail to one
  line each + pointer to statistical-model.md
- Collapse the 3-paragraph "Advanced features" detail to two lines +
  pointer to advanced-features.md
- Drop the enumerated warnings list and edge-cases section (covered
  in pitfalls.md / sizing.md / metric-selection.md)
- Drop the redundant "References" appendix at the bottom (every
  reference is already linked inline from its step)

SKILL.md: 243 → 151 lines. Hypothesis template, metric roles,
sizing formula, statistical-model defaults, pitfall blockers, and
the output summary template all preserved. All 8 references still
linked from the spine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fine, although I have the same concern I raised on #23 about files being duplicated across regions.

Surfaces sibling-skill boundaries (experiment-results for post-launch,
feature-flags for plain rollouts) at routing time. The exclusions
were buried in the body where the routing pass never reaches them.

Sync mixpanel-mcp-eu and mixpanel-mcp-in.

Assisted by Claude