Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ Plugins that give AI agents Mixpanel expertise. Built on the [Agent Skills](http
|---|---|
| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. |
| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. |
| [`experiment-setup`](plugins/mixpanel-mcp/skills/experiment-setup/) | Coaches an experimenter through designing a Mixpanel experiment before launch — hypothesis framing, metric roles, statistical model, sizing, advanced features (CUPED / Winsorization / Bonferroni), and pitfall avoidance. |
| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. |
| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. |

Expand All @@ -30,21 +31,23 @@ claude plugin marketplace add mixpanel/ai-plugins
2. Install the plugin for your region:

**US**

```bash
claude plugin install mixpanel-mcp
```

**EU**

```bash
claude plugin install mixpanel-mcp-eu
```

**India**

```bash
claude plugin install mixpanel-mcp-in
```


### Cursor

Install the plugin from the Cursor marketplace, or have a team admin import this GitHub repository as a team marketplace (Dashboard → Settings → Plugins → Import).
Expand Down
151 changes: 151 additions & 0 deletions plugins/mixpanel-mcp-eu/skills/experiment-setup/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
---
name: experiment-setup
description: "Coach an experimenter through designing a Mixpanel experiment before launch — hypothesis framing, metric roles, statistical model, sizing, advanced features (CUPED / Winsorization / Bonferroni), and pitfall avoidance. Use when the user wants to set up, configure, design, plan, or sanity-check a new A/B test, feature-flag experiment, or growth experiment. Also trigger on phrasings like 'help me set up an experiment', 'design an A/B test', 'should this be sequential or fixed', 'what MDE can I detect', 'how long should this run', 'is my experiment configured correctly', 'pre-launch checklist', 'should I use CUPED / Winsorization / Bonferroni', 'is this an experiment or just a feature flag', or when the user names a specific feature they want to test. Do NOT use for post-launch results analysis ('how did experiment X do?', 'should we ship?', 'why is SRM failing?') — that belongs to the `experiment-results` skill. Do NOT use for plain feature-flag rollouts with no measurement criterion — that belongs to the `feature-flags` skill."
license: Apache-2.0
---

# Experiment Setup

Coach the user through designing a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, and the sample size + traffic dictate duration and testing model. Reach into `references/` only when a step needs depth.

## Requirements

- Access to Mixpanel (event schema, run queries, create experiments and feature flags).
- Access to a prior-experiments lookup when one is available — the skill works without it, but degrades gracefully and tells the user what it skipped.

## When to use this skill

Trigger on any of:

- "Set up / design / configure / plan an experiment on `<feature>`."
- "Help me write a hypothesis."
- "What MDE can I detect with my current traffic?"
- "Should this be sequential or fixed-horizon?"
- "Should I enable CUPED / Winsorization / Bonferroni?"
- "How long should this experiment run?"
- "Is this an experiment or should I just ship a feature flag?"
- "Sanity-check / pre-launch / pitfall-check this experiment configuration."

Do **not** trigger for post-launch analysis ("how did experiment X do?") — that's the `experiment-results` skill.

---

## Pre-flight: route and check for prior work

**Route XP vs FF before designing.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill-switch, or per-segment gating with no decision criterion → feature flag (route to the `feature-flags` skill). If ambiguous, ask once: "Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?" Deeper disambiguation in [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md).

**Always search for prior experiments on the same feature first** (by keyword from the feature name, when the lookup is available). Surface anything you find — re-running settled questions wastes traffic, and prior baseline/variance numbers sharpen the new MDE. See [references/prior-experiments.md](references/prior-experiments.md) for the fold-in playbook.

---

## Workflow: 4 steps

Run in order. Each step's output is the next step's input.

### Step 1 — Write the hypothesis

A good hypothesis is a **falsifiable directional claim with a stated mechanism**:

> **If** `<change>`, **then** `<measurable outcome>` will `<direction>`, **because** `<mechanism>`.

If vague, hold the user to four commitments: the change, the primary metric, the direction, and the smallest effect worth shipping (the MDE). The "because" forces them to check whether the metric they picked is actually downstream of the change. Deeper rubric and misalignment patterns in [references/hypothesis-framing.md](references/hypothesis-framing.md).

### Step 2 — Pick metrics that test the hypothesis

Each metric serves one role:

- **Primary (1–3 max).** Decides ship/no-ship. Comes from the hypothesis's outcome clause. Each additional primary inflates the false-positive rate.
- **Guardrail (0+, strongly recommended).** Must not regress. A >5% relative regression on any guardrail blocks ship even if the primary wins.
- **Secondary (0+).** Diagnostic only. Never decisional.

Every primary and guardrail needs an explicit `direction` (`"up"` or `"down"`). The default `"up"` is wrong for cancel / error / latency / abandon metrics — leaving it default silently flips polarity at interpretation. Watch for **lagging-metric / window mismatch** (30-day retention as primary on a 2-week experiment) and the **changed-denominator** trap (metric defined only over treatment-exposed users). Full sanity checklist in [references/metric-selection.md](references/metric-selection.md).

### Step 3 — Size the experiment with historical data

Pull baseline rate, variance, and daily traffic from Mixpanel; don't guess. The standard formula (two-sample, two-sided, 95% confidence, 80% power):

```
n = 16 × σ² / d² (per variant; Bernoulli σ² = p(1−p))
```

Inverted for traffic-bound teams — the smallest detectable effect at your traffic:

```
MDE = 4σ / √n
```

If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface this immediately (winner's curse, etc.). Sample-size floor: never below ~350–400 per variant (CLT breaks down, SRM check gets noisy). Worked examples, baseline-lookup table, and the five remediations for underpowered experiments are in [references/sizing.md](references/sizing.md).

### Step 4 — Pick testing model + end condition

**Default to `sequential`** for most users. Peeking is the most common customer mistake; sequential makes early-look safe. Override to `frequentist` for small-lift hunts on well-sized experiments, or when the team needs t-test familiarity.

**End condition.** `sample_size` when daily traffic is variable; `days` when the primary metric has strong weekly seasonality. Frequentist + days is supported — don't flag it.

**Confidence level.** Default 0.95. Bump to 0.99 only for irreversible high-stakes ships; drop to 0.90 only for exploratory low-stakes tests (and tell the user the family-wise FPR is inflated).

**Multiple-testing correction.** Auto-needed when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. Default to `"benjamini-hochberg"`; use `"bonferroni"` for strict family-wise control (regulatory). Without correction the family-wise FPR climbs fast — 5 primaries × 3 variants → ~54%.

Full decision tree and worked numbers in [references/statistical-model.md](references/statistical-model.md).

---

## Advanced features (rationale: [references/advanced-features.md](references/advanced-features.md))

- **CUPED** — variance reduction. Enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history exists. Do not enable on new-user-only experiments or one-time-event metrics.
- **Winsorization** — caps extreme values. Enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli metrics. Push back if `percentile < 80`.

## Pre-launch pitfall check

Before the user creates the experiment, run the pitfall catalogue in [references/pitfalls.md](references/pitfalls.md). Surface only what fires on the current config; order blockers → warnings → fyi.

Two **blockers** that should stop launch:

- `underpowered_duration_insufficient` — expected exposures < 50% of per-arm sample size for the configured MDE.
- `cohort_too_small` — eligible cohort < `num_arms × target_sample_size`.

The **>5% guardrail hard-gate** rationale: a 5% relative regression on any guardrail blocks ship even if the primary wins. A winning primary with a regressing guardrail trades headline lift for damage to something the team explicitly said must not regress — not a ship.

---

## Output

Present a compact summary the user confirms before you create the experiment:

```
*Experiment Setup Summary*

• *Hypothesis:* If <change>, then <metric> will <direction> by ≥<MDE>, because <mechanism>.
• *Primary metrics:* <name> (direction: up/down), …
• *Guardrails:* <name> (direction: …), …
• *Variants:* control 50% / treatment 50% (or as configured)
• *Statistical model:* sequential | frequentist
• *End condition:* sample_size (per-arm <N>) | days (<N> days)
• *Confidence level:* 0.95
• *Multiple testing correction:* benjamini-hochberg | bonferroni | off
• *Advanced features:* CUPED on/off · Winsorization on/off (percentile <P>)
• *Expected duration on current traffic:* <D> days
• *Achievable MDE on current traffic:* <X>% relative

*Pitfall check:*
✅ Underpowered duration — adequate
✅ Cohort size — adequate
⚠️ <pitfall name> — <short explanation>
```

Wait for explicit confirmation before creating the experiment.

## Writing style

- Lead with the hypothesis. Every other decision flows from it.
- Use concrete numbers from real data ("baseline 4.2%, σ² = 0.040, required n ≈ 6,400/arm"), not vague guidance.
- Quote the user's MDE and metric names back so they catch typos.
- When underpowered, say so plainly and list remediations in order of cost.
- Don't moralise about peeking — switch them to sequential.
- Guardrail regressions are hard gates, not "slight concerns."

## Related skills

- `experiment-results` — post-launch analysis. Use after the experiment ships.
- `feature-flags` — pure rollout / kill-switch / gating without measurement.
- `create-dashboard` — live monitoring of primary + guardrail metrics during the experiment.
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
# Advanced features

Three optional settings most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong.

## CUPED — variance reduction

**What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample.

**Setting:** `settings.cuped.enabled = true`, with `settings.cuped.preExposureDatePreset` choosing the pre-exposure window.

### When to enable

- The primary metric correlates with users' pre-exposure behaviour on the same metric. Strong correlations: revenue, engagement (events per user), retention, time-on-platform. Weak correlations: anything one-time or onboarding-specific.
- **All experiment users existed before the experiment start** — i.e., not a new-user-only cohort. CUPED needs a pre-exposure observation period; new users don't have one.
- A 2–4 week pre-exposure window is available with stable behaviour. If the metric was launched 5 days ago, CUPED has nothing to read.

### When NOT to enable

- New-user-only experiments. No pre-exposure data exists. CUPED gives zero variance reduction and adds noise.
- Brand-new metrics without historical data.
- Metrics where pre-exposure behaviour is not predictive of post-exposure (e.g., one-time onboarding events: the user either did or didn't complete onboarding once; pre-exposure has nothing to say about it).
- Pre-exposure window short enough that the behaviour you'd "control for" is itself a transient spike (e.g., metric just had a viral moment last week).

### Pre-exposure window presets

- `"2-weeks"` — fast-moving metrics with no strong weekly seasonality.
- `"4-weeks"` — most metrics with weekly seasonality (default sweet spot).
- `"60-days"` — deeply seasonal metrics like spend.
- `"90-days"` — long-cycle metrics (renewal-driven revenue, etc.).

### What changes downstream

- Required sample size shrinks by the variance-reduction factor. A 50% variance reduction on a primary that needed 60k per arm shrinks the target to ~30k per arm.
- The point estimate of the lift is unchanged. CUPED is a variance-reduction technique, not a bias correction; the headline lift is the same, the confidence interval is narrower.
- The post-launch interpretation step needs to know CUPED was on, because the standard error formula differs. The setting is persisted on the experiment object; the interpretation step reads it automatically.

## Winsorization — outlier handling

**What it does.** Caps extreme values at a percentile boundary (default 95th, i.e. cap the top 5% and bottom 5% at the 95th and 5th percentile values respectively). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean.

**Setting:** `settings.winsorization.enabled = true`, with `settings.winsorization.percentile` choosing the cap point.

### When to enable

- Revenue or spend metrics with whales (one customer spends 100× the median; that customer assigned to treatment is enough to swing the headline).
- Time-on-page or session-duration metrics with users who fall asleep on the page (one session at 8 hours dwarfs 10,000 sessions at 30 seconds).
- Any Gaussian-distributed metric with a heavy right tail (count metrics, event volume per user, page view counts).

### When NOT to enable

- Bernoulli (conversion) metrics. Capping a 0/1 outcome is meaningless; the 95th percentile of a 0/1 distribution is also 0 or 1.
- Metrics where the tail behaviour **is** the hypothesis. If the test is "did this change move whale spending?", Winsorization throws away exactly the signal you're testing for.
- Metrics already winsorized upstream (in the metric definition / data pipeline) — double-winsorization adds nothing.

### Percentile guidance

Default is 95 (cap top/bottom 5%). This is almost always right. Push back if the user sets `percentile < 80` — that's >20% of values being capped, which throws away too much signal. Confirm intent before launching.

For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. 95 is the default for a reason.

### What changes downstream

- Variance on the affected metric drops, often substantially. Required sample size shrinks accordingly.
- The point estimate of the mean shifts toward the centre of the distribution. This is the desired behaviour; the whole point is to stop a few outliers from anchoring the estimate.
- The post-launch interpretation step reports the winsorized mean and standard error. If the team also wants to know what the un-winsorized mean did (the "did whales react?" question), they'd need a separate secondary metric without Winsorization.

## Multiple testing correction — Bonferroni vs Benjamini-Hochberg

Covered in detail in `references/statistical-model.md`. The short version:

- Enable when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`.
- Default to `"benjamini-hochberg"`. More powerful with correlated primaries.
- Use `"bonferroni"` when family-wise error control is required (regulatory, etc.) or when the primaries are independent.
- Set `"off"` only with a single primary and a single non-control variant.

## Decision flowchart

```
Primary metric is Bernoulli (conversion rate)?
├── Yes → Winsorization OFF.
│ Does it correlate with pre-exposure behavior of existing users?
│ ├── Yes → CUPED ON (if 2-4 week pre-exposure window available, no new-user cohort)
│ └── No → CUPED OFF
└── No (continuous / count / retention)
Heavy-tailed distribution with outliers (revenue, time-on-page, session length)?
├── Yes → Winsorization ON (default percentile = 95)
└── No → Winsorization OFF
Does it correlate with pre-exposure behavior of existing users?
├── Yes → CUPED ON (if 2-4 week pre-exposure window available, no new-user cohort)
└── No → CUPED OFF

Primary count ≥ 2 OR non-control variants ≥ 2?
├── Yes → Multiple testing correction ON ("benjamini-hochberg" default; "bonferroni" for strict family-wise control)
└── No → Multiple testing correction OFF
```

## Common misconfigurations

- ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test.
- ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse.
- ⛔ **Winsorization at percentile < 80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent.
- ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise.
- ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise.
Loading