diff --git a/README.md b/README.md index 17a8229..ec4650c 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,7 @@ Plugins that give AI agents Mixpanel expertise. Built on the [Agent Skills](http |---|---| | [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. | | [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. | +| [`experiment-setup`](plugins/mixpanel-mcp/skills/experiment-setup/) | Coaches an experimenter through designing a Mixpanel experiment before launch — hypothesis framing, metric roles, statistical model, sizing, advanced features (CUPED / Winsorization / Bonferroni), and pitfall avoidance. | | [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. | | [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. | @@ -30,21 +31,23 @@ claude plugin marketplace add mixpanel/ai-plugins 2. Install the plugin for your region: **US** + ```bash claude plugin install mixpanel-mcp ``` **EU** + ```bash claude plugin install mixpanel-mcp-eu ``` **India** + ```bash claude plugin install mixpanel-mcp-in ``` - ### Cursor Install the plugin from the Cursor marketplace, or have a team admin import this GitHub repository as a team marketplace (Dashboard → Settings → Plugins → Import). diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-setup/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-setup/SKILL.md new file mode 100644 index 0000000..7de583c --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-setup/SKILL.md @@ -0,0 +1,151 @@ +--- +name: experiment-setup +description: "Coach an experimenter through designing a Mixpanel experiment before launch — hypothesis framing, metric roles, statistical model, sizing, advanced features (CUPED / Winsorization / Bonferroni), and pitfall avoidance. Use when the user wants to set up, configure, design, plan, or sanity-check a new A/B test, feature-flag experiment, or growth experiment. Also trigger on phrasings like 'help me set up an experiment', 'design an A/B test', 'should this be sequential or fixed', 'what MDE can I detect', 'how long should this run', 'is my experiment configured correctly', 'pre-launch checklist', 'should I use CUPED / Winsorization / Bonferroni', 'is this an experiment or just a feature flag', or when the user names a specific feature they want to test. Do NOT use for post-launch results analysis ('how did experiment X do?', 'should we ship?', 'why is SRM failing?') — that belongs to the `experiment-results` skill. Do NOT use for plain feature-flag rollouts with no measurement criterion — that belongs to the `feature-flags` skill." +license: Apache-2.0 +--- + +# Experiment Setup + +Coach the user through designing a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, and the sample size + traffic dictate duration and testing model. Reach into `references/` only when a step needs depth. + +## Requirements + +- Access to Mixpanel (event schema, run queries, create experiments and feature flags). +- Access to a prior-experiments lookup when one is available — the skill works without it, but degrades gracefully and tells the user what it skipped. + +## When to use this skill + +Trigger on any of: + +- "Set up / design / configure / plan an experiment on ``." +- "Help me write a hypothesis." +- "What MDE can I detect with my current traffic?" +- "Should this be sequential or fixed-horizon?" +- "Should I enable CUPED / Winsorization / Bonferroni?" +- "How long should this experiment run?" +- "Is this an experiment or should I just ship a feature flag?" +- "Sanity-check / pre-launch / pitfall-check this experiment configuration." + +Do **not** trigger for post-launch analysis ("how did experiment X do?") — that's the `experiment-results` skill. + +--- + +## Pre-flight: route and check for prior work + +**Route XP vs FF before designing.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill-switch, or per-segment gating with no decision criterion → feature flag (route to the `feature-flags` skill). If ambiguous, ask once: "Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?" Deeper disambiguation in [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md). + +**Always search for prior experiments on the same feature first** (by keyword from the feature name, when the lookup is available). Surface anything you find — re-running settled questions wastes traffic, and prior baseline/variance numbers sharpen the new MDE. See [references/prior-experiments.md](references/prior-experiments.md) for the fold-in playbook. + +--- + +## Workflow: 4 steps + +Run in order. Each step's output is the next step's input. + +### Step 1 — Write the hypothesis + +A good hypothesis is a **falsifiable directional claim with a stated mechanism**: + +> **If** ``, **then** `` will ``, **because** ``. + +If vague, hold the user to four commitments: the change, the primary metric, the direction, and the smallest effect worth shipping (the MDE). The "because" forces them to check whether the metric they picked is actually downstream of the change. Deeper rubric and misalignment patterns in [references/hypothesis-framing.md](references/hypothesis-framing.md). + +### Step 2 — Pick metrics that test the hypothesis + +Each metric serves one role: + +- **Primary (1–3 max).** Decides ship/no-ship. Comes from the hypothesis's outcome clause. Each additional primary inflates the false-positive rate. +- **Guardrail (0+, strongly recommended).** Must not regress. A >5% relative regression on any guardrail blocks ship even if the primary wins. +- **Secondary (0+).** Diagnostic only. Never decisional. + +Every primary and guardrail needs an explicit `direction` (`"up"` or `"down"`). The default `"up"` is wrong for cancel / error / latency / abandon metrics — leaving it default silently flips polarity at interpretation. Watch for **lagging-metric / window mismatch** (30-day retention as primary on a 2-week experiment) and the **changed-denominator** trap (metric defined only over treatment-exposed users). Full sanity checklist in [references/metric-selection.md](references/metric-selection.md). + +### Step 3 — Size the experiment with historical data + +Pull baseline rate, variance, and daily traffic from Mixpanel; don't guess. The standard formula (two-sample, two-sided, 95% confidence, 80% power): + +``` +n = 16 × σ² / d² (per variant; Bernoulli σ² = p(1−p)) +``` + +Inverted for traffic-bound teams — the smallest detectable effect at your traffic: + +``` +MDE = 4σ / √n +``` + +If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface this immediately (winner's curse, etc.). Sample-size floor: never below ~350–400 per variant (CLT breaks down, SRM check gets noisy). Worked examples, baseline-lookup table, and the five remediations for underpowered experiments are in [references/sizing.md](references/sizing.md). + +### Step 4 — Pick testing model + end condition + +**Default to `sequential`** for most users. Peeking is the most common customer mistake; sequential makes early-look safe. Override to `frequentist` for small-lift hunts on well-sized experiments, or when the team needs t-test familiarity. + +**End condition.** `sample_size` when daily traffic is variable; `days` when the primary metric has strong weekly seasonality. Frequentist + days is supported — don't flag it. + +**Confidence level.** Default 0.95. Bump to 0.99 only for irreversible high-stakes ships; drop to 0.90 only for exploratory low-stakes tests (and tell the user the family-wise FPR is inflated). + +**Multiple-testing correction.** Auto-needed when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. Default to `"benjamini-hochberg"`; use `"bonferroni"` for strict family-wise control (regulatory). Without correction the family-wise FPR climbs fast — 5 primaries × 3 variants → ~54%. + +Full decision tree and worked numbers in [references/statistical-model.md](references/statistical-model.md). + +--- + +## Advanced features (rationale: [references/advanced-features.md](references/advanced-features.md)) + +- **CUPED** — variance reduction. Enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history exists. Do not enable on new-user-only experiments or one-time-event metrics. +- **Winsorization** — caps extreme values. Enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli metrics. Push back if `percentile < 80`. + +## Pre-launch pitfall check + +Before the user creates the experiment, run the pitfall catalogue in [references/pitfalls.md](references/pitfalls.md). Surface only what fires on the current config; order blockers → warnings → fyi. + +Two **blockers** that should stop launch: + +- `underpowered_duration_insufficient` — expected exposures < 50% of per-arm sample size for the configured MDE. +- `cohort_too_small` — eligible cohort < `num_arms × target_sample_size`. + +The **>5% guardrail hard-gate** rationale: a 5% relative regression on any guardrail blocks ship even if the primary wins. A winning primary with a regressing guardrail trades headline lift for damage to something the team explicitly said must not regress — not a ship. + +--- + +## Output + +Present a compact summary the user confirms before you create the experiment: + +``` +*Experiment Setup Summary* + +• *Hypothesis:* If , then will by ≥, because . +• *Primary metrics:* (direction: up/down), … +• *Guardrails:* (direction: …), … +• *Variants:* control 50% / treatment 50% (or as configured) +• *Statistical model:* sequential | frequentist +• *End condition:* sample_size (per-arm ) | days ( days) +• *Confidence level:* 0.95 +• *Multiple testing correction:* benjamini-hochberg | bonferroni | off +• *Advanced features:* CUPED on/off · Winsorization on/off (percentile

) +• *Expected duration on current traffic:* days +• *Achievable MDE on current traffic:* % relative + +*Pitfall check:* +✅ Underpowered duration — adequate +✅ Cohort size — adequate +⚠️ +``` + +Wait for explicit confirmation before creating the experiment. + +## Writing style + +- Lead with the hypothesis. Every other decision flows from it. +- Use concrete numbers from real data ("baseline 4.2%, σ² = 0.040, required n ≈ 6,400/arm"), not vague guidance. +- Quote the user's MDE and metric names back so they catch typos. +- When underpowered, say so plainly and list remediations in order of cost. +- Don't moralise about peeking — switch them to sequential. +- Guardrail regressions are hard gates, not "slight concerns." + +## Related skills + +- `experiment-results` — post-launch analysis. Use after the experiment ships. +- `feature-flags` — pure rollout / kill-switch / gating without measurement. +- `create-dashboard` — live monitoring of primary + guardrail metrics during the experiment. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/advanced-features.md b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/advanced-features.md new file mode 100644 index 0000000..6c9ef41 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/advanced-features.md @@ -0,0 +1,103 @@ +# Advanced features + +Three optional settings most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong. + +## CUPED — variance reduction + +**What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample. + +**Setting:** `settings.cuped.enabled = true`, with `settings.cuped.preExposureDatePreset` choosing the pre-exposure window. + +### When to enable + +- The primary metric correlates with users' pre-exposure behaviour on the same metric. Strong correlations: revenue, engagement (events per user), retention, time-on-platform. Weak correlations: anything one-time or onboarding-specific. +- **All experiment users existed before the experiment start** — i.e., not a new-user-only cohort. CUPED needs a pre-exposure observation period; new users don't have one. +- A 2–4 week pre-exposure window is available with stable behaviour. If the metric was launched 5 days ago, CUPED has nothing to read. + +### When NOT to enable + +- New-user-only experiments. No pre-exposure data exists. CUPED gives zero variance reduction and adds noise. +- Brand-new metrics without historical data. +- Metrics where pre-exposure behaviour is not predictive of post-exposure (e.g., one-time onboarding events: the user either did or didn't complete onboarding once; pre-exposure has nothing to say about it). +- Pre-exposure window short enough that the behaviour you'd "control for" is itself a transient spike (e.g., metric just had a viral moment last week). + +### Pre-exposure window presets + +- `"2-weeks"` — fast-moving metrics with no strong weekly seasonality. +- `"4-weeks"` — most metrics with weekly seasonality (default sweet spot). +- `"60-days"` — deeply seasonal metrics like spend. +- `"90-days"` — long-cycle metrics (renewal-driven revenue, etc.). + +### What changes downstream + +- Required sample size shrinks by the variance-reduction factor. A 50% variance reduction on a primary that needed 60k per arm shrinks the target to ~30k per arm. +- The point estimate of the lift is unchanged. CUPED is a variance-reduction technique, not a bias correction; the headline lift is the same, the confidence interval is narrower. +- The post-launch interpretation step needs to know CUPED was on, because the standard error formula differs. The setting is persisted on the experiment object; the interpretation step reads it automatically. + +## Winsorization — outlier handling + +**What it does.** Caps extreme values at a percentile boundary (default 95th, i.e. cap the top 5% and bottom 5% at the 95th and 5th percentile values respectively). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. + +**Setting:** `settings.winsorization.enabled = true`, with `settings.winsorization.percentile` choosing the cap point. + +### When to enable + +- Revenue or spend metrics with whales (one customer spends 100× the median; that customer assigned to treatment is enough to swing the headline). +- Time-on-page or session-duration metrics with users who fall asleep on the page (one session at 8 hours dwarfs 10,000 sessions at 30 seconds). +- Any Gaussian-distributed metric with a heavy right tail (count metrics, event volume per user, page view counts). + +### When NOT to enable + +- Bernoulli (conversion) metrics. Capping a 0/1 outcome is meaningless; the 95th percentile of a 0/1 distribution is also 0 or 1. +- Metrics where the tail behaviour **is** the hypothesis. If the test is "did this change move whale spending?", Winsorization throws away exactly the signal you're testing for. +- Metrics already winsorized upstream (in the metric definition / data pipeline) — double-winsorization adds nothing. + +### Percentile guidance + +Default is 95 (cap top/bottom 5%). This is almost always right. Push back if the user sets `percentile < 80` — that's >20% of values being capped, which throws away too much signal. Confirm intent before launching. + +For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. 95 is the default for a reason. + +### What changes downstream + +- Variance on the affected metric drops, often substantially. Required sample size shrinks accordingly. +- The point estimate of the mean shifts toward the centre of the distribution. This is the desired behaviour; the whole point is to stop a few outliers from anchoring the estimate. +- The post-launch interpretation step reports the winsorized mean and standard error. If the team also wants to know what the un-winsorized mean did (the "did whales react?" question), they'd need a separate secondary metric without Winsorization. + +## Multiple testing correction — Bonferroni vs Benjamini-Hochberg + +Covered in detail in `references/statistical-model.md`. The short version: + +- Enable when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. +- Default to `"benjamini-hochberg"`. More powerful with correlated primaries. +- Use `"bonferroni"` when family-wise error control is required (regulatory, etc.) or when the primaries are independent. +- Set `"off"` only with a single primary and a single non-control variant. + +## Decision flowchart + +``` +Primary metric is Bernoulli (conversion rate)? +├── Yes → Winsorization OFF. +│ Does it correlate with pre-exposure behavior of existing users? +│ ├── Yes → CUPED ON (if 2-4 week pre-exposure window available, no new-user cohort) +│ └── No → CUPED OFF +└── No (continuous / count / retention) + Heavy-tailed distribution with outliers (revenue, time-on-page, session length)? + ├── Yes → Winsorization ON (default percentile = 95) + └── No → Winsorization OFF + Does it correlate with pre-exposure behavior of existing users? + ├── Yes → CUPED ON (if 2-4 week pre-exposure window available, no new-user cohort) + └── No → CUPED OFF + +Primary count ≥ 2 OR non-control variants ≥ 2? +├── Yes → Multiple testing correction ON ("benjamini-hochberg" default; "bonferroni" for strict family-wise control) +└── No → Multiple testing correction OFF +``` + +## Common misconfigurations + +- ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. +- ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. +- ⛔ **Winsorization at percentile < 80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent. +- ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. +- ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/hypothesis-framing.md b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/hypothesis-framing.md new file mode 100644 index 0000000..57bbd73 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/hypothesis-framing.md @@ -0,0 +1,101 @@ +# Hypothesis framing + +A good experiment hypothesis is a **falsifiable, directional claim with a stated mechanism, bounded in time**. All four properties matter — drop any one and the design downstream silently degrades. + +## The shape + +> **If** ``, **then** `` will ``, **because** ``. + +| Property | Test | Failure mode | +| ------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| **Falsifiable** | Could the data say "no"? | "Improving UX" can't be falsified. "Increasing weekly retention by ≥2pp" can. | +| **Directional** | Is the predicted change up or down? | "Affecting cart size" leaves the polarity ambiguous; the system defaults to `direction: "up"` and the interpretation step misreads regressions as wins. | +| **Mechanistic** | What's the proposed causal chain? | "Because users will see X and decide Y" is a mechanism. "We think it'll work" is not. Without a mechanism, the team can't tell when the metric they picked is actually downstream of the change. | +| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric guarantees an inconclusive result on the real effect plus a high chance of reaching false significance from noise. | + +## When the user gives you a one-liner + +Ask them to commit to five things, in order. Don't proceed until you have all five. + +1. **The change** — what's different in treatment. A specific UI string, a routing change, a price, a copy variant. Vague ("the new onboarding") is not enough; "the new onboarding which moves the free-item offer to step 1" is. +2. **The primary outcome metric** — one specific event or rate, not a domain. "Engagement" is not a metric; "weekly active users with ≥1 report created" is. +3. **The expected direction** — up or down. (Goes straight into the metric's `direction` field.) +4. **The minimum effect size that would justify shipping** — this becomes the MDE. If the user can't name one, ask: "If the lift turned out to be 0.5%, would you ship?" Their answer reveals the MDE. +5. **The mechanism** — why you expect this to work. The mechanism is what binds the metric to the change. A change to onboarding screens shouldn't be measured by Day-30 retention if no one has gotten to Day 30 yet — the mechanism would say so explicitly. + +## Mechanism → metric class + +The mechanism predicts the _kind_ of metric that should move. Use this mapping as a sanity check: + +| Mechanism flavour | Likely primary-metric class | Anti-pattern | +| -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -------------------------------------------------- | +| Reduces friction at a specific step | Step conversion rate (funnel-typed) | Headline retention metric | +| Surfaces a new option / increases discoverability | Click-through or first-use rate on the surfaced option (conversion) | Total events per user | +| Reorders information / changes salience | Time-to-task, completion rate on the salient step | Account-level revenue | +| Changes the cost of an action (price, paywall, friction) | Conversion-to-paid, refund rate, cancel rate (with `direction: "down"`) | DAU | +| Adds a new content / recommendation system | CTR on recommendations, downstream conversion | Aggregate engagement | +| Long-term retention play (referrals, loyalty) | Day-7 or Week-1 retention as leading proxy; lagging Day-30 stays a post-launch monitor, not a primary | Day-30 retention as primary on a 2-week experiment | + +When the user's mechanism and proposed metric live on different rows of this table, push back — that's the **hypothesis ↔ metric mismatch** pitfall. + +## Hypothesis ↔ metric alignment + +A hypothesis names a specific outcome. The primary metric must measure that outcome — **same population, same denominator, same timeframe**. Common misalignments: + +- Hypothesis predicts a **rate** change; primary metric is a **count** → switch to a rate metric, or use an exposure-rebalanced total. +- Hypothesis predicts effect on **paid users**; primary metric includes free users → add a cohort filter or scope the metric. +- Hypothesis predicts effect **within session**; primary metric is **per-user across sessions** → either narrow the metric or broaden the hypothesis. +- Hypothesis predicts effect **only on a new flow**; primary metric counts events that exist only in treatment → changed-denominator. The lift is artificially infinite. Pick a metric that exists for both arms. + +## When to push back + +Push back hard when: + +- The hypothesis is non-falsifiable. Until it can be tested with a yes/no answer from data, there's nothing to set up. +- The hypothesis is non-directional. The system's `direction: "up"` default is wrong for cancel / error / latency / abandon metrics; leaving it default silently flips polarity at interpretation time. +- The mechanism doesn't predict the proposed metric. Most "experiment didn't work because we measured the wrong thing" post-mortems trace back to here. +- The proposed primary is strongly lagging on the planned duration (retention as primary on a 2-week test). Suggest a leading proxy. + +When you push back, do it once with concrete language ("you said 'improve engagement' — which event do you want to move?"). If the user genuinely wants to leave the hypothesis vague, you can proceed, but log the vagueness in `description` so the post-launch step knows the test was exploratory rather than decisional. + +## Worked examples + +### ✅ Good + +> If we surface a free-item offer during onboarding step 2, then signup→activation conversion will increase by ≥3pp (currently 18%), because reducing first-action friction lowers cold-start dropout for new accounts. + +- Falsifiable: data can say "no, lift was <3pp." +- Directional: up. +- Mechanistic: first-action friction → cold-start dropout. +- Time-bounded: signup→activation is a within-session metric; readable inside any reasonable test duration. +- Mechanism predicts a conversion-class primary; signup→activation conversion fits. + +### ✅ Good (lagging hypothesis, leading proxy primary) + +> If we ship the new referral flow, then Day-30 retention will increase by ≥1.5pp, because referred users have stronger network effects. We will measure Day-7 retention as the experiment primary (historical correlation r=0.78 with Day-30) and keep Day-30 as a post-launch monitor. + +- Bounded-in-time problem is acknowledged and solved with a leading proxy. The lagging metric remains a post-launch check, not a ship gate. + +### ❌ Vague + +> Test the new onboarding. + +- No change description (which change? full redesign or one screen?). +- No outcome. +- No direction. +- No MDE. +- No mechanism. + +Coach: pull each of the five commitments out of the user before going further. + +### ❌ Non-falsifiable + +> The new dashboard will improve the user experience. + +- "Improve user experience" can't be tested. Ask: "Which specific behaviour changes if user experience is better? Engagement events per session? Time to first chart? Dashboards saved per user?" + +### ❌ Mechanism doesn't predict the metric + +> If we change the colour of the CTA button, then 30-day retention will increase by ≥2pp, because users will perceive the product as more polished. + +- Mechanism is plausible at best, but Day-30 retention is far downstream of a button-colour change. Even if the colour change does help, a 2-week experiment won't measure it. Either pick a leading proxy (click-through on the CTA) or shelf the test until you have a more credible mechanism for retention. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/metric-selection.md b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/metric-selection.md new file mode 100644 index 0000000..8ce7ea0 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/metric-selection.md @@ -0,0 +1,75 @@ +# Metric selection + +Each metric on an experiment serves exactly one of three roles. The hypothesis tells you which. + +## Primary metrics (1–3 max) + +The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. + +- **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. +- **Explicit `direction`.** Every primary needs `direction: "up"` or `direction: "down"`. The system defaults to `"up"`, which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: + - Onboarding-screen change → activation in the first session, not Week-4 retention. + - Checkout button A/B → checkout conversion, not 30-day LTV. + - Pricing-page tweak → click-through and trial start, not annualised revenue. + - When the only metric the team cares about is lagging, use a **leading proxy** with a known historical correlation to the lagging metric. The lagging metric stays a post-launch monitor, not a ship gate. +- **Prefer rates over counts** when the hypothesis is about behaviour change. "Conversion rate" is interpretable; "total conversions" conflates per-user behaviour with cohort size. + +If the user proposes a primary, sanity-check: + +- _Is this metric downstream of the change?_ (A pricing change cannot move "tutorial completion".) +- _Does the metric exist for both control and treatment users?_ If the change creates new events that don't exist in control, lift is artificially infinite (changed-denominator). +- _Is the metric's response window shorter than the experiment's duration?_ If not, the metric is lagging — pick a leading proxy. +- _Does the metric have enough volume to detect the expected lift?_ (Cross-reference `references/sizing.md`.) + +## Guardrail metrics (0+, strongly recommended) + +Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate**, and it's the most important single rule in the pitfall catalogue. + +Standard guardrails by domain — pick at least one from the row that matches the change: + +| Change targets… | Guardrail candidates | +| ------------------------------------ | ------------------------------------------------------- | +| Performance / UI / new client code | Page load time, API latency, error rate, crash rate | +| Engagement / activation / onboarding | Weekly active users, session count, Day-7 retention | +| Revenue / monetisation / pricing | ARPU, conversion-to-paid, refund rate, cancel rate | +| Trust / safety / moderation | Complaint rate, unsubscribe rate, support-ticket volume | +| Time-to-task / search / IA | Task abandonment rate, time-to-completion | + +For every guardrail, **set `direction` explicitly**. A guardrail named "errors" with default `direction: "up"` will silently let regressions slip through interpretation as "wins." + +Same lagging-indicator rule applies: a guardrail that takes 30 days to react can't protect a 2-week experiment. If the user names retention or LTV as a guardrail on a short experiment, recommend a leading proxy (Day-1 or Day-7 retention) and demote the lagging metric to a post-launch monitor. + +## Secondary metrics (0+, diagnostic only) + +Metrics for understanding **why** the primary moved, not for the ship decision. Examples: funnel-step completions, feature sub-use rates, time-on-screen, exploratory cohort breakdowns. + +**Secondary metrics are not decisional.** Even if the user names a secondary in their hypothesis text, they cannot ship/kill on its result. If a metric matters for the decision, it must be primary or guardrail. + +> **Setup misconfiguration to flag.** If the user's hypothesis text names a metric that they then classify as secondary, ask: +> _"You mentioned `` in your hypothesis. Should this be a primary metric? Secondary metrics don't influence ship/no-ship decisions, so if it matters for the outcome, promote it."_ + +This is the `hypothesis_metric_mismatch` pitfall in pre-launch detection — see `references/pitfalls.md`. + +## Sanity checklist + +Run this before locking the metric set: + +- [ ] Each primary directly measures the hypothesis's predicted outcome. +- [ ] Each primary has explicit `direction` (no `null`). +- [ ] At least one guardrail covers the most likely failure mode of the change (perf for UI changes, retention for monetisation changes, etc.). +- [ ] Each guardrail has explicit `direction`. +- [ ] No metric whose denominator is created by the treatment itself (changed-denominator). +- [ ] No primary or guardrail is a strong lagging indicator on the planned experiment duration (use leading proxies; demote lagging metrics to post-launch monitors). +- [ ] Total primary count ≤ 3. +- [ ] If primary count ≥ 2 OR non-control variants ≥ 2, multiple-testing correction is on (`benjamini-hochberg` default, `bonferroni` for strict family-wise control). +- [ ] For each primary, baseline rate has been pulled from real data (not guessed). + +## Anti-patterns + +- ⛔ **No guardrails to "avoid noise."** Guardrails are the regression detection, not noise. Without them, a winning primary with a quietly regressing latency or refund-rate is a ship — and then a rollback two weeks later. +- ⛔ **Five primaries because "they're all important."** Past 3, the false-positive risk dominates. Pick the 1–3 the hypothesis actually predicts; demote the rest to secondaries. +- ⛔ **Primary = "total signups," metric = behaviour change.** A behaviour-change hypothesis needs a rate metric; total signups conflates per-user behaviour with the size of the cohort that entered the experiment. +- ⛔ **Guardrail with default `direction: "up"` on an error / cancel / latency metric.** Silently inverts the regression check. +- ⛔ **30-day retention as primary on a 2-week experiment.** Either the lagging metric can't move (no signal) or it moves on noise (false significance). Use a leading proxy. +- ⛔ **Primary metric only exists in treatment.** Changed denominator. Lift is meaningless. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/pitfalls.md b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/pitfalls.md new file mode 100644 index 0000000..dc7cc18 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/pitfalls.md @@ -0,0 +1,135 @@ +# Pre-launch pitfalls + +This is the catalogue of the deterministic checks the agent runs before the user creates an experiment. **Detection logic lives in the platform's pre-launch validation capability**; this document owns the prose — the _why_ behind each check — so the agent can explain the violation in human terms rather than just nagging. + +For the source-of-truth severities, thresholds, and message templates, see `ai/engine/tools/experiments/_shared/pitfall_prose.py` in `mixpanel/analytics`. When that file changes, this document changes too. + +## Triage order + +The agent surfaces pitfalls in this order: + +1. **Blockers first.** An experiment that triggers a blocker should not launch as-is. Two pitfalls today: `underpowered_duration_insufficient` and `cohort_too_small`. Both mean the experiment literally cannot reach statistical power for the configured MDE. +2. **Warnings next.** Configuration smells that would degrade interpretability or trustworthiness. Most fall here. +3. **FYIs last.** Soft nudges; not blocking even if the user ignores them. + +Within a severity tier, surface in this order (most actionable first): data-trust risks (pre-experiment bias, variance inflation) → configuration nudges (guardrails, hypothesis alignment). + +## The >5% guardrail hard-gate + +The single most important rule in the catalogue. **A 5% relative regression on any guardrail blocks ship even if the primary wins.** + +### Why 5% + +The threshold is calibrated to be tight enough to catch real degradations of user experience, revenue, or performance, and loose enough that day-to-day noise on a moderately-volatile guardrail doesn't trip it on every test. + +- Below 5%: typically within the noise band of most guardrails on a 2-week test. Tightening below 5% would generate too many false alarms. +- Above 5%: the team has implicitly traded measurable user/revenue/performance damage for headline-metric lift. That's not a ship — that's a re-design. + +### Why "hard gate" + +Guardrails are not "things to also look at." They are the **trustworthiness backstop**. A winning primary with a regressing guardrail means the change _exchanged_ something the team agreed must not regress for the headline-metric lift. If guardrails are negotiable, they aren't guardrails. + +### Why explain it to the user + +The most common reaction to a guardrail regression is "but the primary won, can't we just ship?" The agent's job is to make the trade-off explicit: + +> "Primary metric `` won by +2.3pp, but guardrail `` regressed by 7.4%. The 5% threshold exists because guardrails are the trustworthiness backstop — a winning primary with a regressing guardrail means you've traded `` for ``, which is a design choice that needs explicit sign-off, not a ship decision." + +If the team genuinely wants to make that trade, they can disable the guardrail before launch and document the decision in `description`. Don't let them silently override; force the conversation. + +--- + +## The catalogue + +Each entry lists: kind → severity → trigger condition → why it matters → what to recommend. The message templates are in `pitfall_prose.py`; reproduced inline here for context. + +### `underpowered_duration_insufficient` — blocker + +**Trigger.** Expected exposures (`exposures_per_day × planned_days × n_arms`) are less than 50% of the per-arm sample size required to detect the configured MDE at the baseline rate. + +**Why it matters.** The experiment cannot reach statistical power for this MDE no matter how clean the rest of the config is. If launched, the most likely outcome is "inconclusive" — and a non-trivial fraction of those inconclusive results will be due to noise crossing the significance threshold rather than a real effect, the winner's-curse problem. + +**Recommendation.** Extend planned duration by roughly `(n_required − expected_exposures) / exposures_per_day` days, OR relax the MDE (only ship if the lift is bigger), OR pick a higher-volume primary metric, OR enable CUPED if pre-exposure data is available (which can cut required `n` by 30–70%). + +### `cohort_too_small` — blocker + +**Trigger.** Cohort size is smaller than `num_arms × target_sample_size`. The cohort cannot supply enough eligible users. + +**Why it matters.** Same root cause as the duration blocker, different lever. Even with infinite time, the experiment will run out of eligible users before each arm reaches the per-arm target. + +**Recommendation.** Either expand the cohort to ~`num_arms × target_sample_size` eligible users (relax filters, broaden segment, extend eligibility window), or lower the per-arm target sample size to what the cohort can actually supply (and accept the larger achievable MDE that comes with it). + +### `pre_experiment_bias_likely` — warning + +**Trigger.** Retrospective A/A is enabled, at least one continuous-ish metric (continuous, retention, or funnel) is configured, AND CUPED is off. + +**Why it matters.** Pre-experiment bias is likely on metrics with seasonality or power-user skew. Without CUPED to absorb the baseline difference, post-experiment lifts will inherit it — the team will see "treatment up 2%" when the real treatment effect is 0% and the baseline difference is +2%. + +**Recommendation.** Enable CUPED with a 2–4 week pre-exposure window. CUPED specifically handles this case: it regresses out the pre-exposure baseline difference so the post-exposure lift is the actual treatment effect. + +### `high_variance_no_winsorization` — warning + +**Trigger.** At least one continuous-ish metric is configured AND Winsorization is off. + +**Why it matters.** Outliers will inflate variance and widen confidence intervals. A handful of power users can dominate the per-arm mean, swinging the headline based on which arm those users got assigned to. + +**Recommendation.** Enable Winsorization with default percentile 95. Push back if the user sets percentile <80 (that's >20% of values capped — almost always a misconfiguration). + +### `multiple_primaries_no_bonferroni` — warning + +**Trigger.** ≥2 primary metrics configured AND multiple-testing correction is off. + +**Why it matters.** Family-wise false-positive rate compounds with each additional primary. At 3 primaries the FPR is ~14.3%; at 5 it's ~22.6% — more than one in five "wins" is noise. + +**Recommendation.** Enable multiple-testing correction. Default to Benjamini-Hochberg (more powerful with correlated metrics); use Bonferroni for strict family-wise error control. The name of this pitfall is historical — the correction need not be Bonferroni specifically. + +### `underpowered_duration_marginal` — warning + +**Trigger.** Expected exposures are between 50% and 100% of the per-arm sample size required for the configured MDE. + +**Why it matters.** Marginally underpowered. The experiment might reach significance on a true effect; it might not. Either way, the lift estimate at conclusion will be wider than expected. + +**Recommendation.** Extend duration to reach 100%+ of the required sample, or accept the higher Type-II error rate (more chance of missing a real effect). Less urgent than the `_insufficient` variant. + +### `missing_guardrails` — warning + +**Trigger.** Zero guardrail metrics configured. + +**Why it matters.** Without guardrails, there's no >5% hard-gate to block a ship on a regression to user experience, revenue, or performance. The team is implicitly trusting that the primary metric captures every relevant impact — which is rarely true. + +**Recommendation.** Add at least one guardrail covering the most likely failure mode of the change. Standard choices: + +- UI change → page-load time or error rate. +- Monetisation / pricing → cancel rate or refund rate. +- Engagement change → Day-7 retention or session count. +- Performance change → error rate or crash rate. + +### `hypothesis_metric_mismatch` — warning + +**Trigger.** The hypothesis text mentions one of the canonical metric nouns (`conversion`, `retention`, `revenue`, `signup`, `engagement`, `click`, `purchase`) but no primary metric's name appears to measure that outcome. + +**Why it matters.** Soft signal — the heuristic is coarse, but it catches the common case where the user wrote "X will increase conversion" and then set the primary to "session count" or vice versa. If the user's hypothesis is about conversion, the primary should be a conversion metric. + +**Recommendation.** Phrase as a question, not a verdict: _"Your hypothesis mentions ``, but no primary metric name suggests it measures that. Should `` be replaced or supplemented with a metric that more directly tests the hypothesis?"_ + +### `primary_lacks_leading_indicator` — warning + +**Trigger.** Primary metrics include a retention-type metric (lagging by construction) AND no leading-indicator secondary (conversion or funnel type) is configured. + +**Why it matters.** A retention primary is valid but reads slowly — there may not be enough signal to interpret results before the experiment concludes. Without a leading-indicator secondary, the agent has no early-read evidence to reason from. + +**Recommendation.** Add a leading-indicator secondary metric (a conversion or funnel metric measured within the experiment runtime). The retention primary stays as the ship decision; the secondary just gives early visibility. + +--- + +## Detection vs prose + +The detection math lives in the platform's pre-launch validation capability. The prose lives here and in `pitfall_prose.py`. The two are connected by the pitfall `kind` field — the validation step reports the kind, the agent renders the message. + +This separation lets product retune phrasing (this document, `pitfall_prose.py`) without touching the detection helpers, and vice versa. When you update a threshold (e.g., the 50% / 100% bounds on the underpowered checks), update the helper math, the `_shared/pitfall_prose.py` constant, and the recommendation in this document together — the agent will quote stale numbers if any of the three drifts. + +## What's not in the catalogue (yet) + +- **Cross-test contamination** — when the same users are eligible for multiple concurrent experiments on the same surface. Hard to detect statically; usually surfaces as anomalous variance at interpretation time. +- **Novelty effect detection** — early days of the experiment show inflated treatment effect, then settle. Not a pre-launch check; lives in the post-launch interpretation skill. +- **Seasonality misalignment** — running a 2-week experiment that doesn't align to weekly cycles. Today this is detected indirectly via the duration check; a future explicit `seasonality_misaligned` pitfall is a reasonable add. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/prior-experiments.md b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/prior-experiments.md new file mode 100644 index 0000000..5a0beee --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/prior-experiments.md @@ -0,0 +1,81 @@ +# Prior experiments + +When a user proposes an experiment on a feature, **the first thing to do is look up prior experiments on that feature**. Skipping this leads to redundant tests, contradictory ship decisions, and wasted traffic. + +## The lookup + +Search the project's prior experiments by keywords drawn from the feature name and surface area. Cast the net wide on the first call — single-keyword searches catch related experiments the user may have forgotten about. + +If no prior-experiments lookup is available in the current environment, tell the user explicitly that you couldn't check and proceed. Don't fabricate "no prior tests found" — that's worse than admitting the blind spot. + +## What to do with what you find + +### Same feature already tested and shipped + +Reference the prior result before recommending a new test. The right answer is often "don't re-run; iterate on a new hypothesis." + +> "There's a prior experiment from [date] on the same feature with a similar hypothesis: it shipped at +X% on metric Y. Re-running won't tell us anything new. What's different about the change you're proposing? Is the new hypothesis about a different sub-population, a different metric, or a different mechanism?" + +If the user does want to re-run (e.g., the population has shifted significantly, the underlying product has changed, or the prior test was clearly underpowered), proceed — but design the new test to specifically address what's different from the prior. + +### Same feature tested and killed + +Treat this as a strong prior. Ask why the user thinks the new variant will work where the prior didn't. + +> "Prior experiment [date] on the same surface killed at [-X% / inconclusive]. What's different about your change that should produce a different outcome? If the prior failed because of [mechanism], does your change address that?" + +If the user can articulate a different mechanism, run the new test. If they can't, the most likely outcome is a repeat of the prior result — discourage the test or downgrade its priority. + +### Earlier iteration of the same hypothesis + +Use the prior result to inform the new design — specifically, **baseline rates and variance estimates**. Prior data is much more reliable than guessing. + +- Pull the prior's control-arm baseline rate; use it as the baseline for the new sizing calculation. +- Pull the prior's observed variance; use it instead of estimating from scratch. +- Pull the prior's exposure rate (exposures per day per variant); use it to set a realistic duration estimate. + +This often shrinks the required sample size or shortens the planned duration. Both are wins worth surfacing. + +### Recently concluded with similar metrics + +Pull the realised exposure rate. The "expected exposures per day" the user has in mind is usually higher than what actually shows up in a real experiment on the same surface — eligibility filters, opt-outs, and bot exclusion all bite. Use the prior's actual rate, not the theoretical one. + +### Multiple prior experiments on adjacent surfaces + +Look for **patterns**, not single data points. If three prior tests on the same funnel stage all moved in the same direction by similar magnitudes, that's the realistic prior for what the new test will do. If the prior tests are noisy or contradictory, treat the new test's expected lift with more uncertainty and consider running it longer. + +## Folding prior results into the new design + +Concretely, when you have a prior result that's relevant, the setup workflow changes as follows: + +| Step | Without prior | With prior | +| -------------------------- | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | +| Step 1 — hypothesis | Coach from scratch | Anchor on the prior's hypothesis; ask what's different | +| Step 2 — metric selection | Suggest standard primaries/guardrails | Use the prior's metric set as the default; modify only with reason | +| Step 3 — sizing | Query baseline + variance over the prior window | Use the prior's observed baseline and variance | +| Step 4 — statistical model | Default to sequential / benjamini-hochberg | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | +| Pitfall check | Run the standard catalogue | Cross-reference: did the prior have an SRM problem? A guardrail regression that should be set up as primary this time? | + +## When prior tests warn you away from testing at all + +Sometimes the prior data tells you the right answer is **don't run the experiment**: + +- The metric the user wants to move has been tested 4 times on this surface in the last year, all with inconclusive or null results, all adequately powered. The hypothesis-space is likely exhausted; suggest a different mechanism or a different surface. +- The baseline rate is so low that even the prior, well-powered tests couldn't detect anything below a 30% relative lift. The new test would inherit the same constraint. Either pick a higher-volume proxy metric or accept that the change has to be very large to be detectable. +- Recent guardrail regressions on the same surface suggest the surface is unstable; running more experiments without first fixing the trust issue is wasted traffic. + +Surface these findings as recommendations, not blockers. The user might have context the prior data doesn't capture. + +## What to record about the new design's relationship to prior tests + +In the experiment's `description` field, link to the prior experiment(s) and note how the new design differs. This becomes critical at interpretation time — the post-launch step uses the prior context to calibrate its read of the new result. + +A useful template: + +``` +Prior: tested on , result: . +This experiment differs by: . +Inherited from prior: baseline rate (X%), σ², exposure rate (N/day/variant). +``` + +This is a 30-second annotation that pays back tenfold at analysis time. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/routing-xp-vs-ff.md new file mode 100644 index 0000000..a38a72d --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/routing-xp-vs-ff.md @@ -0,0 +1,85 @@ +# XP vs FF: routing intent + +Before any setup work, decide whether the user actually wants an **experiment** (XP) or just a **feature flag** (FF). The decision is binary, but the language users use is blurry — "let's A/B test this" sometimes means "let's run a controlled experiment with a hypothesis and a stopping rule," and sometimes means "I want to ship it to 10% of users and see if anything breaks." + +## The discriminator + +| If the user wants… | Then it's a… | +| --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | +| Causal evidence — "does this change move metric X by enough to justify shipping?" | **Experiment** (XP). | +| Progressive rollout — "ship to 10%, then 50%, then 100% if nothing breaks." | **Feature flag** (FF). | +| Kill-switch — "I want to be able to turn this off instantly if it goes sideways." | **Feature flag** (FF). | +| Per-segment gating — "only show this to enterprise customers." | **Feature flag** (FF). | +| Targeted access — "give beta access to these 50 design partners." | **Feature flag** (FF). | +| Both — "ship to 10%, but also tell me if it moves checkout conversion." | **Experiment** with a phased rollout, or **FF + a separate experiment** later. | + +The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. Every experiment uses a feature flag under the hood (Mixpanel auto-creates one when you call `create_experiment`); not every feature flag use case needs an experiment. + +## Disambiguation prompt + +When you can't tell from the user's wording, ask once, plainly: + +> "Are you trying to **measure** whether this change moves a metric (experiment), or are you rolling it out gradually / behind a flag with **no measurement criterion** (feature flag)? An experiment commits to a hypothesis, metrics, and a stopping rule; a feature flag is purely a delivery mechanism." + +Listen for these signals in the answer: + +- "I want to see if it improves X" / "if checkout conversion goes up" → experiment. +- "I want to make sure it doesn't break X" → could be either. Probe: "Is 'doesn't break' a measurable threshold, like a guardrail, or is it 'I'll watch dashboards and roll back if it's obviously bad'?" +- "I want enterprise to get it first" / "I want to roll out by region" → feature flag. +- "I just want a kill switch" → feature flag. +- "I want to ship it and prove ROI later" → ask whether the proof needs to be causal. If yes, that's an experiment, and it should be set up _before_ shipping, not after. (Post-hoc ROI claims from a flag rollout are not credible.) + +## Common ambiguous cases + +### "Ship to 10% as an experiment" + +Often this means "phased rollout, monitor metrics, ramp if nothing regresses." That's a feature flag with manual ramp logic, not an experiment. + +Ask: "Do you have a primary metric you're committing to before launch, with an MDE that decides whether to ship to 100%?" If yes, run as an experiment. If no, ship as a flag. + +### "I want to test the new pricing on enterprise customers" + +If "test" means "see how they react and decide whether to roll out," and the audience is small (a few enterprise customers), that's a **rollout**, not an experiment. Enterprise samples are usually too small to power an experiment, and the per-account variance is too high for a meaningful aggregate. + +Run as a flag, gather qualitative feedback, and decide based on the conversations — not on a p-value computed from N=4. + +### "Hold out a control while we ship to 100%" + +This is the classic "holdout experiment." Legitimate use case, but it has to be set up as an experiment up front (with a primary metric and a duration), not retroactively. After-the-fact holdout analysis suffers from selection bias and is not credible. + +If the user has already shipped to 100% and wants to "analyse the effect," there is no experiment to set up. Tell them so, and suggest a forward-looking test on the next change to the same surface. + +### "Just give me an A/B test, the simplest one" + +Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through Step 1 (hypothesis) and Step 2 (metrics) of the main workflow — the cost is 10 minutes; the value is having a result you can actually act on. + +### "I want a feature flag but with stats" + +Now you're back to an experiment. Run the full setup workflow. + +## What changes once you've routed + +### If experiment + +Continue with the four-step setup workflow in the main `SKILL.md`. The output of this skill is a configured experiment ready to launch. + +### If feature flag + +This skill stops. Hand off to the user (or to a `manage-feature-flags` skill if one exists): + +- They configure variants, targeting, and rollout percentages directly. +- No hypothesis, no MDE, no stopping rule needed. +- Mixpanel doesn't compute lift or significance on a flag — they're on their own for observation. + +Make sure the user understands the trade-off explicitly: "Choosing flag means you give up the ship/no-ship decision criterion. If later you want to claim the change worked, that claim won't have the same evidentiary weight as a properly-designed experiment." + +## Don't run an experiment when + +There are cases where an experiment is technically possible but the wrong move: + +- **Sample is too small.** Enterprise rollouts to ~10 accounts cannot power a real test. Ship as a flag and use qualitative feedback. +- **Treatment is risky/irreversible.** A real billing change with potential refunds shouldn't run as a 50/50 split — phase as a flag with conservative rollout and direct monitoring. +- **No baseline data.** Brand-new metric, brand-new feature, no historical observation. Run a 1–2 week passive observation period first, then design the experiment from real numbers. +- **Hypothesis is "let's see what happens."** No directional commitment means the test will be interpreted post-hoc, which is the same as not running an experiment. + +Suggest the alternative explicitly so the user doesn't feel rejected — "this isn't an experiment-shaped problem; here's what to do instead." diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/sizing.md b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/sizing.md new file mode 100644 index 0000000..8122307 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/sizing.md @@ -0,0 +1,109 @@ +# Sizing the experiment + +You almost never know the right sample size by guessing. Pull the data first, then run the math. + +## The standard formula + +Required sample size per variant (two-sample, two-sided test at 95% confidence, 80% power): + +``` +n = 16 × σ² / d² +``` + +Where: + +- `σ²` = variance of the metric (depends on metric type — see below). +- `d` = MDE in the same units as the metric. + +The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise formula in `references/statistical-model.md`. + +## Variance by metric type + +- **Bernoulli (conversion rate).** `σ² = p(1−p)` where `p` is the baseline conversion rate. Variance peaks at `p = 0.5` (variance 0.25) and shrinks toward 0 at `p = 0` or `p = 1`. Lifts are easier to detect on rates near 50%, harder near the extremes. +- **Poisson (event counts per user).** `σ² ≈ mean count per user`. High-count metrics need proportionally more sample. +- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization (`references/advanced-features.md`) cuts this. + +## Worked example + +Detecting a 5% **relative** lift on a 10% baseline conversion rate at 80% power, 95% confidence: + +``` +p = 0.10 +σ² = 0.10 × 0.90 = 0.09 +absolute MDE = 0.10 × 0.05 = 0.005 +n = 16 × 0.09 / 0.005² = 16 × 0.09 / 0.000025 = 57,600 per variant +``` + +That's ~57,600 per variant for a 5% relative lift — humbling, and surprising to most teams. Most "we'll just run it for two weeks" plans don't survive contact with this number. + +## Kohavi's inverted formula + +For most online experiments, traffic is the constraint, not patience. Pick a duration (2–4 weeks captures weekly cycles), use all available traffic in that window, then compute the **achievable MDE**: + +``` +MDE = 4σ / √n +``` + +This tells the user: "given your traffic, the smallest effect you can reliably detect is X." If that achievable MDE is larger than the lift the user actually expects, the experiment is **underpowered**. Flag immediately. + +Underpowered experiments suffer from **winner's curse**: if you do reach significance, the lift estimate is exaggerated, because only the high-variance positive realisations crossed the threshold. The post-launch result then fails to replicate, and the team learns "experiments are unreliable" rather than "this experiment was underpowered." + +## Estimating the inputs from real data + +For each primary metric, before sizing, you need three numbers: + +1. **Baseline rate** — query the metric over the prior 2–4 weeks (the longer of: one full business cycle, or four weeks). Record `mean` and `variance`. Use the same event definition, segment filters, and unit-of-analysis you'll use in the experiment — a baseline computed differently from how the metric is configured in the experiment is worse than no baseline at all. +2. **Daily traffic** — query the exposure event (or whatever event qualifies users for the experiment) over the same window, grouped by day. Average to get expected exposures per day per variant. +3. **MDE the user wants** — ask explicitly. _"What's the smallest lift that would be worth shipping?"_ If they don't know, propose a 5–10% relative lift and confirm. + +From those three: + +``` +required_sample_per_variant = 16 × σ² / (baseline × MDE_relative)² +required_days = required_sample_per_variant × n_variants / daily_traffic_per_variant +``` + +If `required_days > 28` (four weeks), the experiment is **underpowered for the requested MDE on available traffic**. Tell the user. Don't wave it through. + +## Five remediations when the experiment is underpowered + +Offer these in order of cost — cheap first. + +1. **Accept a larger MDE.** Only commit to ship if the effect is bigger. This costs nothing but redraws the success criterion; confirm the user is OK with shipping only on a larger lift. +2. **Increase traffic allocation to the experiment.** If other tests don't need the traffic, give this one more. +3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. See `references/advanced-features.md`. +4. **Pick a higher-volume primary metric** (if the hypothesis allows). Often there's a leading proxy with more volume than the lagging metric the team originally chose. +5. **Don't run the experiment.** Invest the engineering elsewhere. Sometimes the right answer. + +## Sample-size floor + +Independent of the math: never set `sampleSize` below ~350–400 per variant. Below this, the statistical machinery itself becomes unreliable — CLT breaks down, the SRM check gets noisy. The Mixpanel default of 10,000 per variant is fine for most tests; 1,000 is the practical floor; 350–400 is the absolute floor. + +If the math says `n = 50` per variant, the test is either trivially easy (the lift is huge) or the variance estimate is wrong. Sanity-check before launching at the floor. + +## Lookup table (Bernoulli, 95% conf, 80% power) + +For a Bernoulli (conversion-rate) primary metric at 95% confidence, 80% power, two-sided test, MDE expressed as a **relative** lift on the baseline: + +| Baseline rate | MDE = 5% relative | MDE = 10% relative | MDE = 20% relative | +| ------------- | ----------------- | ------------------ | ------------------ | +| 1% | ~633k / variant | ~158k / variant | ~40k / variant | +| 5% | ~122k / variant | ~31k / variant | ~7.6k / variant | +| 10% | ~58k / variant | ~14k / variant | ~3.6k / variant | +| 25% | ~19k / variant | ~4.8k / variant | ~1.2k / variant | +| 50% | ~6.4k / variant | ~1.6k / variant | ~400 / variant | + +Use this for quick sanity-checking. Always confirm with a query against actual baseline data — these are illustrative. + +## Sample-size growth with variants + +For a multi-arm test (N non-control variants), the per-variant target grows with the number of pairwise comparisons being made (each treatment vs control). With multiple-testing correction enabled (which is the right default at 2+ variants), the per-test α tightens, which inflates required sample size further. + +Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method — see `references/advanced-features.md`. + +## Duration considerations + +- **Minimum 1 week** — anything shorter misses weekly seasonality and conflates the day-of-week mix between control and treatment if traffic differs across days. +- **Minimum 3 days for read-out** — even with sequential testing and big effects, results under 3 days are typically un-interpretable (cohort hasn't stabilised, day-of-week effects dominate, novelty effect not separated from treatment effect). +- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, set `endCondition: "days"` and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. +- **Cap at ~6 weeks** for most tests — beyond this, novelty effects wear off, the user population drifts, and other experiments running in the same window create cross-test contamination. If the math says you need 8+ weeks, you're underpowered — pick a remediation from the list above. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/statistical-model.md b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/statistical-model.md new file mode 100644 index 0000000..6158008 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-setup/references/statistical-model.md @@ -0,0 +1,100 @@ +# Statistical model + +Once required sample size and acceptable duration are known, two settings are left: `settings.testingModel` and `settings.endCondition`. Plus the two adjacent settings that change how those tests are interpreted: `settings.confidenceLevel` and `settings.multipleTestingCorrection`. + +## Testing model: sequential vs frequentist + +**Default to `sequential`** for most users. Peeking is the most common Mixpanel customer mistake, and sequential testing makes early-look safe by design. + +### Pick `sequential` when + +- The user expects a **large lift** and wants to confirm or reject the hypothesis quickly. Sequential lets you stop the moment significance is reached — often days or weeks before a frequentist target. +- The user wants to check results before the experiment ends and act on them (early-stop on a clear winner). +- The expected effect size is uncertain (could be huge, could be tiny). Sequential adapts; frequentist needs you to commit to one MDE up front. +- The team will look at intermediate results regardless. Sequential prevents peeking from inflating false positives. +- The user is comfortable with slightly more complex stopping rules ("stop when the test-statistic crosses the boundary," not "stop when n reaches N"). + +### Pick `frequentist` when + +- The user is hunting for a **very small lift** (e.g. 1–2% relative on a high-volume metric). Frequentist's fixed-sample design is statistically more efficient at the margin and avoids the early-stop boundary inflation that costs power on tiny effects. +- The team is comfortable waiting for the full sample before checking results — no peeking. +- The team prefers wider industry familiarity ("we used a t-test"). +- The user wants the simplest reportable statistics (a single p-value and confidence interval at the end). +- The team has internal training / tooling that assumes frequentist. + +### The "I want to peek with frequentist" trap + +The most common request is "I want frequentist, but I also want to look at the results during the test." This inflates the false-positive rate enormously — naive peeking on a frequentist test at 5 evenly-spaced check-ins pushes the family-wise α from 5% to ~14%. + +Switch them to sequential. Sequential's whole point is making peeking safe. + +If the user insists on frequentist + peeking (some teams do, for tooling reasons), document the decision in `description` so the interpretation step later knows the reported p-values overstate confidence. + +## End condition: sample_size vs days + +### Pick `sample_size` when + +- The team has a target MDE and wants the experiment to stop the moment the required sample is reached. Adaptive duration. +- Daily traffic is highly variable. Sample-size-based ends absorb the variability; date-based ends don't. +- There's no strong seasonality in the primary metric that would bias a mid-cycle stop. + +### Pick `days` when + +- The primary metric has **strong weekly (or other periodic) seasonality**. Pin the duration to a multiple of the seasonal cycle so each variant sees the same mix of high- and low-traffic periods. + - A common pattern: customers with strong weekday/weekend behaviour shifts run all experiments in 1-week increments (or 2 weeks for a stricter check) to fully capture each cycle. + - A `sample_size` end can fire mid-cycle and produce biased results in this case. +- The team has a fixed business window (e.g. "we want to ship by end of quarter"). +- The team has historically struggled with experiments running indefinitely. +- The hypothesis specifically requires a calendar window (e.g. a holiday-season test). + +### Combinations + +All four combinations are valid. The one customers most often miss is **`frequentist + days`** — some teams prefer time-based experiments for operational reasons even when running frequentist tests. Don't flag this as a misconfiguration. + +The one that's actually wrong is **`frequentist + sample_size + peeking`** — that's the "peeking trap" above. Surface it; switch them to sequential. + +## Confidence level + +Default `settings.confidenceLevel: 0.95` (α = 0.05). Change only with intent. + +- **`0.99`** — for high-stakes irreversible ships (e.g. billing changes, deletion-flow changes, anything regulatory). Higher false-negative cost; accept it. Document the reason in `description`. +- **`0.90`** — for low-stakes exploratory tests where speed matters more than rigour. Acknowledge the inflated false-positive rate to the user explicitly: at α = 0.10, one in ten "wins" is noise. + +Any change away from 0.95 belongs in `description`. The post-launch interpretation step uses this field to read the result correctly; without it, a "win" at 0.90 looks the same as a "win" at 0.95. + +## Multiple testing correction + +Enable when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. Without correction, the family-wise false-positive rate compounds: + +| Primaries | Non-control variants | Family-wise FPR at per-test α = 0.05 | +| --------: | -------------------: | -----------------------------------: | +| 1 | 1 | 5.0% | +| 2 | 1 | ~9.75% | +| 3 | 1 | ~14.3% | +| 5 | 1 | ~22.6% | +| 5 | 2 | ~40.1% | +| 5 | 3 | ~53.7% | + +The takeaway: by the time you're testing 5 primaries on a 3-arm experiment, more than half of the "wins" are noise. + +Two methods are available: + +- **`"bonferroni"`** — divides α by the number of tests (`n_primary × n_non_control_variants`). Simple and conservative. Guarantees the family-wise error rate stays below α, but can be overly strict when many primary metrics are correlated, hurting power. +- **`"benjamini-hochberg"`** — controls the **false discovery rate** (FDR) instead of the family-wise error rate. Ranks all primary-metric p-values and applies progressively looser thresholds. More powerful than Bonferroni when there are many primary metrics, especially when some have real effects. Preferred when the user has 3+ primaries or correlated metrics. + +**Default to `"benjamini-hochberg"`** for most experiments — less conservative, better suited to typical designs with correlated metrics. Use `"bonferroni"` when: + +- The user needs strict family-wise error control (regulatory, high-stakes decisions where any single false positive is unacceptable). +- The primary metrics are independent (no shared drivers / overlapping populations), in which case Bonferroni's conservatism is not a real cost. +- The team explicitly asks for the simplest method to defend in a review. + +Set `settings.multipleTestingCorrection: "off"` **only** when there's a single primary and a single non-control variant. + +## Power vs significance trade-off + +When the user pushes you on `confidenceLevel`: + +- Raising α from 0.05 to 0.10 increases power (smaller required sample for the same MDE) but doubles the rate of false-positive "wins." +- Lowering α from 0.05 to 0.01 cuts the false-positive rate fivefold but requires roughly 1.5× the sample for the same MDE. + +If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED (`references/advanced-features.md`) or a higher-volume proxy metric. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-setup/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-setup/SKILL.md new file mode 100644 index 0000000..7de583c --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-setup/SKILL.md @@ -0,0 +1,151 @@ +--- +name: experiment-setup +description: "Coach an experimenter through designing a Mixpanel experiment before launch — hypothesis framing, metric roles, statistical model, sizing, advanced features (CUPED / Winsorization / Bonferroni), and pitfall avoidance. Use when the user wants to set up, configure, design, plan, or sanity-check a new A/B test, feature-flag experiment, or growth experiment. Also trigger on phrasings like 'help me set up an experiment', 'design an A/B test', 'should this be sequential or fixed', 'what MDE can I detect', 'how long should this run', 'is my experiment configured correctly', 'pre-launch checklist', 'should I use CUPED / Winsorization / Bonferroni', 'is this an experiment or just a feature flag', or when the user names a specific feature they want to test. Do NOT use for post-launch results analysis ('how did experiment X do?', 'should we ship?', 'why is SRM failing?') — that belongs to the `experiment-results` skill. Do NOT use for plain feature-flag rollouts with no measurement criterion — that belongs to the `feature-flags` skill." +license: Apache-2.0 +--- + +# Experiment Setup + +Coach the user through designing a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, and the sample size + traffic dictate duration and testing model. Reach into `references/` only when a step needs depth. + +## Requirements + +- Access to Mixpanel (event schema, run queries, create experiments and feature flags). +- Access to a prior-experiments lookup when one is available — the skill works without it, but degrades gracefully and tells the user what it skipped. + +## When to use this skill + +Trigger on any of: + +- "Set up / design / configure / plan an experiment on ``." +- "Help me write a hypothesis." +- "What MDE can I detect with my current traffic?" +- "Should this be sequential or fixed-horizon?" +- "Should I enable CUPED / Winsorization / Bonferroni?" +- "How long should this experiment run?" +- "Is this an experiment or should I just ship a feature flag?" +- "Sanity-check / pre-launch / pitfall-check this experiment configuration." + +Do **not** trigger for post-launch analysis ("how did experiment X do?") — that's the `experiment-results` skill. + +--- + +## Pre-flight: route and check for prior work + +**Route XP vs FF before designing.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill-switch, or per-segment gating with no decision criterion → feature flag (route to the `feature-flags` skill). If ambiguous, ask once: "Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?" Deeper disambiguation in [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md). + +**Always search for prior experiments on the same feature first** (by keyword from the feature name, when the lookup is available). Surface anything you find — re-running settled questions wastes traffic, and prior baseline/variance numbers sharpen the new MDE. See [references/prior-experiments.md](references/prior-experiments.md) for the fold-in playbook. + +--- + +## Workflow: 4 steps + +Run in order. Each step's output is the next step's input. + +### Step 1 — Write the hypothesis + +A good hypothesis is a **falsifiable directional claim with a stated mechanism**: + +> **If** ``, **then** `` will ``, **because** ``. + +If vague, hold the user to four commitments: the change, the primary metric, the direction, and the smallest effect worth shipping (the MDE). The "because" forces them to check whether the metric they picked is actually downstream of the change. Deeper rubric and misalignment patterns in [references/hypothesis-framing.md](references/hypothesis-framing.md). + +### Step 2 — Pick metrics that test the hypothesis + +Each metric serves one role: + +- **Primary (1–3 max).** Decides ship/no-ship. Comes from the hypothesis's outcome clause. Each additional primary inflates the false-positive rate. +- **Guardrail (0+, strongly recommended).** Must not regress. A >5% relative regression on any guardrail blocks ship even if the primary wins. +- **Secondary (0+).** Diagnostic only. Never decisional. + +Every primary and guardrail needs an explicit `direction` (`"up"` or `"down"`). The default `"up"` is wrong for cancel / error / latency / abandon metrics — leaving it default silently flips polarity at interpretation. Watch for **lagging-metric / window mismatch** (30-day retention as primary on a 2-week experiment) and the **changed-denominator** trap (metric defined only over treatment-exposed users). Full sanity checklist in [references/metric-selection.md](references/metric-selection.md). + +### Step 3 — Size the experiment with historical data + +Pull baseline rate, variance, and daily traffic from Mixpanel; don't guess. The standard formula (two-sample, two-sided, 95% confidence, 80% power): + +``` +n = 16 × σ² / d² (per variant; Bernoulli σ² = p(1−p)) +``` + +Inverted for traffic-bound teams — the smallest detectable effect at your traffic: + +``` +MDE = 4σ / √n +``` + +If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface this immediately (winner's curse, etc.). Sample-size floor: never below ~350–400 per variant (CLT breaks down, SRM check gets noisy). Worked examples, baseline-lookup table, and the five remediations for underpowered experiments are in [references/sizing.md](references/sizing.md). + +### Step 4 — Pick testing model + end condition + +**Default to `sequential`** for most users. Peeking is the most common customer mistake; sequential makes early-look safe. Override to `frequentist` for small-lift hunts on well-sized experiments, or when the team needs t-test familiarity. + +**End condition.** `sample_size` when daily traffic is variable; `days` when the primary metric has strong weekly seasonality. Frequentist + days is supported — don't flag it. + +**Confidence level.** Default 0.95. Bump to 0.99 only for irreversible high-stakes ships; drop to 0.90 only for exploratory low-stakes tests (and tell the user the family-wise FPR is inflated). + +**Multiple-testing correction.** Auto-needed when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. Default to `"benjamini-hochberg"`; use `"bonferroni"` for strict family-wise control (regulatory). Without correction the family-wise FPR climbs fast — 5 primaries × 3 variants → ~54%. + +Full decision tree and worked numbers in [references/statistical-model.md](references/statistical-model.md). + +--- + +## Advanced features (rationale: [references/advanced-features.md](references/advanced-features.md)) + +- **CUPED** — variance reduction. Enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history exists. Do not enable on new-user-only experiments or one-time-event metrics. +- **Winsorization** — caps extreme values. Enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli metrics. Push back if `percentile < 80`. + +## Pre-launch pitfall check + +Before the user creates the experiment, run the pitfall catalogue in [references/pitfalls.md](references/pitfalls.md). Surface only what fires on the current config; order blockers → warnings → fyi. + +Two **blockers** that should stop launch: + +- `underpowered_duration_insufficient` — expected exposures < 50% of per-arm sample size for the configured MDE. +- `cohort_too_small` — eligible cohort < `num_arms × target_sample_size`. + +The **>5% guardrail hard-gate** rationale: a 5% relative regression on any guardrail blocks ship even if the primary wins. A winning primary with a regressing guardrail trades headline lift for damage to something the team explicitly said must not regress — not a ship. + +--- + +## Output + +Present a compact summary the user confirms before you create the experiment: + +``` +*Experiment Setup Summary* + +• *Hypothesis:* If , then will by ≥, because . +• *Primary metrics:* (direction: up/down), … +• *Guardrails:* (direction: …), … +• *Variants:* control 50% / treatment 50% (or as configured) +• *Statistical model:* sequential | frequentist +• *End condition:* sample_size (per-arm ) | days ( days) +• *Confidence level:* 0.95 +• *Multiple testing correction:* benjamini-hochberg | bonferroni | off +• *Advanced features:* CUPED on/off · Winsorization on/off (percentile

) +• *Expected duration on current traffic:* days +• *Achievable MDE on current traffic:* % relative + +*Pitfall check:* +✅ Underpowered duration — adequate +✅ Cohort size — adequate +⚠️ +``` + +Wait for explicit confirmation before creating the experiment. + +## Writing style + +- Lead with the hypothesis. Every other decision flows from it. +- Use concrete numbers from real data ("baseline 4.2%, σ² = 0.040, required n ≈ 6,400/arm"), not vague guidance. +- Quote the user's MDE and metric names back so they catch typos. +- When underpowered, say so plainly and list remediations in order of cost. +- Don't moralise about peeking — switch them to sequential. +- Guardrail regressions are hard gates, not "slight concerns." + +## Related skills + +- `experiment-results` — post-launch analysis. Use after the experiment ships. +- `feature-flags` — pure rollout / kill-switch / gating without measurement. +- `create-dashboard` — live monitoring of primary + guardrail metrics during the experiment. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-setup/references/advanced-features.md b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/advanced-features.md new file mode 100644 index 0000000..6c9ef41 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/advanced-features.md @@ -0,0 +1,103 @@ +# Advanced features + +Three optional settings most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong. + +## CUPED — variance reduction + +**What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample. + +**Setting:** `settings.cuped.enabled = true`, with `settings.cuped.preExposureDatePreset` choosing the pre-exposure window. + +### When to enable + +- The primary metric correlates with users' pre-exposure behaviour on the same metric. Strong correlations: revenue, engagement (events per user), retention, time-on-platform. Weak correlations: anything one-time or onboarding-specific. +- **All experiment users existed before the experiment start** — i.e., not a new-user-only cohort. CUPED needs a pre-exposure observation period; new users don't have one. +- A 2–4 week pre-exposure window is available with stable behaviour. If the metric was launched 5 days ago, CUPED has nothing to read. + +### When NOT to enable + +- New-user-only experiments. No pre-exposure data exists. CUPED gives zero variance reduction and adds noise. +- Brand-new metrics without historical data. +- Metrics where pre-exposure behaviour is not predictive of post-exposure (e.g., one-time onboarding events: the user either did or didn't complete onboarding once; pre-exposure has nothing to say about it). +- Pre-exposure window short enough that the behaviour you'd "control for" is itself a transient spike (e.g., metric just had a viral moment last week). + +### Pre-exposure window presets + +- `"2-weeks"` — fast-moving metrics with no strong weekly seasonality. +- `"4-weeks"` — most metrics with weekly seasonality (default sweet spot). +- `"60-days"` — deeply seasonal metrics like spend. +- `"90-days"` — long-cycle metrics (renewal-driven revenue, etc.). + +### What changes downstream + +- Required sample size shrinks by the variance-reduction factor. A 50% variance reduction on a primary that needed 60k per arm shrinks the target to ~30k per arm. +- The point estimate of the lift is unchanged. CUPED is a variance-reduction technique, not a bias correction; the headline lift is the same, the confidence interval is narrower. +- The post-launch interpretation step needs to know CUPED was on, because the standard error formula differs. The setting is persisted on the experiment object; the interpretation step reads it automatically. + +## Winsorization — outlier handling + +**What it does.** Caps extreme values at a percentile boundary (default 95th, i.e. cap the top 5% and bottom 5% at the 95th and 5th percentile values respectively). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. + +**Setting:** `settings.winsorization.enabled = true`, with `settings.winsorization.percentile` choosing the cap point. + +### When to enable + +- Revenue or spend metrics with whales (one customer spends 100× the median; that customer assigned to treatment is enough to swing the headline). +- Time-on-page or session-duration metrics with users who fall asleep on the page (one session at 8 hours dwarfs 10,000 sessions at 30 seconds). +- Any Gaussian-distributed metric with a heavy right tail (count metrics, event volume per user, page view counts). + +### When NOT to enable + +- Bernoulli (conversion) metrics. Capping a 0/1 outcome is meaningless; the 95th percentile of a 0/1 distribution is also 0 or 1. +- Metrics where the tail behaviour **is** the hypothesis. If the test is "did this change move whale spending?", Winsorization throws away exactly the signal you're testing for. +- Metrics already winsorized upstream (in the metric definition / data pipeline) — double-winsorization adds nothing. + +### Percentile guidance + +Default is 95 (cap top/bottom 5%). This is almost always right. Push back if the user sets `percentile < 80` — that's >20% of values being capped, which throws away too much signal. Confirm intent before launching. + +For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. 95 is the default for a reason. + +### What changes downstream + +- Variance on the affected metric drops, often substantially. Required sample size shrinks accordingly. +- The point estimate of the mean shifts toward the centre of the distribution. This is the desired behaviour; the whole point is to stop a few outliers from anchoring the estimate. +- The post-launch interpretation step reports the winsorized mean and standard error. If the team also wants to know what the un-winsorized mean did (the "did whales react?" question), they'd need a separate secondary metric without Winsorization. + +## Multiple testing correction — Bonferroni vs Benjamini-Hochberg + +Covered in detail in `references/statistical-model.md`. The short version: + +- Enable when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. +- Default to `"benjamini-hochberg"`. More powerful with correlated primaries. +- Use `"bonferroni"` when family-wise error control is required (regulatory, etc.) or when the primaries are independent. +- Set `"off"` only with a single primary and a single non-control variant. + +## Decision flowchart + +``` +Primary metric is Bernoulli (conversion rate)? +├── Yes → Winsorization OFF. +│ Does it correlate with pre-exposure behavior of existing users? +│ ├── Yes → CUPED ON (if 2-4 week pre-exposure window available, no new-user cohort) +│ └── No → CUPED OFF +└── No (continuous / count / retention) + Heavy-tailed distribution with outliers (revenue, time-on-page, session length)? + ├── Yes → Winsorization ON (default percentile = 95) + └── No → Winsorization OFF + Does it correlate with pre-exposure behavior of existing users? + ├── Yes → CUPED ON (if 2-4 week pre-exposure window available, no new-user cohort) + └── No → CUPED OFF + +Primary count ≥ 2 OR non-control variants ≥ 2? +├── Yes → Multiple testing correction ON ("benjamini-hochberg" default; "bonferroni" for strict family-wise control) +└── No → Multiple testing correction OFF +``` + +## Common misconfigurations + +- ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. +- ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. +- ⛔ **Winsorization at percentile < 80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent. +- ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. +- ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-setup/references/hypothesis-framing.md b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/hypothesis-framing.md new file mode 100644 index 0000000..57bbd73 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/hypothesis-framing.md @@ -0,0 +1,101 @@ +# Hypothesis framing + +A good experiment hypothesis is a **falsifiable, directional claim with a stated mechanism, bounded in time**. All four properties matter — drop any one and the design downstream silently degrades. + +## The shape + +> **If** ``, **then** `` will ``, **because** ``. + +| Property | Test | Failure mode | +| ------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| **Falsifiable** | Could the data say "no"? | "Improving UX" can't be falsified. "Increasing weekly retention by ≥2pp" can. | +| **Directional** | Is the predicted change up or down? | "Affecting cart size" leaves the polarity ambiguous; the system defaults to `direction: "up"` and the interpretation step misreads regressions as wins. | +| **Mechanistic** | What's the proposed causal chain? | "Because users will see X and decide Y" is a mechanism. "We think it'll work" is not. Without a mechanism, the team can't tell when the metric they picked is actually downstream of the change. | +| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric guarantees an inconclusive result on the real effect plus a high chance of reaching false significance from noise. | + +## When the user gives you a one-liner + +Ask them to commit to five things, in order. Don't proceed until you have all five. + +1. **The change** — what's different in treatment. A specific UI string, a routing change, a price, a copy variant. Vague ("the new onboarding") is not enough; "the new onboarding which moves the free-item offer to step 1" is. +2. **The primary outcome metric** — one specific event or rate, not a domain. "Engagement" is not a metric; "weekly active users with ≥1 report created" is. +3. **The expected direction** — up or down. (Goes straight into the metric's `direction` field.) +4. **The minimum effect size that would justify shipping** — this becomes the MDE. If the user can't name one, ask: "If the lift turned out to be 0.5%, would you ship?" Their answer reveals the MDE. +5. **The mechanism** — why you expect this to work. The mechanism is what binds the metric to the change. A change to onboarding screens shouldn't be measured by Day-30 retention if no one has gotten to Day 30 yet — the mechanism would say so explicitly. + +## Mechanism → metric class + +The mechanism predicts the _kind_ of metric that should move. Use this mapping as a sanity check: + +| Mechanism flavour | Likely primary-metric class | Anti-pattern | +| -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -------------------------------------------------- | +| Reduces friction at a specific step | Step conversion rate (funnel-typed) | Headline retention metric | +| Surfaces a new option / increases discoverability | Click-through or first-use rate on the surfaced option (conversion) | Total events per user | +| Reorders information / changes salience | Time-to-task, completion rate on the salient step | Account-level revenue | +| Changes the cost of an action (price, paywall, friction) | Conversion-to-paid, refund rate, cancel rate (with `direction: "down"`) | DAU | +| Adds a new content / recommendation system | CTR on recommendations, downstream conversion | Aggregate engagement | +| Long-term retention play (referrals, loyalty) | Day-7 or Week-1 retention as leading proxy; lagging Day-30 stays a post-launch monitor, not a primary | Day-30 retention as primary on a 2-week experiment | + +When the user's mechanism and proposed metric live on different rows of this table, push back — that's the **hypothesis ↔ metric mismatch** pitfall. + +## Hypothesis ↔ metric alignment + +A hypothesis names a specific outcome. The primary metric must measure that outcome — **same population, same denominator, same timeframe**. Common misalignments: + +- Hypothesis predicts a **rate** change; primary metric is a **count** → switch to a rate metric, or use an exposure-rebalanced total. +- Hypothesis predicts effect on **paid users**; primary metric includes free users → add a cohort filter or scope the metric. +- Hypothesis predicts effect **within session**; primary metric is **per-user across sessions** → either narrow the metric or broaden the hypothesis. +- Hypothesis predicts effect **only on a new flow**; primary metric counts events that exist only in treatment → changed-denominator. The lift is artificially infinite. Pick a metric that exists for both arms. + +## When to push back + +Push back hard when: + +- The hypothesis is non-falsifiable. Until it can be tested with a yes/no answer from data, there's nothing to set up. +- The hypothesis is non-directional. The system's `direction: "up"` default is wrong for cancel / error / latency / abandon metrics; leaving it default silently flips polarity at interpretation time. +- The mechanism doesn't predict the proposed metric. Most "experiment didn't work because we measured the wrong thing" post-mortems trace back to here. +- The proposed primary is strongly lagging on the planned duration (retention as primary on a 2-week test). Suggest a leading proxy. + +When you push back, do it once with concrete language ("you said 'improve engagement' — which event do you want to move?"). If the user genuinely wants to leave the hypothesis vague, you can proceed, but log the vagueness in `description` so the post-launch step knows the test was exploratory rather than decisional. + +## Worked examples + +### ✅ Good + +> If we surface a free-item offer during onboarding step 2, then signup→activation conversion will increase by ≥3pp (currently 18%), because reducing first-action friction lowers cold-start dropout for new accounts. + +- Falsifiable: data can say "no, lift was <3pp." +- Directional: up. +- Mechanistic: first-action friction → cold-start dropout. +- Time-bounded: signup→activation is a within-session metric; readable inside any reasonable test duration. +- Mechanism predicts a conversion-class primary; signup→activation conversion fits. + +### ✅ Good (lagging hypothesis, leading proxy primary) + +> If we ship the new referral flow, then Day-30 retention will increase by ≥1.5pp, because referred users have stronger network effects. We will measure Day-7 retention as the experiment primary (historical correlation r=0.78 with Day-30) and keep Day-30 as a post-launch monitor. + +- Bounded-in-time problem is acknowledged and solved with a leading proxy. The lagging metric remains a post-launch check, not a ship gate. + +### ❌ Vague + +> Test the new onboarding. + +- No change description (which change? full redesign or one screen?). +- No outcome. +- No direction. +- No MDE. +- No mechanism. + +Coach: pull each of the five commitments out of the user before going further. + +### ❌ Non-falsifiable + +> The new dashboard will improve the user experience. + +- "Improve user experience" can't be tested. Ask: "Which specific behaviour changes if user experience is better? Engagement events per session? Time to first chart? Dashboards saved per user?" + +### ❌ Mechanism doesn't predict the metric + +> If we change the colour of the CTA button, then 30-day retention will increase by ≥2pp, because users will perceive the product as more polished. + +- Mechanism is plausible at best, but Day-30 retention is far downstream of a button-colour change. Even if the colour change does help, a 2-week experiment won't measure it. Either pick a leading proxy (click-through on the CTA) or shelf the test until you have a more credible mechanism for retention. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-setup/references/metric-selection.md b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/metric-selection.md new file mode 100644 index 0000000..8ce7ea0 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/metric-selection.md @@ -0,0 +1,75 @@ +# Metric selection + +Each metric on an experiment serves exactly one of three roles. The hypothesis tells you which. + +## Primary metrics (1–3 max) + +The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. + +- **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. +- **Explicit `direction`.** Every primary needs `direction: "up"` or `direction: "down"`. The system defaults to `"up"`, which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: + - Onboarding-screen change → activation in the first session, not Week-4 retention. + - Checkout button A/B → checkout conversion, not 30-day LTV. + - Pricing-page tweak → click-through and trial start, not annualised revenue. + - When the only metric the team cares about is lagging, use a **leading proxy** with a known historical correlation to the lagging metric. The lagging metric stays a post-launch monitor, not a ship gate. +- **Prefer rates over counts** when the hypothesis is about behaviour change. "Conversion rate" is interpretable; "total conversions" conflates per-user behaviour with cohort size. + +If the user proposes a primary, sanity-check: + +- _Is this metric downstream of the change?_ (A pricing change cannot move "tutorial completion".) +- _Does the metric exist for both control and treatment users?_ If the change creates new events that don't exist in control, lift is artificially infinite (changed-denominator). +- _Is the metric's response window shorter than the experiment's duration?_ If not, the metric is lagging — pick a leading proxy. +- _Does the metric have enough volume to detect the expected lift?_ (Cross-reference `references/sizing.md`.) + +## Guardrail metrics (0+, strongly recommended) + +Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate**, and it's the most important single rule in the pitfall catalogue. + +Standard guardrails by domain — pick at least one from the row that matches the change: + +| Change targets… | Guardrail candidates | +| ------------------------------------ | ------------------------------------------------------- | +| Performance / UI / new client code | Page load time, API latency, error rate, crash rate | +| Engagement / activation / onboarding | Weekly active users, session count, Day-7 retention | +| Revenue / monetisation / pricing | ARPU, conversion-to-paid, refund rate, cancel rate | +| Trust / safety / moderation | Complaint rate, unsubscribe rate, support-ticket volume | +| Time-to-task / search / IA | Task abandonment rate, time-to-completion | + +For every guardrail, **set `direction` explicitly**. A guardrail named "errors" with default `direction: "up"` will silently let regressions slip through interpretation as "wins." + +Same lagging-indicator rule applies: a guardrail that takes 30 days to react can't protect a 2-week experiment. If the user names retention or LTV as a guardrail on a short experiment, recommend a leading proxy (Day-1 or Day-7 retention) and demote the lagging metric to a post-launch monitor. + +## Secondary metrics (0+, diagnostic only) + +Metrics for understanding **why** the primary moved, not for the ship decision. Examples: funnel-step completions, feature sub-use rates, time-on-screen, exploratory cohort breakdowns. + +**Secondary metrics are not decisional.** Even if the user names a secondary in their hypothesis text, they cannot ship/kill on its result. If a metric matters for the decision, it must be primary or guardrail. + +> **Setup misconfiguration to flag.** If the user's hypothesis text names a metric that they then classify as secondary, ask: +> _"You mentioned `` in your hypothesis. Should this be a primary metric? Secondary metrics don't influence ship/no-ship decisions, so if it matters for the outcome, promote it."_ + +This is the `hypothesis_metric_mismatch` pitfall in pre-launch detection — see `references/pitfalls.md`. + +## Sanity checklist + +Run this before locking the metric set: + +- [ ] Each primary directly measures the hypothesis's predicted outcome. +- [ ] Each primary has explicit `direction` (no `null`). +- [ ] At least one guardrail covers the most likely failure mode of the change (perf for UI changes, retention for monetisation changes, etc.). +- [ ] Each guardrail has explicit `direction`. +- [ ] No metric whose denominator is created by the treatment itself (changed-denominator). +- [ ] No primary or guardrail is a strong lagging indicator on the planned experiment duration (use leading proxies; demote lagging metrics to post-launch monitors). +- [ ] Total primary count ≤ 3. +- [ ] If primary count ≥ 2 OR non-control variants ≥ 2, multiple-testing correction is on (`benjamini-hochberg` default, `bonferroni` for strict family-wise control). +- [ ] For each primary, baseline rate has been pulled from real data (not guessed). + +## Anti-patterns + +- ⛔ **No guardrails to "avoid noise."** Guardrails are the regression detection, not noise. Without them, a winning primary with a quietly regressing latency or refund-rate is a ship — and then a rollback two weeks later. +- ⛔ **Five primaries because "they're all important."** Past 3, the false-positive risk dominates. Pick the 1–3 the hypothesis actually predicts; demote the rest to secondaries. +- ⛔ **Primary = "total signups," metric = behaviour change.** A behaviour-change hypothesis needs a rate metric; total signups conflates per-user behaviour with the size of the cohort that entered the experiment. +- ⛔ **Guardrail with default `direction: "up"` on an error / cancel / latency metric.** Silently inverts the regression check. +- ⛔ **30-day retention as primary on a 2-week experiment.** Either the lagging metric can't move (no signal) or it moves on noise (false significance). Use a leading proxy. +- ⛔ **Primary metric only exists in treatment.** Changed denominator. Lift is meaningless. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-setup/references/pitfalls.md b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/pitfalls.md new file mode 100644 index 0000000..dc7cc18 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/pitfalls.md @@ -0,0 +1,135 @@ +# Pre-launch pitfalls + +This is the catalogue of the deterministic checks the agent runs before the user creates an experiment. **Detection logic lives in the platform's pre-launch validation capability**; this document owns the prose — the _why_ behind each check — so the agent can explain the violation in human terms rather than just nagging. + +For the source-of-truth severities, thresholds, and message templates, see `ai/engine/tools/experiments/_shared/pitfall_prose.py` in `mixpanel/analytics`. When that file changes, this document changes too. + +## Triage order + +The agent surfaces pitfalls in this order: + +1. **Blockers first.** An experiment that triggers a blocker should not launch as-is. Two pitfalls today: `underpowered_duration_insufficient` and `cohort_too_small`. Both mean the experiment literally cannot reach statistical power for the configured MDE. +2. **Warnings next.** Configuration smells that would degrade interpretability or trustworthiness. Most fall here. +3. **FYIs last.** Soft nudges; not blocking even if the user ignores them. + +Within a severity tier, surface in this order (most actionable first): data-trust risks (pre-experiment bias, variance inflation) → configuration nudges (guardrails, hypothesis alignment). + +## The >5% guardrail hard-gate + +The single most important rule in the catalogue. **A 5% relative regression on any guardrail blocks ship even if the primary wins.** + +### Why 5% + +The threshold is calibrated to be tight enough to catch real degradations of user experience, revenue, or performance, and loose enough that day-to-day noise on a moderately-volatile guardrail doesn't trip it on every test. + +- Below 5%: typically within the noise band of most guardrails on a 2-week test. Tightening below 5% would generate too many false alarms. +- Above 5%: the team has implicitly traded measurable user/revenue/performance damage for headline-metric lift. That's not a ship — that's a re-design. + +### Why "hard gate" + +Guardrails are not "things to also look at." They are the **trustworthiness backstop**. A winning primary with a regressing guardrail means the change _exchanged_ something the team agreed must not regress for the headline-metric lift. If guardrails are negotiable, they aren't guardrails. + +### Why explain it to the user + +The most common reaction to a guardrail regression is "but the primary won, can't we just ship?" The agent's job is to make the trade-off explicit: + +> "Primary metric `` won by +2.3pp, but guardrail `` regressed by 7.4%. The 5% threshold exists because guardrails are the trustworthiness backstop — a winning primary with a regressing guardrail means you've traded `` for ``, which is a design choice that needs explicit sign-off, not a ship decision." + +If the team genuinely wants to make that trade, they can disable the guardrail before launch and document the decision in `description`. Don't let them silently override; force the conversation. + +--- + +## The catalogue + +Each entry lists: kind → severity → trigger condition → why it matters → what to recommend. The message templates are in `pitfall_prose.py`; reproduced inline here for context. + +### `underpowered_duration_insufficient` — blocker + +**Trigger.** Expected exposures (`exposures_per_day × planned_days × n_arms`) are less than 50% of the per-arm sample size required to detect the configured MDE at the baseline rate. + +**Why it matters.** The experiment cannot reach statistical power for this MDE no matter how clean the rest of the config is. If launched, the most likely outcome is "inconclusive" — and a non-trivial fraction of those inconclusive results will be due to noise crossing the significance threshold rather than a real effect, the winner's-curse problem. + +**Recommendation.** Extend planned duration by roughly `(n_required − expected_exposures) / exposures_per_day` days, OR relax the MDE (only ship if the lift is bigger), OR pick a higher-volume primary metric, OR enable CUPED if pre-exposure data is available (which can cut required `n` by 30–70%). + +### `cohort_too_small` — blocker + +**Trigger.** Cohort size is smaller than `num_arms × target_sample_size`. The cohort cannot supply enough eligible users. + +**Why it matters.** Same root cause as the duration blocker, different lever. Even with infinite time, the experiment will run out of eligible users before each arm reaches the per-arm target. + +**Recommendation.** Either expand the cohort to ~`num_arms × target_sample_size` eligible users (relax filters, broaden segment, extend eligibility window), or lower the per-arm target sample size to what the cohort can actually supply (and accept the larger achievable MDE that comes with it). + +### `pre_experiment_bias_likely` — warning + +**Trigger.** Retrospective A/A is enabled, at least one continuous-ish metric (continuous, retention, or funnel) is configured, AND CUPED is off. + +**Why it matters.** Pre-experiment bias is likely on metrics with seasonality or power-user skew. Without CUPED to absorb the baseline difference, post-experiment lifts will inherit it — the team will see "treatment up 2%" when the real treatment effect is 0% and the baseline difference is +2%. + +**Recommendation.** Enable CUPED with a 2–4 week pre-exposure window. CUPED specifically handles this case: it regresses out the pre-exposure baseline difference so the post-exposure lift is the actual treatment effect. + +### `high_variance_no_winsorization` — warning + +**Trigger.** At least one continuous-ish metric is configured AND Winsorization is off. + +**Why it matters.** Outliers will inflate variance and widen confidence intervals. A handful of power users can dominate the per-arm mean, swinging the headline based on which arm those users got assigned to. + +**Recommendation.** Enable Winsorization with default percentile 95. Push back if the user sets percentile <80 (that's >20% of values capped — almost always a misconfiguration). + +### `multiple_primaries_no_bonferroni` — warning + +**Trigger.** ≥2 primary metrics configured AND multiple-testing correction is off. + +**Why it matters.** Family-wise false-positive rate compounds with each additional primary. At 3 primaries the FPR is ~14.3%; at 5 it's ~22.6% — more than one in five "wins" is noise. + +**Recommendation.** Enable multiple-testing correction. Default to Benjamini-Hochberg (more powerful with correlated metrics); use Bonferroni for strict family-wise error control. The name of this pitfall is historical — the correction need not be Bonferroni specifically. + +### `underpowered_duration_marginal` — warning + +**Trigger.** Expected exposures are between 50% and 100% of the per-arm sample size required for the configured MDE. + +**Why it matters.** Marginally underpowered. The experiment might reach significance on a true effect; it might not. Either way, the lift estimate at conclusion will be wider than expected. + +**Recommendation.** Extend duration to reach 100%+ of the required sample, or accept the higher Type-II error rate (more chance of missing a real effect). Less urgent than the `_insufficient` variant. + +### `missing_guardrails` — warning + +**Trigger.** Zero guardrail metrics configured. + +**Why it matters.** Without guardrails, there's no >5% hard-gate to block a ship on a regression to user experience, revenue, or performance. The team is implicitly trusting that the primary metric captures every relevant impact — which is rarely true. + +**Recommendation.** Add at least one guardrail covering the most likely failure mode of the change. Standard choices: + +- UI change → page-load time or error rate. +- Monetisation / pricing → cancel rate or refund rate. +- Engagement change → Day-7 retention or session count. +- Performance change → error rate or crash rate. + +### `hypothesis_metric_mismatch` — warning + +**Trigger.** The hypothesis text mentions one of the canonical metric nouns (`conversion`, `retention`, `revenue`, `signup`, `engagement`, `click`, `purchase`) but no primary metric's name appears to measure that outcome. + +**Why it matters.** Soft signal — the heuristic is coarse, but it catches the common case where the user wrote "X will increase conversion" and then set the primary to "session count" or vice versa. If the user's hypothesis is about conversion, the primary should be a conversion metric. + +**Recommendation.** Phrase as a question, not a verdict: _"Your hypothesis mentions ``, but no primary metric name suggests it measures that. Should `` be replaced or supplemented with a metric that more directly tests the hypothesis?"_ + +### `primary_lacks_leading_indicator` — warning + +**Trigger.** Primary metrics include a retention-type metric (lagging by construction) AND no leading-indicator secondary (conversion or funnel type) is configured. + +**Why it matters.** A retention primary is valid but reads slowly — there may not be enough signal to interpret results before the experiment concludes. Without a leading-indicator secondary, the agent has no early-read evidence to reason from. + +**Recommendation.** Add a leading-indicator secondary metric (a conversion or funnel metric measured within the experiment runtime). The retention primary stays as the ship decision; the secondary just gives early visibility. + +--- + +## Detection vs prose + +The detection math lives in the platform's pre-launch validation capability. The prose lives here and in `pitfall_prose.py`. The two are connected by the pitfall `kind` field — the validation step reports the kind, the agent renders the message. + +This separation lets product retune phrasing (this document, `pitfall_prose.py`) without touching the detection helpers, and vice versa. When you update a threshold (e.g., the 50% / 100% bounds on the underpowered checks), update the helper math, the `_shared/pitfall_prose.py` constant, and the recommendation in this document together — the agent will quote stale numbers if any of the three drifts. + +## What's not in the catalogue (yet) + +- **Cross-test contamination** — when the same users are eligible for multiple concurrent experiments on the same surface. Hard to detect statically; usually surfaces as anomalous variance at interpretation time. +- **Novelty effect detection** — early days of the experiment show inflated treatment effect, then settle. Not a pre-launch check; lives in the post-launch interpretation skill. +- **Seasonality misalignment** — running a 2-week experiment that doesn't align to weekly cycles. Today this is detected indirectly via the duration check; a future explicit `seasonality_misaligned` pitfall is a reasonable add. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-setup/references/prior-experiments.md b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/prior-experiments.md new file mode 100644 index 0000000..5a0beee --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/prior-experiments.md @@ -0,0 +1,81 @@ +# Prior experiments + +When a user proposes an experiment on a feature, **the first thing to do is look up prior experiments on that feature**. Skipping this leads to redundant tests, contradictory ship decisions, and wasted traffic. + +## The lookup + +Search the project's prior experiments by keywords drawn from the feature name and surface area. Cast the net wide on the first call — single-keyword searches catch related experiments the user may have forgotten about. + +If no prior-experiments lookup is available in the current environment, tell the user explicitly that you couldn't check and proceed. Don't fabricate "no prior tests found" — that's worse than admitting the blind spot. + +## What to do with what you find + +### Same feature already tested and shipped + +Reference the prior result before recommending a new test. The right answer is often "don't re-run; iterate on a new hypothesis." + +> "There's a prior experiment from [date] on the same feature with a similar hypothesis: it shipped at +X% on metric Y. Re-running won't tell us anything new. What's different about the change you're proposing? Is the new hypothesis about a different sub-population, a different metric, or a different mechanism?" + +If the user does want to re-run (e.g., the population has shifted significantly, the underlying product has changed, or the prior test was clearly underpowered), proceed — but design the new test to specifically address what's different from the prior. + +### Same feature tested and killed + +Treat this as a strong prior. Ask why the user thinks the new variant will work where the prior didn't. + +> "Prior experiment [date] on the same surface killed at [-X% / inconclusive]. What's different about your change that should produce a different outcome? If the prior failed because of [mechanism], does your change address that?" + +If the user can articulate a different mechanism, run the new test. If they can't, the most likely outcome is a repeat of the prior result — discourage the test or downgrade its priority. + +### Earlier iteration of the same hypothesis + +Use the prior result to inform the new design — specifically, **baseline rates and variance estimates**. Prior data is much more reliable than guessing. + +- Pull the prior's control-arm baseline rate; use it as the baseline for the new sizing calculation. +- Pull the prior's observed variance; use it instead of estimating from scratch. +- Pull the prior's exposure rate (exposures per day per variant); use it to set a realistic duration estimate. + +This often shrinks the required sample size or shortens the planned duration. Both are wins worth surfacing. + +### Recently concluded with similar metrics + +Pull the realised exposure rate. The "expected exposures per day" the user has in mind is usually higher than what actually shows up in a real experiment on the same surface — eligibility filters, opt-outs, and bot exclusion all bite. Use the prior's actual rate, not the theoretical one. + +### Multiple prior experiments on adjacent surfaces + +Look for **patterns**, not single data points. If three prior tests on the same funnel stage all moved in the same direction by similar magnitudes, that's the realistic prior for what the new test will do. If the prior tests are noisy or contradictory, treat the new test's expected lift with more uncertainty and consider running it longer. + +## Folding prior results into the new design + +Concretely, when you have a prior result that's relevant, the setup workflow changes as follows: + +| Step | Without prior | With prior | +| -------------------------- | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | +| Step 1 — hypothesis | Coach from scratch | Anchor on the prior's hypothesis; ask what's different | +| Step 2 — metric selection | Suggest standard primaries/guardrails | Use the prior's metric set as the default; modify only with reason | +| Step 3 — sizing | Query baseline + variance over the prior window | Use the prior's observed baseline and variance | +| Step 4 — statistical model | Default to sequential / benjamini-hochberg | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | +| Pitfall check | Run the standard catalogue | Cross-reference: did the prior have an SRM problem? A guardrail regression that should be set up as primary this time? | + +## When prior tests warn you away from testing at all + +Sometimes the prior data tells you the right answer is **don't run the experiment**: + +- The metric the user wants to move has been tested 4 times on this surface in the last year, all with inconclusive or null results, all adequately powered. The hypothesis-space is likely exhausted; suggest a different mechanism or a different surface. +- The baseline rate is so low that even the prior, well-powered tests couldn't detect anything below a 30% relative lift. The new test would inherit the same constraint. Either pick a higher-volume proxy metric or accept that the change has to be very large to be detectable. +- Recent guardrail regressions on the same surface suggest the surface is unstable; running more experiments without first fixing the trust issue is wasted traffic. + +Surface these findings as recommendations, not blockers. The user might have context the prior data doesn't capture. + +## What to record about the new design's relationship to prior tests + +In the experiment's `description` field, link to the prior experiment(s) and note how the new design differs. This becomes critical at interpretation time — the post-launch step uses the prior context to calibrate its read of the new result. + +A useful template: + +``` +Prior: tested on , result: . +This experiment differs by: . +Inherited from prior: baseline rate (X%), σ², exposure rate (N/day/variant). +``` + +This is a 30-second annotation that pays back tenfold at analysis time. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-setup/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/routing-xp-vs-ff.md new file mode 100644 index 0000000..a38a72d --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/routing-xp-vs-ff.md @@ -0,0 +1,85 @@ +# XP vs FF: routing intent + +Before any setup work, decide whether the user actually wants an **experiment** (XP) or just a **feature flag** (FF). The decision is binary, but the language users use is blurry — "let's A/B test this" sometimes means "let's run a controlled experiment with a hypothesis and a stopping rule," and sometimes means "I want to ship it to 10% of users and see if anything breaks." + +## The discriminator + +| If the user wants… | Then it's a… | +| --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | +| Causal evidence — "does this change move metric X by enough to justify shipping?" | **Experiment** (XP). | +| Progressive rollout — "ship to 10%, then 50%, then 100% if nothing breaks." | **Feature flag** (FF). | +| Kill-switch — "I want to be able to turn this off instantly if it goes sideways." | **Feature flag** (FF). | +| Per-segment gating — "only show this to enterprise customers." | **Feature flag** (FF). | +| Targeted access — "give beta access to these 50 design partners." | **Feature flag** (FF). | +| Both — "ship to 10%, but also tell me if it moves checkout conversion." | **Experiment** with a phased rollout, or **FF + a separate experiment** later. | + +The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. Every experiment uses a feature flag under the hood (Mixpanel auto-creates one when you call `create_experiment`); not every feature flag use case needs an experiment. + +## Disambiguation prompt + +When you can't tell from the user's wording, ask once, plainly: + +> "Are you trying to **measure** whether this change moves a metric (experiment), or are you rolling it out gradually / behind a flag with **no measurement criterion** (feature flag)? An experiment commits to a hypothesis, metrics, and a stopping rule; a feature flag is purely a delivery mechanism." + +Listen for these signals in the answer: + +- "I want to see if it improves X" / "if checkout conversion goes up" → experiment. +- "I want to make sure it doesn't break X" → could be either. Probe: "Is 'doesn't break' a measurable threshold, like a guardrail, or is it 'I'll watch dashboards and roll back if it's obviously bad'?" +- "I want enterprise to get it first" / "I want to roll out by region" → feature flag. +- "I just want a kill switch" → feature flag. +- "I want to ship it and prove ROI later" → ask whether the proof needs to be causal. If yes, that's an experiment, and it should be set up _before_ shipping, not after. (Post-hoc ROI claims from a flag rollout are not credible.) + +## Common ambiguous cases + +### "Ship to 10% as an experiment" + +Often this means "phased rollout, monitor metrics, ramp if nothing regresses." That's a feature flag with manual ramp logic, not an experiment. + +Ask: "Do you have a primary metric you're committing to before launch, with an MDE that decides whether to ship to 100%?" If yes, run as an experiment. If no, ship as a flag. + +### "I want to test the new pricing on enterprise customers" + +If "test" means "see how they react and decide whether to roll out," and the audience is small (a few enterprise customers), that's a **rollout**, not an experiment. Enterprise samples are usually too small to power an experiment, and the per-account variance is too high for a meaningful aggregate. + +Run as a flag, gather qualitative feedback, and decide based on the conversations — not on a p-value computed from N=4. + +### "Hold out a control while we ship to 100%" + +This is the classic "holdout experiment." Legitimate use case, but it has to be set up as an experiment up front (with a primary metric and a duration), not retroactively. After-the-fact holdout analysis suffers from selection bias and is not credible. + +If the user has already shipped to 100% and wants to "analyse the effect," there is no experiment to set up. Tell them so, and suggest a forward-looking test on the next change to the same surface. + +### "Just give me an A/B test, the simplest one" + +Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through Step 1 (hypothesis) and Step 2 (metrics) of the main workflow — the cost is 10 minutes; the value is having a result you can actually act on. + +### "I want a feature flag but with stats" + +Now you're back to an experiment. Run the full setup workflow. + +## What changes once you've routed + +### If experiment + +Continue with the four-step setup workflow in the main `SKILL.md`. The output of this skill is a configured experiment ready to launch. + +### If feature flag + +This skill stops. Hand off to the user (or to a `manage-feature-flags` skill if one exists): + +- They configure variants, targeting, and rollout percentages directly. +- No hypothesis, no MDE, no stopping rule needed. +- Mixpanel doesn't compute lift or significance on a flag — they're on their own for observation. + +Make sure the user understands the trade-off explicitly: "Choosing flag means you give up the ship/no-ship decision criterion. If later you want to claim the change worked, that claim won't have the same evidentiary weight as a properly-designed experiment." + +## Don't run an experiment when + +There are cases where an experiment is technically possible but the wrong move: + +- **Sample is too small.** Enterprise rollouts to ~10 accounts cannot power a real test. Ship as a flag and use qualitative feedback. +- **Treatment is risky/irreversible.** A real billing change with potential refunds shouldn't run as a 50/50 split — phase as a flag with conservative rollout and direct monitoring. +- **No baseline data.** Brand-new metric, brand-new feature, no historical observation. Run a 1–2 week passive observation period first, then design the experiment from real numbers. +- **Hypothesis is "let's see what happens."** No directional commitment means the test will be interpreted post-hoc, which is the same as not running an experiment. + +Suggest the alternative explicitly so the user doesn't feel rejected — "this isn't an experiment-shaped problem; here's what to do instead." diff --git a/plugins/mixpanel-mcp-in/skills/experiment-setup/references/sizing.md b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/sizing.md new file mode 100644 index 0000000..8122307 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/sizing.md @@ -0,0 +1,109 @@ +# Sizing the experiment + +You almost never know the right sample size by guessing. Pull the data first, then run the math. + +## The standard formula + +Required sample size per variant (two-sample, two-sided test at 95% confidence, 80% power): + +``` +n = 16 × σ² / d² +``` + +Where: + +- `σ²` = variance of the metric (depends on metric type — see below). +- `d` = MDE in the same units as the metric. + +The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise formula in `references/statistical-model.md`. + +## Variance by metric type + +- **Bernoulli (conversion rate).** `σ² = p(1−p)` where `p` is the baseline conversion rate. Variance peaks at `p = 0.5` (variance 0.25) and shrinks toward 0 at `p = 0` or `p = 1`. Lifts are easier to detect on rates near 50%, harder near the extremes. +- **Poisson (event counts per user).** `σ² ≈ mean count per user`. High-count metrics need proportionally more sample. +- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization (`references/advanced-features.md`) cuts this. + +## Worked example + +Detecting a 5% **relative** lift on a 10% baseline conversion rate at 80% power, 95% confidence: + +``` +p = 0.10 +σ² = 0.10 × 0.90 = 0.09 +absolute MDE = 0.10 × 0.05 = 0.005 +n = 16 × 0.09 / 0.005² = 16 × 0.09 / 0.000025 = 57,600 per variant +``` + +That's ~57,600 per variant for a 5% relative lift — humbling, and surprising to most teams. Most "we'll just run it for two weeks" plans don't survive contact with this number. + +## Kohavi's inverted formula + +For most online experiments, traffic is the constraint, not patience. Pick a duration (2–4 weeks captures weekly cycles), use all available traffic in that window, then compute the **achievable MDE**: + +``` +MDE = 4σ / √n +``` + +This tells the user: "given your traffic, the smallest effect you can reliably detect is X." If that achievable MDE is larger than the lift the user actually expects, the experiment is **underpowered**. Flag immediately. + +Underpowered experiments suffer from **winner's curse**: if you do reach significance, the lift estimate is exaggerated, because only the high-variance positive realisations crossed the threshold. The post-launch result then fails to replicate, and the team learns "experiments are unreliable" rather than "this experiment was underpowered." + +## Estimating the inputs from real data + +For each primary metric, before sizing, you need three numbers: + +1. **Baseline rate** — query the metric over the prior 2–4 weeks (the longer of: one full business cycle, or four weeks). Record `mean` and `variance`. Use the same event definition, segment filters, and unit-of-analysis you'll use in the experiment — a baseline computed differently from how the metric is configured in the experiment is worse than no baseline at all. +2. **Daily traffic** — query the exposure event (or whatever event qualifies users for the experiment) over the same window, grouped by day. Average to get expected exposures per day per variant. +3. **MDE the user wants** — ask explicitly. _"What's the smallest lift that would be worth shipping?"_ If they don't know, propose a 5–10% relative lift and confirm. + +From those three: + +``` +required_sample_per_variant = 16 × σ² / (baseline × MDE_relative)² +required_days = required_sample_per_variant × n_variants / daily_traffic_per_variant +``` + +If `required_days > 28` (four weeks), the experiment is **underpowered for the requested MDE on available traffic**. Tell the user. Don't wave it through. + +## Five remediations when the experiment is underpowered + +Offer these in order of cost — cheap first. + +1. **Accept a larger MDE.** Only commit to ship if the effect is bigger. This costs nothing but redraws the success criterion; confirm the user is OK with shipping only on a larger lift. +2. **Increase traffic allocation to the experiment.** If other tests don't need the traffic, give this one more. +3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. See `references/advanced-features.md`. +4. **Pick a higher-volume primary metric** (if the hypothesis allows). Often there's a leading proxy with more volume than the lagging metric the team originally chose. +5. **Don't run the experiment.** Invest the engineering elsewhere. Sometimes the right answer. + +## Sample-size floor + +Independent of the math: never set `sampleSize` below ~350–400 per variant. Below this, the statistical machinery itself becomes unreliable — CLT breaks down, the SRM check gets noisy. The Mixpanel default of 10,000 per variant is fine for most tests; 1,000 is the practical floor; 350–400 is the absolute floor. + +If the math says `n = 50` per variant, the test is either trivially easy (the lift is huge) or the variance estimate is wrong. Sanity-check before launching at the floor. + +## Lookup table (Bernoulli, 95% conf, 80% power) + +For a Bernoulli (conversion-rate) primary metric at 95% confidence, 80% power, two-sided test, MDE expressed as a **relative** lift on the baseline: + +| Baseline rate | MDE = 5% relative | MDE = 10% relative | MDE = 20% relative | +| ------------- | ----------------- | ------------------ | ------------------ | +| 1% | ~633k / variant | ~158k / variant | ~40k / variant | +| 5% | ~122k / variant | ~31k / variant | ~7.6k / variant | +| 10% | ~58k / variant | ~14k / variant | ~3.6k / variant | +| 25% | ~19k / variant | ~4.8k / variant | ~1.2k / variant | +| 50% | ~6.4k / variant | ~1.6k / variant | ~400 / variant | + +Use this for quick sanity-checking. Always confirm with a query against actual baseline data — these are illustrative. + +## Sample-size growth with variants + +For a multi-arm test (N non-control variants), the per-variant target grows with the number of pairwise comparisons being made (each treatment vs control). With multiple-testing correction enabled (which is the right default at 2+ variants), the per-test α tightens, which inflates required sample size further. + +Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method — see `references/advanced-features.md`. + +## Duration considerations + +- **Minimum 1 week** — anything shorter misses weekly seasonality and conflates the day-of-week mix between control and treatment if traffic differs across days. +- **Minimum 3 days for read-out** — even with sequential testing and big effects, results under 3 days are typically un-interpretable (cohort hasn't stabilised, day-of-week effects dominate, novelty effect not separated from treatment effect). +- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, set `endCondition: "days"` and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. +- **Cap at ~6 weeks** for most tests — beyond this, novelty effects wear off, the user population drifts, and other experiments running in the same window create cross-test contamination. If the math says you need 8+ weeks, you're underpowered — pick a remediation from the list above. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-setup/references/statistical-model.md b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/statistical-model.md new file mode 100644 index 0000000..6158008 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-setup/references/statistical-model.md @@ -0,0 +1,100 @@ +# Statistical model + +Once required sample size and acceptable duration are known, two settings are left: `settings.testingModel` and `settings.endCondition`. Plus the two adjacent settings that change how those tests are interpreted: `settings.confidenceLevel` and `settings.multipleTestingCorrection`. + +## Testing model: sequential vs frequentist + +**Default to `sequential`** for most users. Peeking is the most common Mixpanel customer mistake, and sequential testing makes early-look safe by design. + +### Pick `sequential` when + +- The user expects a **large lift** and wants to confirm or reject the hypothesis quickly. Sequential lets you stop the moment significance is reached — often days or weeks before a frequentist target. +- The user wants to check results before the experiment ends and act on them (early-stop on a clear winner). +- The expected effect size is uncertain (could be huge, could be tiny). Sequential adapts; frequentist needs you to commit to one MDE up front. +- The team will look at intermediate results regardless. Sequential prevents peeking from inflating false positives. +- The user is comfortable with slightly more complex stopping rules ("stop when the test-statistic crosses the boundary," not "stop when n reaches N"). + +### Pick `frequentist` when + +- The user is hunting for a **very small lift** (e.g. 1–2% relative on a high-volume metric). Frequentist's fixed-sample design is statistically more efficient at the margin and avoids the early-stop boundary inflation that costs power on tiny effects. +- The team is comfortable waiting for the full sample before checking results — no peeking. +- The team prefers wider industry familiarity ("we used a t-test"). +- The user wants the simplest reportable statistics (a single p-value and confidence interval at the end). +- The team has internal training / tooling that assumes frequentist. + +### The "I want to peek with frequentist" trap + +The most common request is "I want frequentist, but I also want to look at the results during the test." This inflates the false-positive rate enormously — naive peeking on a frequentist test at 5 evenly-spaced check-ins pushes the family-wise α from 5% to ~14%. + +Switch them to sequential. Sequential's whole point is making peeking safe. + +If the user insists on frequentist + peeking (some teams do, for tooling reasons), document the decision in `description` so the interpretation step later knows the reported p-values overstate confidence. + +## End condition: sample_size vs days + +### Pick `sample_size` when + +- The team has a target MDE and wants the experiment to stop the moment the required sample is reached. Adaptive duration. +- Daily traffic is highly variable. Sample-size-based ends absorb the variability; date-based ends don't. +- There's no strong seasonality in the primary metric that would bias a mid-cycle stop. + +### Pick `days` when + +- The primary metric has **strong weekly (or other periodic) seasonality**. Pin the duration to a multiple of the seasonal cycle so each variant sees the same mix of high- and low-traffic periods. + - A common pattern: customers with strong weekday/weekend behaviour shifts run all experiments in 1-week increments (or 2 weeks for a stricter check) to fully capture each cycle. + - A `sample_size` end can fire mid-cycle and produce biased results in this case. +- The team has a fixed business window (e.g. "we want to ship by end of quarter"). +- The team has historically struggled with experiments running indefinitely. +- The hypothesis specifically requires a calendar window (e.g. a holiday-season test). + +### Combinations + +All four combinations are valid. The one customers most often miss is **`frequentist + days`** — some teams prefer time-based experiments for operational reasons even when running frequentist tests. Don't flag this as a misconfiguration. + +The one that's actually wrong is **`frequentist + sample_size + peeking`** — that's the "peeking trap" above. Surface it; switch them to sequential. + +## Confidence level + +Default `settings.confidenceLevel: 0.95` (α = 0.05). Change only with intent. + +- **`0.99`** — for high-stakes irreversible ships (e.g. billing changes, deletion-flow changes, anything regulatory). Higher false-negative cost; accept it. Document the reason in `description`. +- **`0.90`** — for low-stakes exploratory tests where speed matters more than rigour. Acknowledge the inflated false-positive rate to the user explicitly: at α = 0.10, one in ten "wins" is noise. + +Any change away from 0.95 belongs in `description`. The post-launch interpretation step uses this field to read the result correctly; without it, a "win" at 0.90 looks the same as a "win" at 0.95. + +## Multiple testing correction + +Enable when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. Without correction, the family-wise false-positive rate compounds: + +| Primaries | Non-control variants | Family-wise FPR at per-test α = 0.05 | +| --------: | -------------------: | -----------------------------------: | +| 1 | 1 | 5.0% | +| 2 | 1 | ~9.75% | +| 3 | 1 | ~14.3% | +| 5 | 1 | ~22.6% | +| 5 | 2 | ~40.1% | +| 5 | 3 | ~53.7% | + +The takeaway: by the time you're testing 5 primaries on a 3-arm experiment, more than half of the "wins" are noise. + +Two methods are available: + +- **`"bonferroni"`** — divides α by the number of tests (`n_primary × n_non_control_variants`). Simple and conservative. Guarantees the family-wise error rate stays below α, but can be overly strict when many primary metrics are correlated, hurting power. +- **`"benjamini-hochberg"`** — controls the **false discovery rate** (FDR) instead of the family-wise error rate. Ranks all primary-metric p-values and applies progressively looser thresholds. More powerful than Bonferroni when there are many primary metrics, especially when some have real effects. Preferred when the user has 3+ primaries or correlated metrics. + +**Default to `"benjamini-hochberg"`** for most experiments — less conservative, better suited to typical designs with correlated metrics. Use `"bonferroni"` when: + +- The user needs strict family-wise error control (regulatory, high-stakes decisions where any single false positive is unacceptable). +- The primary metrics are independent (no shared drivers / overlapping populations), in which case Bonferroni's conservatism is not a real cost. +- The team explicitly asks for the simplest method to defend in a review. + +Set `settings.multipleTestingCorrection: "off"` **only** when there's a single primary and a single non-control variant. + +## Power vs significance trade-off + +When the user pushes you on `confidenceLevel`: + +- Raising α from 0.05 to 0.10 increases power (smaller required sample for the same MDE) but doubles the rate of false-positive "wins." +- Lowering α from 0.05 to 0.01 cuts the false-positive rate fivefold but requires roughly 1.5× the sample for the same MDE. + +If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED (`references/advanced-features.md`) or a higher-volume proxy metric. diff --git a/plugins/mixpanel-mcp/skills/experiment-setup/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-setup/SKILL.md new file mode 100644 index 0000000..7de583c --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-setup/SKILL.md @@ -0,0 +1,151 @@ +--- +name: experiment-setup +description: "Coach an experimenter through designing a Mixpanel experiment before launch — hypothesis framing, metric roles, statistical model, sizing, advanced features (CUPED / Winsorization / Bonferroni), and pitfall avoidance. Use when the user wants to set up, configure, design, plan, or sanity-check a new A/B test, feature-flag experiment, or growth experiment. Also trigger on phrasings like 'help me set up an experiment', 'design an A/B test', 'should this be sequential or fixed', 'what MDE can I detect', 'how long should this run', 'is my experiment configured correctly', 'pre-launch checklist', 'should I use CUPED / Winsorization / Bonferroni', 'is this an experiment or just a feature flag', or when the user names a specific feature they want to test. Do NOT use for post-launch results analysis ('how did experiment X do?', 'should we ship?', 'why is SRM failing?') — that belongs to the `experiment-results` skill. Do NOT use for plain feature-flag rollouts with no measurement criterion — that belongs to the `feature-flags` skill." +license: Apache-2.0 +--- + +# Experiment Setup + +Coach the user through designing a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, and the sample size + traffic dictate duration and testing model. Reach into `references/` only when a step needs depth. + +## Requirements + +- Access to Mixpanel (event schema, run queries, create experiments and feature flags). +- Access to a prior-experiments lookup when one is available — the skill works without it, but degrades gracefully and tells the user what it skipped. + +## When to use this skill + +Trigger on any of: + +- "Set up / design / configure / plan an experiment on ``." +- "Help me write a hypothesis." +- "What MDE can I detect with my current traffic?" +- "Should this be sequential or fixed-horizon?" +- "Should I enable CUPED / Winsorization / Bonferroni?" +- "How long should this experiment run?" +- "Is this an experiment or should I just ship a feature flag?" +- "Sanity-check / pre-launch / pitfall-check this experiment configuration." + +Do **not** trigger for post-launch analysis ("how did experiment X do?") — that's the `experiment-results` skill. + +--- + +## Pre-flight: route and check for prior work + +**Route XP vs FF before designing.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill-switch, or per-segment gating with no decision criterion → feature flag (route to the `feature-flags` skill). If ambiguous, ask once: "Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?" Deeper disambiguation in [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md). + +**Always search for prior experiments on the same feature first** (by keyword from the feature name, when the lookup is available). Surface anything you find — re-running settled questions wastes traffic, and prior baseline/variance numbers sharpen the new MDE. See [references/prior-experiments.md](references/prior-experiments.md) for the fold-in playbook. + +--- + +## Workflow: 4 steps + +Run in order. Each step's output is the next step's input. + +### Step 1 — Write the hypothesis + +A good hypothesis is a **falsifiable directional claim with a stated mechanism**: + +> **If** ``, **then** `` will ``, **because** ``. + +If vague, hold the user to four commitments: the change, the primary metric, the direction, and the smallest effect worth shipping (the MDE). The "because" forces them to check whether the metric they picked is actually downstream of the change. Deeper rubric and misalignment patterns in [references/hypothesis-framing.md](references/hypothesis-framing.md). + +### Step 2 — Pick metrics that test the hypothesis + +Each metric serves one role: + +- **Primary (1–3 max).** Decides ship/no-ship. Comes from the hypothesis's outcome clause. Each additional primary inflates the false-positive rate. +- **Guardrail (0+, strongly recommended).** Must not regress. A >5% relative regression on any guardrail blocks ship even if the primary wins. +- **Secondary (0+).** Diagnostic only. Never decisional. + +Every primary and guardrail needs an explicit `direction` (`"up"` or `"down"`). The default `"up"` is wrong for cancel / error / latency / abandon metrics — leaving it default silently flips polarity at interpretation. Watch for **lagging-metric / window mismatch** (30-day retention as primary on a 2-week experiment) and the **changed-denominator** trap (metric defined only over treatment-exposed users). Full sanity checklist in [references/metric-selection.md](references/metric-selection.md). + +### Step 3 — Size the experiment with historical data + +Pull baseline rate, variance, and daily traffic from Mixpanel; don't guess. The standard formula (two-sample, two-sided, 95% confidence, 80% power): + +``` +n = 16 × σ² / d² (per variant; Bernoulli σ² = p(1−p)) +``` + +Inverted for traffic-bound teams — the smallest detectable effect at your traffic: + +``` +MDE = 4σ / √n +``` + +If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface this immediately (winner's curse, etc.). Sample-size floor: never below ~350–400 per variant (CLT breaks down, SRM check gets noisy). Worked examples, baseline-lookup table, and the five remediations for underpowered experiments are in [references/sizing.md](references/sizing.md). + +### Step 4 — Pick testing model + end condition + +**Default to `sequential`** for most users. Peeking is the most common customer mistake; sequential makes early-look safe. Override to `frequentist` for small-lift hunts on well-sized experiments, or when the team needs t-test familiarity. + +**End condition.** `sample_size` when daily traffic is variable; `days` when the primary metric has strong weekly seasonality. Frequentist + days is supported — don't flag it. + +**Confidence level.** Default 0.95. Bump to 0.99 only for irreversible high-stakes ships; drop to 0.90 only for exploratory low-stakes tests (and tell the user the family-wise FPR is inflated). + +**Multiple-testing correction.** Auto-needed when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. Default to `"benjamini-hochberg"`; use `"bonferroni"` for strict family-wise control (regulatory). Without correction the family-wise FPR climbs fast — 5 primaries × 3 variants → ~54%. + +Full decision tree and worked numbers in [references/statistical-model.md](references/statistical-model.md). + +--- + +## Advanced features (rationale: [references/advanced-features.md](references/advanced-features.md)) + +- **CUPED** — variance reduction. Enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history exists. Do not enable on new-user-only experiments or one-time-event metrics. +- **Winsorization** — caps extreme values. Enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli metrics. Push back if `percentile < 80`. + +## Pre-launch pitfall check + +Before the user creates the experiment, run the pitfall catalogue in [references/pitfalls.md](references/pitfalls.md). Surface only what fires on the current config; order blockers → warnings → fyi. + +Two **blockers** that should stop launch: + +- `underpowered_duration_insufficient` — expected exposures < 50% of per-arm sample size for the configured MDE. +- `cohort_too_small` — eligible cohort < `num_arms × target_sample_size`. + +The **>5% guardrail hard-gate** rationale: a 5% relative regression on any guardrail blocks ship even if the primary wins. A winning primary with a regressing guardrail trades headline lift for damage to something the team explicitly said must not regress — not a ship. + +--- + +## Output + +Present a compact summary the user confirms before you create the experiment: + +``` +*Experiment Setup Summary* + +• *Hypothesis:* If , then will by ≥, because . +• *Primary metrics:* (direction: up/down), … +• *Guardrails:* (direction: …), … +• *Variants:* control 50% / treatment 50% (or as configured) +• *Statistical model:* sequential | frequentist +• *End condition:* sample_size (per-arm ) | days ( days) +• *Confidence level:* 0.95 +• *Multiple testing correction:* benjamini-hochberg | bonferroni | off +• *Advanced features:* CUPED on/off · Winsorization on/off (percentile

) +• *Expected duration on current traffic:* days +• *Achievable MDE on current traffic:* % relative + +*Pitfall check:* +✅ Underpowered duration — adequate +✅ Cohort size — adequate +⚠️ +``` + +Wait for explicit confirmation before creating the experiment. + +## Writing style + +- Lead with the hypothesis. Every other decision flows from it. +- Use concrete numbers from real data ("baseline 4.2%, σ² = 0.040, required n ≈ 6,400/arm"), not vague guidance. +- Quote the user's MDE and metric names back so they catch typos. +- When underpowered, say so plainly and list remediations in order of cost. +- Don't moralise about peeking — switch them to sequential. +- Guardrail regressions are hard gates, not "slight concerns." + +## Related skills + +- `experiment-results` — post-launch analysis. Use after the experiment ships. +- `feature-flags` — pure rollout / kill-switch / gating without measurement. +- `create-dashboard` — live monitoring of primary + guardrail metrics during the experiment. diff --git a/plugins/mixpanel-mcp/skills/experiment-setup/references/advanced-features.md b/plugins/mixpanel-mcp/skills/experiment-setup/references/advanced-features.md new file mode 100644 index 0000000..6c9ef41 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-setup/references/advanced-features.md @@ -0,0 +1,103 @@ +# Advanced features + +Three optional settings most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong. + +## CUPED — variance reduction + +**What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample. + +**Setting:** `settings.cuped.enabled = true`, with `settings.cuped.preExposureDatePreset` choosing the pre-exposure window. + +### When to enable + +- The primary metric correlates with users' pre-exposure behaviour on the same metric. Strong correlations: revenue, engagement (events per user), retention, time-on-platform. Weak correlations: anything one-time or onboarding-specific. +- **All experiment users existed before the experiment start** — i.e., not a new-user-only cohort. CUPED needs a pre-exposure observation period; new users don't have one. +- A 2–4 week pre-exposure window is available with stable behaviour. If the metric was launched 5 days ago, CUPED has nothing to read. + +### When NOT to enable + +- New-user-only experiments. No pre-exposure data exists. CUPED gives zero variance reduction and adds noise. +- Brand-new metrics without historical data. +- Metrics where pre-exposure behaviour is not predictive of post-exposure (e.g., one-time onboarding events: the user either did or didn't complete onboarding once; pre-exposure has nothing to say about it). +- Pre-exposure window short enough that the behaviour you'd "control for" is itself a transient spike (e.g., metric just had a viral moment last week). + +### Pre-exposure window presets + +- `"2-weeks"` — fast-moving metrics with no strong weekly seasonality. +- `"4-weeks"` — most metrics with weekly seasonality (default sweet spot). +- `"60-days"` — deeply seasonal metrics like spend. +- `"90-days"` — long-cycle metrics (renewal-driven revenue, etc.). + +### What changes downstream + +- Required sample size shrinks by the variance-reduction factor. A 50% variance reduction on a primary that needed 60k per arm shrinks the target to ~30k per arm. +- The point estimate of the lift is unchanged. CUPED is a variance-reduction technique, not a bias correction; the headline lift is the same, the confidence interval is narrower. +- The post-launch interpretation step needs to know CUPED was on, because the standard error formula differs. The setting is persisted on the experiment object; the interpretation step reads it automatically. + +## Winsorization — outlier handling + +**What it does.** Caps extreme values at a percentile boundary (default 95th, i.e. cap the top 5% and bottom 5% at the 95th and 5th percentile values respectively). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. + +**Setting:** `settings.winsorization.enabled = true`, with `settings.winsorization.percentile` choosing the cap point. + +### When to enable + +- Revenue or spend metrics with whales (one customer spends 100× the median; that customer assigned to treatment is enough to swing the headline). +- Time-on-page or session-duration metrics with users who fall asleep on the page (one session at 8 hours dwarfs 10,000 sessions at 30 seconds). +- Any Gaussian-distributed metric with a heavy right tail (count metrics, event volume per user, page view counts). + +### When NOT to enable + +- Bernoulli (conversion) metrics. Capping a 0/1 outcome is meaningless; the 95th percentile of a 0/1 distribution is also 0 or 1. +- Metrics where the tail behaviour **is** the hypothesis. If the test is "did this change move whale spending?", Winsorization throws away exactly the signal you're testing for. +- Metrics already winsorized upstream (in the metric definition / data pipeline) — double-winsorization adds nothing. + +### Percentile guidance + +Default is 95 (cap top/bottom 5%). This is almost always right. Push back if the user sets `percentile < 80` — that's >20% of values being capped, which throws away too much signal. Confirm intent before launching. + +For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. 95 is the default for a reason. + +### What changes downstream + +- Variance on the affected metric drops, often substantially. Required sample size shrinks accordingly. +- The point estimate of the mean shifts toward the centre of the distribution. This is the desired behaviour; the whole point is to stop a few outliers from anchoring the estimate. +- The post-launch interpretation step reports the winsorized mean and standard error. If the team also wants to know what the un-winsorized mean did (the "did whales react?" question), they'd need a separate secondary metric without Winsorization. + +## Multiple testing correction — Bonferroni vs Benjamini-Hochberg + +Covered in detail in `references/statistical-model.md`. The short version: + +- Enable when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. +- Default to `"benjamini-hochberg"`. More powerful with correlated primaries. +- Use `"bonferroni"` when family-wise error control is required (regulatory, etc.) or when the primaries are independent. +- Set `"off"` only with a single primary and a single non-control variant. + +## Decision flowchart + +``` +Primary metric is Bernoulli (conversion rate)? +├── Yes → Winsorization OFF. +│ Does it correlate with pre-exposure behavior of existing users? +│ ├── Yes → CUPED ON (if 2-4 week pre-exposure window available, no new-user cohort) +│ └── No → CUPED OFF +└── No (continuous / count / retention) + Heavy-tailed distribution with outliers (revenue, time-on-page, session length)? + ├── Yes → Winsorization ON (default percentile = 95) + └── No → Winsorization OFF + Does it correlate with pre-exposure behavior of existing users? + ├── Yes → CUPED ON (if 2-4 week pre-exposure window available, no new-user cohort) + └── No → CUPED OFF + +Primary count ≥ 2 OR non-control variants ≥ 2? +├── Yes → Multiple testing correction ON ("benjamini-hochberg" default; "bonferroni" for strict family-wise control) +└── No → Multiple testing correction OFF +``` + +## Common misconfigurations + +- ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. +- ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. +- ⛔ **Winsorization at percentile < 80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent. +- ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. +- ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp/skills/experiment-setup/references/hypothesis-framing.md b/plugins/mixpanel-mcp/skills/experiment-setup/references/hypothesis-framing.md new file mode 100644 index 0000000..57bbd73 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-setup/references/hypothesis-framing.md @@ -0,0 +1,101 @@ +# Hypothesis framing + +A good experiment hypothesis is a **falsifiable, directional claim with a stated mechanism, bounded in time**. All four properties matter — drop any one and the design downstream silently degrades. + +## The shape + +> **If** ``, **then** `` will ``, **because** ``. + +| Property | Test | Failure mode | +| ------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| **Falsifiable** | Could the data say "no"? | "Improving UX" can't be falsified. "Increasing weekly retention by ≥2pp" can. | +| **Directional** | Is the predicted change up or down? | "Affecting cart size" leaves the polarity ambiguous; the system defaults to `direction: "up"` and the interpretation step misreads regressions as wins. | +| **Mechanistic** | What's the proposed causal chain? | "Because users will see X and decide Y" is a mechanism. "We think it'll work" is not. Without a mechanism, the team can't tell when the metric they picked is actually downstream of the change. | +| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric guarantees an inconclusive result on the real effect plus a high chance of reaching false significance from noise. | + +## When the user gives you a one-liner + +Ask them to commit to five things, in order. Don't proceed until you have all five. + +1. **The change** — what's different in treatment. A specific UI string, a routing change, a price, a copy variant. Vague ("the new onboarding") is not enough; "the new onboarding which moves the free-item offer to step 1" is. +2. **The primary outcome metric** — one specific event or rate, not a domain. "Engagement" is not a metric; "weekly active users with ≥1 report created" is. +3. **The expected direction** — up or down. (Goes straight into the metric's `direction` field.) +4. **The minimum effect size that would justify shipping** — this becomes the MDE. If the user can't name one, ask: "If the lift turned out to be 0.5%, would you ship?" Their answer reveals the MDE. +5. **The mechanism** — why you expect this to work. The mechanism is what binds the metric to the change. A change to onboarding screens shouldn't be measured by Day-30 retention if no one has gotten to Day 30 yet — the mechanism would say so explicitly. + +## Mechanism → metric class + +The mechanism predicts the _kind_ of metric that should move. Use this mapping as a sanity check: + +| Mechanism flavour | Likely primary-metric class | Anti-pattern | +| -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -------------------------------------------------- | +| Reduces friction at a specific step | Step conversion rate (funnel-typed) | Headline retention metric | +| Surfaces a new option / increases discoverability | Click-through or first-use rate on the surfaced option (conversion) | Total events per user | +| Reorders information / changes salience | Time-to-task, completion rate on the salient step | Account-level revenue | +| Changes the cost of an action (price, paywall, friction) | Conversion-to-paid, refund rate, cancel rate (with `direction: "down"`) | DAU | +| Adds a new content / recommendation system | CTR on recommendations, downstream conversion | Aggregate engagement | +| Long-term retention play (referrals, loyalty) | Day-7 or Week-1 retention as leading proxy; lagging Day-30 stays a post-launch monitor, not a primary | Day-30 retention as primary on a 2-week experiment | + +When the user's mechanism and proposed metric live on different rows of this table, push back — that's the **hypothesis ↔ metric mismatch** pitfall. + +## Hypothesis ↔ metric alignment + +A hypothesis names a specific outcome. The primary metric must measure that outcome — **same population, same denominator, same timeframe**. Common misalignments: + +- Hypothesis predicts a **rate** change; primary metric is a **count** → switch to a rate metric, or use an exposure-rebalanced total. +- Hypothesis predicts effect on **paid users**; primary metric includes free users → add a cohort filter or scope the metric. +- Hypothesis predicts effect **within session**; primary metric is **per-user across sessions** → either narrow the metric or broaden the hypothesis. +- Hypothesis predicts effect **only on a new flow**; primary metric counts events that exist only in treatment → changed-denominator. The lift is artificially infinite. Pick a metric that exists for both arms. + +## When to push back + +Push back hard when: + +- The hypothesis is non-falsifiable. Until it can be tested with a yes/no answer from data, there's nothing to set up. +- The hypothesis is non-directional. The system's `direction: "up"` default is wrong for cancel / error / latency / abandon metrics; leaving it default silently flips polarity at interpretation time. +- The mechanism doesn't predict the proposed metric. Most "experiment didn't work because we measured the wrong thing" post-mortems trace back to here. +- The proposed primary is strongly lagging on the planned duration (retention as primary on a 2-week test). Suggest a leading proxy. + +When you push back, do it once with concrete language ("you said 'improve engagement' — which event do you want to move?"). If the user genuinely wants to leave the hypothesis vague, you can proceed, but log the vagueness in `description` so the post-launch step knows the test was exploratory rather than decisional. + +## Worked examples + +### ✅ Good + +> If we surface a free-item offer during onboarding step 2, then signup→activation conversion will increase by ≥3pp (currently 18%), because reducing first-action friction lowers cold-start dropout for new accounts. + +- Falsifiable: data can say "no, lift was <3pp." +- Directional: up. +- Mechanistic: first-action friction → cold-start dropout. +- Time-bounded: signup→activation is a within-session metric; readable inside any reasonable test duration. +- Mechanism predicts a conversion-class primary; signup→activation conversion fits. + +### ✅ Good (lagging hypothesis, leading proxy primary) + +> If we ship the new referral flow, then Day-30 retention will increase by ≥1.5pp, because referred users have stronger network effects. We will measure Day-7 retention as the experiment primary (historical correlation r=0.78 with Day-30) and keep Day-30 as a post-launch monitor. + +- Bounded-in-time problem is acknowledged and solved with a leading proxy. The lagging metric remains a post-launch check, not a ship gate. + +### ❌ Vague + +> Test the new onboarding. + +- No change description (which change? full redesign or one screen?). +- No outcome. +- No direction. +- No MDE. +- No mechanism. + +Coach: pull each of the five commitments out of the user before going further. + +### ❌ Non-falsifiable + +> The new dashboard will improve the user experience. + +- "Improve user experience" can't be tested. Ask: "Which specific behaviour changes if user experience is better? Engagement events per session? Time to first chart? Dashboards saved per user?" + +### ❌ Mechanism doesn't predict the metric + +> If we change the colour of the CTA button, then 30-day retention will increase by ≥2pp, because users will perceive the product as more polished. + +- Mechanism is plausible at best, but Day-30 retention is far downstream of a button-colour change. Even if the colour change does help, a 2-week experiment won't measure it. Either pick a leading proxy (click-through on the CTA) or shelf the test until you have a more credible mechanism for retention. diff --git a/plugins/mixpanel-mcp/skills/experiment-setup/references/metric-selection.md b/plugins/mixpanel-mcp/skills/experiment-setup/references/metric-selection.md new file mode 100644 index 0000000..8ce7ea0 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-setup/references/metric-selection.md @@ -0,0 +1,75 @@ +# Metric selection + +Each metric on an experiment serves exactly one of three roles. The hypothesis tells you which. + +## Primary metrics (1–3 max) + +The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. + +- **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. +- **Explicit `direction`.** Every primary needs `direction: "up"` or `direction: "down"`. The system defaults to `"up"`, which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: + - Onboarding-screen change → activation in the first session, not Week-4 retention. + - Checkout button A/B → checkout conversion, not 30-day LTV. + - Pricing-page tweak → click-through and trial start, not annualised revenue. + - When the only metric the team cares about is lagging, use a **leading proxy** with a known historical correlation to the lagging metric. The lagging metric stays a post-launch monitor, not a ship gate. +- **Prefer rates over counts** when the hypothesis is about behaviour change. "Conversion rate" is interpretable; "total conversions" conflates per-user behaviour with cohort size. + +If the user proposes a primary, sanity-check: + +- _Is this metric downstream of the change?_ (A pricing change cannot move "tutorial completion".) +- _Does the metric exist for both control and treatment users?_ If the change creates new events that don't exist in control, lift is artificially infinite (changed-denominator). +- _Is the metric's response window shorter than the experiment's duration?_ If not, the metric is lagging — pick a leading proxy. +- _Does the metric have enough volume to detect the expected lift?_ (Cross-reference `references/sizing.md`.) + +## Guardrail metrics (0+, strongly recommended) + +Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate**, and it's the most important single rule in the pitfall catalogue. + +Standard guardrails by domain — pick at least one from the row that matches the change: + +| Change targets… | Guardrail candidates | +| ------------------------------------ | ------------------------------------------------------- | +| Performance / UI / new client code | Page load time, API latency, error rate, crash rate | +| Engagement / activation / onboarding | Weekly active users, session count, Day-7 retention | +| Revenue / monetisation / pricing | ARPU, conversion-to-paid, refund rate, cancel rate | +| Trust / safety / moderation | Complaint rate, unsubscribe rate, support-ticket volume | +| Time-to-task / search / IA | Task abandonment rate, time-to-completion | + +For every guardrail, **set `direction` explicitly**. A guardrail named "errors" with default `direction: "up"` will silently let regressions slip through interpretation as "wins." + +Same lagging-indicator rule applies: a guardrail that takes 30 days to react can't protect a 2-week experiment. If the user names retention or LTV as a guardrail on a short experiment, recommend a leading proxy (Day-1 or Day-7 retention) and demote the lagging metric to a post-launch monitor. + +## Secondary metrics (0+, diagnostic only) + +Metrics for understanding **why** the primary moved, not for the ship decision. Examples: funnel-step completions, feature sub-use rates, time-on-screen, exploratory cohort breakdowns. + +**Secondary metrics are not decisional.** Even if the user names a secondary in their hypothesis text, they cannot ship/kill on its result. If a metric matters for the decision, it must be primary or guardrail. + +> **Setup misconfiguration to flag.** If the user's hypothesis text names a metric that they then classify as secondary, ask: +> _"You mentioned `` in your hypothesis. Should this be a primary metric? Secondary metrics don't influence ship/no-ship decisions, so if it matters for the outcome, promote it."_ + +This is the `hypothesis_metric_mismatch` pitfall in pre-launch detection — see `references/pitfalls.md`. + +## Sanity checklist + +Run this before locking the metric set: + +- [ ] Each primary directly measures the hypothesis's predicted outcome. +- [ ] Each primary has explicit `direction` (no `null`). +- [ ] At least one guardrail covers the most likely failure mode of the change (perf for UI changes, retention for monetisation changes, etc.). +- [ ] Each guardrail has explicit `direction`. +- [ ] No metric whose denominator is created by the treatment itself (changed-denominator). +- [ ] No primary or guardrail is a strong lagging indicator on the planned experiment duration (use leading proxies; demote lagging metrics to post-launch monitors). +- [ ] Total primary count ≤ 3. +- [ ] If primary count ≥ 2 OR non-control variants ≥ 2, multiple-testing correction is on (`benjamini-hochberg` default, `bonferroni` for strict family-wise control). +- [ ] For each primary, baseline rate has been pulled from real data (not guessed). + +## Anti-patterns + +- ⛔ **No guardrails to "avoid noise."** Guardrails are the regression detection, not noise. Without them, a winning primary with a quietly regressing latency or refund-rate is a ship — and then a rollback two weeks later. +- ⛔ **Five primaries because "they're all important."** Past 3, the false-positive risk dominates. Pick the 1–3 the hypothesis actually predicts; demote the rest to secondaries. +- ⛔ **Primary = "total signups," metric = behaviour change.** A behaviour-change hypothesis needs a rate metric; total signups conflates per-user behaviour with the size of the cohort that entered the experiment. +- ⛔ **Guardrail with default `direction: "up"` on an error / cancel / latency metric.** Silently inverts the regression check. +- ⛔ **30-day retention as primary on a 2-week experiment.** Either the lagging metric can't move (no signal) or it moves on noise (false significance). Use a leading proxy. +- ⛔ **Primary metric only exists in treatment.** Changed denominator. Lift is meaningless. diff --git a/plugins/mixpanel-mcp/skills/experiment-setup/references/pitfalls.md b/plugins/mixpanel-mcp/skills/experiment-setup/references/pitfalls.md new file mode 100644 index 0000000..dc7cc18 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-setup/references/pitfalls.md @@ -0,0 +1,135 @@ +# Pre-launch pitfalls + +This is the catalogue of the deterministic checks the agent runs before the user creates an experiment. **Detection logic lives in the platform's pre-launch validation capability**; this document owns the prose — the _why_ behind each check — so the agent can explain the violation in human terms rather than just nagging. + +For the source-of-truth severities, thresholds, and message templates, see `ai/engine/tools/experiments/_shared/pitfall_prose.py` in `mixpanel/analytics`. When that file changes, this document changes too. + +## Triage order + +The agent surfaces pitfalls in this order: + +1. **Blockers first.** An experiment that triggers a blocker should not launch as-is. Two pitfalls today: `underpowered_duration_insufficient` and `cohort_too_small`. Both mean the experiment literally cannot reach statistical power for the configured MDE. +2. **Warnings next.** Configuration smells that would degrade interpretability or trustworthiness. Most fall here. +3. **FYIs last.** Soft nudges; not blocking even if the user ignores them. + +Within a severity tier, surface in this order (most actionable first): data-trust risks (pre-experiment bias, variance inflation) → configuration nudges (guardrails, hypothesis alignment). + +## The >5% guardrail hard-gate + +The single most important rule in the catalogue. **A 5% relative regression on any guardrail blocks ship even if the primary wins.** + +### Why 5% + +The threshold is calibrated to be tight enough to catch real degradations of user experience, revenue, or performance, and loose enough that day-to-day noise on a moderately-volatile guardrail doesn't trip it on every test. + +- Below 5%: typically within the noise band of most guardrails on a 2-week test. Tightening below 5% would generate too many false alarms. +- Above 5%: the team has implicitly traded measurable user/revenue/performance damage for headline-metric lift. That's not a ship — that's a re-design. + +### Why "hard gate" + +Guardrails are not "things to also look at." They are the **trustworthiness backstop**. A winning primary with a regressing guardrail means the change _exchanged_ something the team agreed must not regress for the headline-metric lift. If guardrails are negotiable, they aren't guardrails. + +### Why explain it to the user + +The most common reaction to a guardrail regression is "but the primary won, can't we just ship?" The agent's job is to make the trade-off explicit: + +> "Primary metric `` won by +2.3pp, but guardrail `` regressed by 7.4%. The 5% threshold exists because guardrails are the trustworthiness backstop — a winning primary with a regressing guardrail means you've traded `` for ``, which is a design choice that needs explicit sign-off, not a ship decision." + +If the team genuinely wants to make that trade, they can disable the guardrail before launch and document the decision in `description`. Don't let them silently override; force the conversation. + +--- + +## The catalogue + +Each entry lists: kind → severity → trigger condition → why it matters → what to recommend. The message templates are in `pitfall_prose.py`; reproduced inline here for context. + +### `underpowered_duration_insufficient` — blocker + +**Trigger.** Expected exposures (`exposures_per_day × planned_days × n_arms`) are less than 50% of the per-arm sample size required to detect the configured MDE at the baseline rate. + +**Why it matters.** The experiment cannot reach statistical power for this MDE no matter how clean the rest of the config is. If launched, the most likely outcome is "inconclusive" — and a non-trivial fraction of those inconclusive results will be due to noise crossing the significance threshold rather than a real effect, the winner's-curse problem. + +**Recommendation.** Extend planned duration by roughly `(n_required − expected_exposures) / exposures_per_day` days, OR relax the MDE (only ship if the lift is bigger), OR pick a higher-volume primary metric, OR enable CUPED if pre-exposure data is available (which can cut required `n` by 30–70%). + +### `cohort_too_small` — blocker + +**Trigger.** Cohort size is smaller than `num_arms × target_sample_size`. The cohort cannot supply enough eligible users. + +**Why it matters.** Same root cause as the duration blocker, different lever. Even with infinite time, the experiment will run out of eligible users before each arm reaches the per-arm target. + +**Recommendation.** Either expand the cohort to ~`num_arms × target_sample_size` eligible users (relax filters, broaden segment, extend eligibility window), or lower the per-arm target sample size to what the cohort can actually supply (and accept the larger achievable MDE that comes with it). + +### `pre_experiment_bias_likely` — warning + +**Trigger.** Retrospective A/A is enabled, at least one continuous-ish metric (continuous, retention, or funnel) is configured, AND CUPED is off. + +**Why it matters.** Pre-experiment bias is likely on metrics with seasonality or power-user skew. Without CUPED to absorb the baseline difference, post-experiment lifts will inherit it — the team will see "treatment up 2%" when the real treatment effect is 0% and the baseline difference is +2%. + +**Recommendation.** Enable CUPED with a 2–4 week pre-exposure window. CUPED specifically handles this case: it regresses out the pre-exposure baseline difference so the post-exposure lift is the actual treatment effect. + +### `high_variance_no_winsorization` — warning + +**Trigger.** At least one continuous-ish metric is configured AND Winsorization is off. + +**Why it matters.** Outliers will inflate variance and widen confidence intervals. A handful of power users can dominate the per-arm mean, swinging the headline based on which arm those users got assigned to. + +**Recommendation.** Enable Winsorization with default percentile 95. Push back if the user sets percentile <80 (that's >20% of values capped — almost always a misconfiguration). + +### `multiple_primaries_no_bonferroni` — warning + +**Trigger.** ≥2 primary metrics configured AND multiple-testing correction is off. + +**Why it matters.** Family-wise false-positive rate compounds with each additional primary. At 3 primaries the FPR is ~14.3%; at 5 it's ~22.6% — more than one in five "wins" is noise. + +**Recommendation.** Enable multiple-testing correction. Default to Benjamini-Hochberg (more powerful with correlated metrics); use Bonferroni for strict family-wise error control. The name of this pitfall is historical — the correction need not be Bonferroni specifically. + +### `underpowered_duration_marginal` — warning + +**Trigger.** Expected exposures are between 50% and 100% of the per-arm sample size required for the configured MDE. + +**Why it matters.** Marginally underpowered. The experiment might reach significance on a true effect; it might not. Either way, the lift estimate at conclusion will be wider than expected. + +**Recommendation.** Extend duration to reach 100%+ of the required sample, or accept the higher Type-II error rate (more chance of missing a real effect). Less urgent than the `_insufficient` variant. + +### `missing_guardrails` — warning + +**Trigger.** Zero guardrail metrics configured. + +**Why it matters.** Without guardrails, there's no >5% hard-gate to block a ship on a regression to user experience, revenue, or performance. The team is implicitly trusting that the primary metric captures every relevant impact — which is rarely true. + +**Recommendation.** Add at least one guardrail covering the most likely failure mode of the change. Standard choices: + +- UI change → page-load time or error rate. +- Monetisation / pricing → cancel rate or refund rate. +- Engagement change → Day-7 retention or session count. +- Performance change → error rate or crash rate. + +### `hypothesis_metric_mismatch` — warning + +**Trigger.** The hypothesis text mentions one of the canonical metric nouns (`conversion`, `retention`, `revenue`, `signup`, `engagement`, `click`, `purchase`) but no primary metric's name appears to measure that outcome. + +**Why it matters.** Soft signal — the heuristic is coarse, but it catches the common case where the user wrote "X will increase conversion" and then set the primary to "session count" or vice versa. If the user's hypothesis is about conversion, the primary should be a conversion metric. + +**Recommendation.** Phrase as a question, not a verdict: _"Your hypothesis mentions ``, but no primary metric name suggests it measures that. Should `` be replaced or supplemented with a metric that more directly tests the hypothesis?"_ + +### `primary_lacks_leading_indicator` — warning + +**Trigger.** Primary metrics include a retention-type metric (lagging by construction) AND no leading-indicator secondary (conversion or funnel type) is configured. + +**Why it matters.** A retention primary is valid but reads slowly — there may not be enough signal to interpret results before the experiment concludes. Without a leading-indicator secondary, the agent has no early-read evidence to reason from. + +**Recommendation.** Add a leading-indicator secondary metric (a conversion or funnel metric measured within the experiment runtime). The retention primary stays as the ship decision; the secondary just gives early visibility. + +--- + +## Detection vs prose + +The detection math lives in the platform's pre-launch validation capability. The prose lives here and in `pitfall_prose.py`. The two are connected by the pitfall `kind` field — the validation step reports the kind, the agent renders the message. + +This separation lets product retune phrasing (this document, `pitfall_prose.py`) without touching the detection helpers, and vice versa. When you update a threshold (e.g., the 50% / 100% bounds on the underpowered checks), update the helper math, the `_shared/pitfall_prose.py` constant, and the recommendation in this document together — the agent will quote stale numbers if any of the three drifts. + +## What's not in the catalogue (yet) + +- **Cross-test contamination** — when the same users are eligible for multiple concurrent experiments on the same surface. Hard to detect statically; usually surfaces as anomalous variance at interpretation time. +- **Novelty effect detection** — early days of the experiment show inflated treatment effect, then settle. Not a pre-launch check; lives in the post-launch interpretation skill. +- **Seasonality misalignment** — running a 2-week experiment that doesn't align to weekly cycles. Today this is detected indirectly via the duration check; a future explicit `seasonality_misaligned` pitfall is a reasonable add. diff --git a/plugins/mixpanel-mcp/skills/experiment-setup/references/prior-experiments.md b/plugins/mixpanel-mcp/skills/experiment-setup/references/prior-experiments.md new file mode 100644 index 0000000..5a0beee --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-setup/references/prior-experiments.md @@ -0,0 +1,81 @@ +# Prior experiments + +When a user proposes an experiment on a feature, **the first thing to do is look up prior experiments on that feature**. Skipping this leads to redundant tests, contradictory ship decisions, and wasted traffic. + +## The lookup + +Search the project's prior experiments by keywords drawn from the feature name and surface area. Cast the net wide on the first call — single-keyword searches catch related experiments the user may have forgotten about. + +If no prior-experiments lookup is available in the current environment, tell the user explicitly that you couldn't check and proceed. Don't fabricate "no prior tests found" — that's worse than admitting the blind spot. + +## What to do with what you find + +### Same feature already tested and shipped + +Reference the prior result before recommending a new test. The right answer is often "don't re-run; iterate on a new hypothesis." + +> "There's a prior experiment from [date] on the same feature with a similar hypothesis: it shipped at +X% on metric Y. Re-running won't tell us anything new. What's different about the change you're proposing? Is the new hypothesis about a different sub-population, a different metric, or a different mechanism?" + +If the user does want to re-run (e.g., the population has shifted significantly, the underlying product has changed, or the prior test was clearly underpowered), proceed — but design the new test to specifically address what's different from the prior. + +### Same feature tested and killed + +Treat this as a strong prior. Ask why the user thinks the new variant will work where the prior didn't. + +> "Prior experiment [date] on the same surface killed at [-X% / inconclusive]. What's different about your change that should produce a different outcome? If the prior failed because of [mechanism], does your change address that?" + +If the user can articulate a different mechanism, run the new test. If they can't, the most likely outcome is a repeat of the prior result — discourage the test or downgrade its priority. + +### Earlier iteration of the same hypothesis + +Use the prior result to inform the new design — specifically, **baseline rates and variance estimates**. Prior data is much more reliable than guessing. + +- Pull the prior's control-arm baseline rate; use it as the baseline for the new sizing calculation. +- Pull the prior's observed variance; use it instead of estimating from scratch. +- Pull the prior's exposure rate (exposures per day per variant); use it to set a realistic duration estimate. + +This often shrinks the required sample size or shortens the planned duration. Both are wins worth surfacing. + +### Recently concluded with similar metrics + +Pull the realised exposure rate. The "expected exposures per day" the user has in mind is usually higher than what actually shows up in a real experiment on the same surface — eligibility filters, opt-outs, and bot exclusion all bite. Use the prior's actual rate, not the theoretical one. + +### Multiple prior experiments on adjacent surfaces + +Look for **patterns**, not single data points. If three prior tests on the same funnel stage all moved in the same direction by similar magnitudes, that's the realistic prior for what the new test will do. If the prior tests are noisy or contradictory, treat the new test's expected lift with more uncertainty and consider running it longer. + +## Folding prior results into the new design + +Concretely, when you have a prior result that's relevant, the setup workflow changes as follows: + +| Step | Without prior | With prior | +| -------------------------- | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | +| Step 1 — hypothesis | Coach from scratch | Anchor on the prior's hypothesis; ask what's different | +| Step 2 — metric selection | Suggest standard primaries/guardrails | Use the prior's metric set as the default; modify only with reason | +| Step 3 — sizing | Query baseline + variance over the prior window | Use the prior's observed baseline and variance | +| Step 4 — statistical model | Default to sequential / benjamini-hochberg | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | +| Pitfall check | Run the standard catalogue | Cross-reference: did the prior have an SRM problem? A guardrail regression that should be set up as primary this time? | + +## When prior tests warn you away from testing at all + +Sometimes the prior data tells you the right answer is **don't run the experiment**: + +- The metric the user wants to move has been tested 4 times on this surface in the last year, all with inconclusive or null results, all adequately powered. The hypothesis-space is likely exhausted; suggest a different mechanism or a different surface. +- The baseline rate is so low that even the prior, well-powered tests couldn't detect anything below a 30% relative lift. The new test would inherit the same constraint. Either pick a higher-volume proxy metric or accept that the change has to be very large to be detectable. +- Recent guardrail regressions on the same surface suggest the surface is unstable; running more experiments without first fixing the trust issue is wasted traffic. + +Surface these findings as recommendations, not blockers. The user might have context the prior data doesn't capture. + +## What to record about the new design's relationship to prior tests + +In the experiment's `description` field, link to the prior experiment(s) and note how the new design differs. This becomes critical at interpretation time — the post-launch step uses the prior context to calibrate its read of the new result. + +A useful template: + +``` +Prior: tested on , result: . +This experiment differs by: . +Inherited from prior: baseline rate (X%), σ², exposure rate (N/day/variant). +``` + +This is a 30-second annotation that pays back tenfold at analysis time. diff --git a/plugins/mixpanel-mcp/skills/experiment-setup/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp/skills/experiment-setup/references/routing-xp-vs-ff.md new file mode 100644 index 0000000..a38a72d --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-setup/references/routing-xp-vs-ff.md @@ -0,0 +1,85 @@ +# XP vs FF: routing intent + +Before any setup work, decide whether the user actually wants an **experiment** (XP) or just a **feature flag** (FF). The decision is binary, but the language users use is blurry — "let's A/B test this" sometimes means "let's run a controlled experiment with a hypothesis and a stopping rule," and sometimes means "I want to ship it to 10% of users and see if anything breaks." + +## The discriminator + +| If the user wants… | Then it's a… | +| --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | +| Causal evidence — "does this change move metric X by enough to justify shipping?" | **Experiment** (XP). | +| Progressive rollout — "ship to 10%, then 50%, then 100% if nothing breaks." | **Feature flag** (FF). | +| Kill-switch — "I want to be able to turn this off instantly if it goes sideways." | **Feature flag** (FF). | +| Per-segment gating — "only show this to enterprise customers." | **Feature flag** (FF). | +| Targeted access — "give beta access to these 50 design partners." | **Feature flag** (FF). | +| Both — "ship to 10%, but also tell me if it moves checkout conversion." | **Experiment** with a phased rollout, or **FF + a separate experiment** later. | + +The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. Every experiment uses a feature flag under the hood (Mixpanel auto-creates one when you call `create_experiment`); not every feature flag use case needs an experiment. + +## Disambiguation prompt + +When you can't tell from the user's wording, ask once, plainly: + +> "Are you trying to **measure** whether this change moves a metric (experiment), or are you rolling it out gradually / behind a flag with **no measurement criterion** (feature flag)? An experiment commits to a hypothesis, metrics, and a stopping rule; a feature flag is purely a delivery mechanism." + +Listen for these signals in the answer: + +- "I want to see if it improves X" / "if checkout conversion goes up" → experiment. +- "I want to make sure it doesn't break X" → could be either. Probe: "Is 'doesn't break' a measurable threshold, like a guardrail, or is it 'I'll watch dashboards and roll back if it's obviously bad'?" +- "I want enterprise to get it first" / "I want to roll out by region" → feature flag. +- "I just want a kill switch" → feature flag. +- "I want to ship it and prove ROI later" → ask whether the proof needs to be causal. If yes, that's an experiment, and it should be set up _before_ shipping, not after. (Post-hoc ROI claims from a flag rollout are not credible.) + +## Common ambiguous cases + +### "Ship to 10% as an experiment" + +Often this means "phased rollout, monitor metrics, ramp if nothing regresses." That's a feature flag with manual ramp logic, not an experiment. + +Ask: "Do you have a primary metric you're committing to before launch, with an MDE that decides whether to ship to 100%?" If yes, run as an experiment. If no, ship as a flag. + +### "I want to test the new pricing on enterprise customers" + +If "test" means "see how they react and decide whether to roll out," and the audience is small (a few enterprise customers), that's a **rollout**, not an experiment. Enterprise samples are usually too small to power an experiment, and the per-account variance is too high for a meaningful aggregate. + +Run as a flag, gather qualitative feedback, and decide based on the conversations — not on a p-value computed from N=4. + +### "Hold out a control while we ship to 100%" + +This is the classic "holdout experiment." Legitimate use case, but it has to be set up as an experiment up front (with a primary metric and a duration), not retroactively. After-the-fact holdout analysis suffers from selection bias and is not credible. + +If the user has already shipped to 100% and wants to "analyse the effect," there is no experiment to set up. Tell them so, and suggest a forward-looking test on the next change to the same surface. + +### "Just give me an A/B test, the simplest one" + +Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through Step 1 (hypothesis) and Step 2 (metrics) of the main workflow — the cost is 10 minutes; the value is having a result you can actually act on. + +### "I want a feature flag but with stats" + +Now you're back to an experiment. Run the full setup workflow. + +## What changes once you've routed + +### If experiment + +Continue with the four-step setup workflow in the main `SKILL.md`. The output of this skill is a configured experiment ready to launch. + +### If feature flag + +This skill stops. Hand off to the user (or to a `manage-feature-flags` skill if one exists): + +- They configure variants, targeting, and rollout percentages directly. +- No hypothesis, no MDE, no stopping rule needed. +- Mixpanel doesn't compute lift or significance on a flag — they're on their own for observation. + +Make sure the user understands the trade-off explicitly: "Choosing flag means you give up the ship/no-ship decision criterion. If later you want to claim the change worked, that claim won't have the same evidentiary weight as a properly-designed experiment." + +## Don't run an experiment when + +There are cases where an experiment is technically possible but the wrong move: + +- **Sample is too small.** Enterprise rollouts to ~10 accounts cannot power a real test. Ship as a flag and use qualitative feedback. +- **Treatment is risky/irreversible.** A real billing change with potential refunds shouldn't run as a 50/50 split — phase as a flag with conservative rollout and direct monitoring. +- **No baseline data.** Brand-new metric, brand-new feature, no historical observation. Run a 1–2 week passive observation period first, then design the experiment from real numbers. +- **Hypothesis is "let's see what happens."** No directional commitment means the test will be interpreted post-hoc, which is the same as not running an experiment. + +Suggest the alternative explicitly so the user doesn't feel rejected — "this isn't an experiment-shaped problem; here's what to do instead." diff --git a/plugins/mixpanel-mcp/skills/experiment-setup/references/sizing.md b/plugins/mixpanel-mcp/skills/experiment-setup/references/sizing.md new file mode 100644 index 0000000..8122307 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-setup/references/sizing.md @@ -0,0 +1,109 @@ +# Sizing the experiment + +You almost never know the right sample size by guessing. Pull the data first, then run the math. + +## The standard formula + +Required sample size per variant (two-sample, two-sided test at 95% confidence, 80% power): + +``` +n = 16 × σ² / d² +``` + +Where: + +- `σ²` = variance of the metric (depends on metric type — see below). +- `d` = MDE in the same units as the metric. + +The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise formula in `references/statistical-model.md`. + +## Variance by metric type + +- **Bernoulli (conversion rate).** `σ² = p(1−p)` where `p` is the baseline conversion rate. Variance peaks at `p = 0.5` (variance 0.25) and shrinks toward 0 at `p = 0` or `p = 1`. Lifts are easier to detect on rates near 50%, harder near the extremes. +- **Poisson (event counts per user).** `σ² ≈ mean count per user`. High-count metrics need proportionally more sample. +- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization (`references/advanced-features.md`) cuts this. + +## Worked example + +Detecting a 5% **relative** lift on a 10% baseline conversion rate at 80% power, 95% confidence: + +``` +p = 0.10 +σ² = 0.10 × 0.90 = 0.09 +absolute MDE = 0.10 × 0.05 = 0.005 +n = 16 × 0.09 / 0.005² = 16 × 0.09 / 0.000025 = 57,600 per variant +``` + +That's ~57,600 per variant for a 5% relative lift — humbling, and surprising to most teams. Most "we'll just run it for two weeks" plans don't survive contact with this number. + +## Kohavi's inverted formula + +For most online experiments, traffic is the constraint, not patience. Pick a duration (2–4 weeks captures weekly cycles), use all available traffic in that window, then compute the **achievable MDE**: + +``` +MDE = 4σ / √n +``` + +This tells the user: "given your traffic, the smallest effect you can reliably detect is X." If that achievable MDE is larger than the lift the user actually expects, the experiment is **underpowered**. Flag immediately. + +Underpowered experiments suffer from **winner's curse**: if you do reach significance, the lift estimate is exaggerated, because only the high-variance positive realisations crossed the threshold. The post-launch result then fails to replicate, and the team learns "experiments are unreliable" rather than "this experiment was underpowered." + +## Estimating the inputs from real data + +For each primary metric, before sizing, you need three numbers: + +1. **Baseline rate** — query the metric over the prior 2–4 weeks (the longer of: one full business cycle, or four weeks). Record `mean` and `variance`. Use the same event definition, segment filters, and unit-of-analysis you'll use in the experiment — a baseline computed differently from how the metric is configured in the experiment is worse than no baseline at all. +2. **Daily traffic** — query the exposure event (or whatever event qualifies users for the experiment) over the same window, grouped by day. Average to get expected exposures per day per variant. +3. **MDE the user wants** — ask explicitly. _"What's the smallest lift that would be worth shipping?"_ If they don't know, propose a 5–10% relative lift and confirm. + +From those three: + +``` +required_sample_per_variant = 16 × σ² / (baseline × MDE_relative)² +required_days = required_sample_per_variant × n_variants / daily_traffic_per_variant +``` + +If `required_days > 28` (four weeks), the experiment is **underpowered for the requested MDE on available traffic**. Tell the user. Don't wave it through. + +## Five remediations when the experiment is underpowered + +Offer these in order of cost — cheap first. + +1. **Accept a larger MDE.** Only commit to ship if the effect is bigger. This costs nothing but redraws the success criterion; confirm the user is OK with shipping only on a larger lift. +2. **Increase traffic allocation to the experiment.** If other tests don't need the traffic, give this one more. +3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. See `references/advanced-features.md`. +4. **Pick a higher-volume primary metric** (if the hypothesis allows). Often there's a leading proxy with more volume than the lagging metric the team originally chose. +5. **Don't run the experiment.** Invest the engineering elsewhere. Sometimes the right answer. + +## Sample-size floor + +Independent of the math: never set `sampleSize` below ~350–400 per variant. Below this, the statistical machinery itself becomes unreliable — CLT breaks down, the SRM check gets noisy. The Mixpanel default of 10,000 per variant is fine for most tests; 1,000 is the practical floor; 350–400 is the absolute floor. + +If the math says `n = 50` per variant, the test is either trivially easy (the lift is huge) or the variance estimate is wrong. Sanity-check before launching at the floor. + +## Lookup table (Bernoulli, 95% conf, 80% power) + +For a Bernoulli (conversion-rate) primary metric at 95% confidence, 80% power, two-sided test, MDE expressed as a **relative** lift on the baseline: + +| Baseline rate | MDE = 5% relative | MDE = 10% relative | MDE = 20% relative | +| ------------- | ----------------- | ------------------ | ------------------ | +| 1% | ~633k / variant | ~158k / variant | ~40k / variant | +| 5% | ~122k / variant | ~31k / variant | ~7.6k / variant | +| 10% | ~58k / variant | ~14k / variant | ~3.6k / variant | +| 25% | ~19k / variant | ~4.8k / variant | ~1.2k / variant | +| 50% | ~6.4k / variant | ~1.6k / variant | ~400 / variant | + +Use this for quick sanity-checking. Always confirm with a query against actual baseline data — these are illustrative. + +## Sample-size growth with variants + +For a multi-arm test (N non-control variants), the per-variant target grows with the number of pairwise comparisons being made (each treatment vs control). With multiple-testing correction enabled (which is the right default at 2+ variants), the per-test α tightens, which inflates required sample size further. + +Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method — see `references/advanced-features.md`. + +## Duration considerations + +- **Minimum 1 week** — anything shorter misses weekly seasonality and conflates the day-of-week mix between control and treatment if traffic differs across days. +- **Minimum 3 days for read-out** — even with sequential testing and big effects, results under 3 days are typically un-interpretable (cohort hasn't stabilised, day-of-week effects dominate, novelty effect not separated from treatment effect). +- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, set `endCondition: "days"` and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. +- **Cap at ~6 weeks** for most tests — beyond this, novelty effects wear off, the user population drifts, and other experiments running in the same window create cross-test contamination. If the math says you need 8+ weeks, you're underpowered — pick a remediation from the list above. diff --git a/plugins/mixpanel-mcp/skills/experiment-setup/references/statistical-model.md b/plugins/mixpanel-mcp/skills/experiment-setup/references/statistical-model.md new file mode 100644 index 0000000..6158008 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-setup/references/statistical-model.md @@ -0,0 +1,100 @@ +# Statistical model + +Once required sample size and acceptable duration are known, two settings are left: `settings.testingModel` and `settings.endCondition`. Plus the two adjacent settings that change how those tests are interpreted: `settings.confidenceLevel` and `settings.multipleTestingCorrection`. + +## Testing model: sequential vs frequentist + +**Default to `sequential`** for most users. Peeking is the most common Mixpanel customer mistake, and sequential testing makes early-look safe by design. + +### Pick `sequential` when + +- The user expects a **large lift** and wants to confirm or reject the hypothesis quickly. Sequential lets you stop the moment significance is reached — often days or weeks before a frequentist target. +- The user wants to check results before the experiment ends and act on them (early-stop on a clear winner). +- The expected effect size is uncertain (could be huge, could be tiny). Sequential adapts; frequentist needs you to commit to one MDE up front. +- The team will look at intermediate results regardless. Sequential prevents peeking from inflating false positives. +- The user is comfortable with slightly more complex stopping rules ("stop when the test-statistic crosses the boundary," not "stop when n reaches N"). + +### Pick `frequentist` when + +- The user is hunting for a **very small lift** (e.g. 1–2% relative on a high-volume metric). Frequentist's fixed-sample design is statistically more efficient at the margin and avoids the early-stop boundary inflation that costs power on tiny effects. +- The team is comfortable waiting for the full sample before checking results — no peeking. +- The team prefers wider industry familiarity ("we used a t-test"). +- The user wants the simplest reportable statistics (a single p-value and confidence interval at the end). +- The team has internal training / tooling that assumes frequentist. + +### The "I want to peek with frequentist" trap + +The most common request is "I want frequentist, but I also want to look at the results during the test." This inflates the false-positive rate enormously — naive peeking on a frequentist test at 5 evenly-spaced check-ins pushes the family-wise α from 5% to ~14%. + +Switch them to sequential. Sequential's whole point is making peeking safe. + +If the user insists on frequentist + peeking (some teams do, for tooling reasons), document the decision in `description` so the interpretation step later knows the reported p-values overstate confidence. + +## End condition: sample_size vs days + +### Pick `sample_size` when + +- The team has a target MDE and wants the experiment to stop the moment the required sample is reached. Adaptive duration. +- Daily traffic is highly variable. Sample-size-based ends absorb the variability; date-based ends don't. +- There's no strong seasonality in the primary metric that would bias a mid-cycle stop. + +### Pick `days` when + +- The primary metric has **strong weekly (or other periodic) seasonality**. Pin the duration to a multiple of the seasonal cycle so each variant sees the same mix of high- and low-traffic periods. + - A common pattern: customers with strong weekday/weekend behaviour shifts run all experiments in 1-week increments (or 2 weeks for a stricter check) to fully capture each cycle. + - A `sample_size` end can fire mid-cycle and produce biased results in this case. +- The team has a fixed business window (e.g. "we want to ship by end of quarter"). +- The team has historically struggled with experiments running indefinitely. +- The hypothesis specifically requires a calendar window (e.g. a holiday-season test). + +### Combinations + +All four combinations are valid. The one customers most often miss is **`frequentist + days`** — some teams prefer time-based experiments for operational reasons even when running frequentist tests. Don't flag this as a misconfiguration. + +The one that's actually wrong is **`frequentist + sample_size + peeking`** — that's the "peeking trap" above. Surface it; switch them to sequential. + +## Confidence level + +Default `settings.confidenceLevel: 0.95` (α = 0.05). Change only with intent. + +- **`0.99`** — for high-stakes irreversible ships (e.g. billing changes, deletion-flow changes, anything regulatory). Higher false-negative cost; accept it. Document the reason in `description`. +- **`0.90`** — for low-stakes exploratory tests where speed matters more than rigour. Acknowledge the inflated false-positive rate to the user explicitly: at α = 0.10, one in ten "wins" is noise. + +Any change away from 0.95 belongs in `description`. The post-launch interpretation step uses this field to read the result correctly; without it, a "win" at 0.90 looks the same as a "win" at 0.95. + +## Multiple testing correction + +Enable when `len(primary_metrics) ≥ 2` OR `len(non_control_variants) ≥ 2`. Without correction, the family-wise false-positive rate compounds: + +| Primaries | Non-control variants | Family-wise FPR at per-test α = 0.05 | +| --------: | -------------------: | -----------------------------------: | +| 1 | 1 | 5.0% | +| 2 | 1 | ~9.75% | +| 3 | 1 | ~14.3% | +| 5 | 1 | ~22.6% | +| 5 | 2 | ~40.1% | +| 5 | 3 | ~53.7% | + +The takeaway: by the time you're testing 5 primaries on a 3-arm experiment, more than half of the "wins" are noise. + +Two methods are available: + +- **`"bonferroni"`** — divides α by the number of tests (`n_primary × n_non_control_variants`). Simple and conservative. Guarantees the family-wise error rate stays below α, but can be overly strict when many primary metrics are correlated, hurting power. +- **`"benjamini-hochberg"`** — controls the **false discovery rate** (FDR) instead of the family-wise error rate. Ranks all primary-metric p-values and applies progressively looser thresholds. More powerful than Bonferroni when there are many primary metrics, especially when some have real effects. Preferred when the user has 3+ primaries or correlated metrics. + +**Default to `"benjamini-hochberg"`** for most experiments — less conservative, better suited to typical designs with correlated metrics. Use `"bonferroni"` when: + +- The user needs strict family-wise error control (regulatory, high-stakes decisions where any single false positive is unacceptable). +- The primary metrics are independent (no shared drivers / overlapping populations), in which case Bonferroni's conservatism is not a real cost. +- The team explicitly asks for the simplest method to defend in a review. + +Set `settings.multipleTestingCorrection: "off"` **only** when there's a single primary and a single non-control variant. + +## Power vs significance trade-off + +When the user pushes you on `confidenceLevel`: + +- Raising α from 0.05 to 0.10 increases power (smaller required sample for the same MDE) but doubles the rate of false-positive "wins." +- Lowering α from 0.05 to 0.01 cuts the false-positive rate fivefold but requires roughly 1.5× the sample for the same MDE. + +If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED (`references/advanced-features.md`) or a higher-volume proxy metric.