⚡ Promptolis Original · Data & Analytics

🧪 A/B Test Design Rigor — Statistical Power + Proper Experimentation

A structured A/B test design framework covering sample size calculation, statistical significance, common pitfalls (peeking, multiple comparisons, Simpson's paradox), test duration, and the rigor that distinguishes real conclusions from noise-chasing.

⏱️ 2 hours per test design 🤖 ~90 seconds in Claude 🗓️ Updated 2026-04-20

Why this is epic

Most A/B tests are poorly designed — insufficient sample size, premature conclusions, no statistical rigor. Result: false positives, bad decisions, wasted experiment bandwidth. This Original produces rigorous design: hypothesis, power calculation, proper analysis.

Names the 7 A/B test failures (insufficient power, peeking, multiple comparisons, seasonality, Simpson's paradox, selection bias, outcome contamination).

Produces complete test design: hypothesis, metrics, sample size, duration, analysis plan. Based on statistical best practices.

The prompt

Promptolis Original · Copy-ready
<role>
You are an experimentation specialist with 10 years of A/B testing experience. You've designed 500+ tests at SaaS + DTC companies. You understand statistics, product, + common pitfalls. You are direct. You will name when sample size is insufficient, when peeking is happening, when multiple comparisons are not adjusted for, and when conclusions overreach.
</role>

<principles>
1. Sample size calculation FIRST.
2. Pre-registered primary metric.
3. No peeking (sequential testing if needed).
4. Minimum 7 days (capture weekly cycles).
5. Multiple comparison correction.
6. Simpson's paradox check.
7. Effect size matters.
8. Pre-registered analysis plan.
</principles>

<input>
<test-hypothesis>{specific hypothesis}</test-hypothesis>
<variants>{control + treatments}</variants>
<primary-metric>{what you're measuring}</primary-metric>
<secondary-metrics>{additional metrics}</secondary-metrics>
<baseline>{current performance on metric}</baseline>
<expected-effect>{how much improvement you expect}</expected-effect>
<traffic-available>{users/sessions per day}</traffic-available>
<statistical-requirements>{confidence + power}</statistical-requirements>
</input>

<output-format>
# A/B Test Design: [Hypothesis]
## Hypothesis + Metrics
Specific + testable.
## Sample Size Calculation
With assumptions.
## Test Duration
Time needed.
## Randomization Strategy
How users are allocated.
## Analysis Plan
Pre-registered.
## Guardrails
What must not break.
## Common Pitfalls Specific To This Test
Watch for these.
## Decision Framework
How to interpret results.
## Key Takeaways
5 bullets.
</output-format>

<auto-intake>
If input is incomplete: ask for hypothesis, variants, primary metric, secondary metrics, baseline, expected effect, traffic, requirements.
</auto-intake>

Now, design:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<test-hypothesis>Changing the 'Start Free Trial' button from blue to orange on pricing page will increase trial signup conversion.</test-hypothesis>
<variants>Control: blue button (current). Treatment: orange button.</variants>
<primary-metric>Trial signup conversion rate (from pricing page visit).</primary-metric>
<secondary-metrics>Time to trial signup (sec), scroll depth, bounce rate.</secondary-metrics>
<baseline>Current blue button converts 4.2%.</baseline>
<expected-effect>Hope to reach 5%+ conversion (a ~19% relative lift). Minimum meaningful detectable effect: 2% relative lift.</expected-effect>
<traffic-available>2,500 pricing page visits per day.</traffic-available>
<statistical-requirements>95% confidence, 80% power.</statistical-requirements>
🤖 Output

A/B Test Design: Button Color Change — 4.2% → Hopefully 5%+ Conversion

A rigorous A/B test design prevents false conclusions. Your test, a button color change on the pricing page, is a classic, but it needs proper design to avoid common pitfalls. This design covers sample size calculation, proper duration, and a pre-registered analysis plan.

Hypothesis + Metrics

Hypothesis (pre-registered):

H0 (null): Orange button conversion rate = Blue button conversion rate.

H1 (alternative): Orange button conversion rate ≠ Blue button conversion rate.

Two-tailed test (orange could be better OR worse).

Primary Metric

Trial signup conversion rate: trial signups / pricing page visits

  • Numerator: trial signups attributed to pricing page session
  • Denominator: unique visits to pricing page (deduplicated)
  • Attribution window: 24 hours from pricing page view

Secondary Metrics (tracked but not decision-making)
  • Time to trial signup from pricing page (seconds)
  • Bounce rate on pricing page
  • Scroll depth
  • Trial-to-paid conversion (longer-term)

Why only primary for decision: prevents p-hacking across many metrics. One outcome defines success.

Sample Size Calculation

Inputs:

  • Baseline conversion: 4.2%
  • Minimum detectable effect (MDE): 2% relative lift (= 4.28% vs. 4.2%)
  • Significance: α = 0.05 (95% confidence)
  • Power: 1-β = 0.80 (80% power)
  • Test type: two-proportion test, two-tailed

Calculation (via online calculator or formula):

For MDE of 2% relative lift from 4.2% baseline:

  • Required sample size per variant: ~904,000 visits
  • Total for test: ~1.8 million visits

Wait — 2% relative lift is TINY. Re-check meaningful effect:

For MDE of 10% relative lift (4.2% → 4.62%):

  • Required per variant: ~37,500 visits
  • Total: ~75,000 visits

For MDE of 20% relative lift (4.2% → 5.04%):

  • Required per variant: ~9,800 visits
  • Total: ~19,600 visits

Your 'minimum meaningful' of 2% is statistically very small — requires massive sample.

Recommended: set MDE at 20% relative lift (in line with the ~19% lift you're hoping for) = needs ~19,600 total visits. Detecting a 10% lift would take ~75,000 visits, roughly a month of traffic.
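The calculation above can be reproduced in a few lines of Python. This is a sketch using the pooled normal approximation for a two-tailed two-proportion test; a dedicated calculator may differ slightly in the last digits.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-tailed two-proportion test
    (pooled normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)              # rate under the alternative
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2                           # average rate under H1
    return ceil((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)
                / (p2 - p1) ** 2)

for mde in (0.02, 0.10, 0.20):
    n = sample_size_per_variant(0.042, mde)
    print(f"MDE {mde:.0%} relative: ~{n:,} per variant, ~{2 * n:,} total")
```

Note how the required sample grows with the inverse square of the effect size: halving the MDE quadruples the traffic you need.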

Test Duration

At 2,500 pricing page visits per day:

  • For ~19,600 visits (MDE 20%): ~8 days technical minimum
  • For ~75,000 visits (MDE 10%): ~30 days
  • For ~1.8M visits (MDE 2%): ~2 years (infeasible at this traffic)

BUT minimum duration considerations:

Weekly cyclicality: capture full week

  • B2B traffic typically higher Tue-Thu, lower weekends
  • Must include weekend-weekday patterns
  • Minimum 7 days even if sample size reached sooner

Monthly cyclicality: even 14-day tests can miss monthly patterns (e.g., end-of-month budget cycles); accept this as a known limitation

Recommended duration:

  • Minimum: 14 days (captures 2 weekly cycles + comfortably exceeds the ~19,600 visits needed for MDE 20%)
  • Maximum: 28 days (don't overthink)

If test runs 28 days + no significance: conclude no meaningful effect.

Randomization Strategy

Unit of randomization: user (not visit).

Why user, not visit:

  • Same user visiting multiple times should see same variant
  • Prevents contamination (user confusion)
  • Keeps analysis units independent (statistically cleaner)

Randomization:

  • 50/50 split (control vs. treatment)
  • Stable assignment: same user gets same variant on return
  • Use user ID (if logged in) or persistent cookie ID (if not)
  • Server-side randomization preferred over client-side (avoids flash-of-wrong-version)

Exclusion criteria:

  • Internal users (employees)
  • Bot traffic
  • Users from out-of-scope geographies, if the test is localized
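Stable user-level assignment is usually implemented by hashing a user ID. A minimal sketch (the experiment name `button-color` and the ID format are illustrative, not part of any specific library):

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministic 50/50 split: hashing user_id together with the experiment
    name gives each user a stable variant, independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Same user, same variant, on every visit:
assert assign_variant("user-42", "button-color") == assign_variant("user-42", "button-color")
```

Salting the hash with the experiment name matters: without it, the same users would land in "treatment" for every test you run.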

Analysis Plan (Pre-Registered)

Analysis timing:

  • Day 1-13: MONITOR only (check for major issues, bugs, sample imbalance)
  • Day 14+: first analysis allowed
  • Day 28: final analysis + decision

Analysis methodology:

Primary analysis:

  • Two-proportion z-test
  • Treatment vs. control conversion rates
  • 95% confidence interval on relative lift
  • P-value reporting (but not primary decision criterion)

Decision rule:

  • If p < 0.05 AND confidence interval above 0: treatment wins
  • If p < 0.05 AND confidence interval below 0: control wins
  • If p ≥ 0.05: no significant difference — keep control (the status quo is the safe default)

Secondary analysis:

  • Guardrail metrics (bounce rate, time-on-page)
  • No segment deep-dives unless pre-specified
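The primary analysis is straightforward to compute. A sketch of the two-proportion z-test with a confidence interval on the absolute lift (pooled standard error for the test statistic, unpooled for the interval):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-tailed z-test comparing conversion rates; returns the z statistic,
    p-value, and a CI on the absolute difference (treatment - control)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)        # pooled rate under H0
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)  # unpooled, for the CI
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, p_value, ((p_t - p_c) - z_crit * se, (p_t - p_c) + z_crit * se)
```

Per the decision rule: ship only when p < 0.05 and the entire interval sits above zero.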

Peeking Prevention

No daily monitoring of primary metric.

Reasons:

  • Repeated looking inflates false-positive rate
  • Sequential testing methodologies exist but require specific statistical approaches

Weekly check-in: monitor for technical issues only, not outcome.
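The inflation from daily peeking is easy to demonstrate with an A/A simulation: both arms share the same true rate, so every "significant" result is a false positive. This sketch scales the traffic down so it runs quickly; the counts are illustrative, not your real volumes.

```python
import random
from math import sqrt

def peeking_false_positive_rate(days=14, daily_n=500, p=0.042, runs=400, seed=7):
    """Simulate an A/A test with a daily peek, stopping at the first p < 0.05.
    The nominal false-positive rate is 5%; daily peeking pushes it far higher."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        conv_c = conv_t = n = 0
        for _ in range(days):
            conv_c += sum(rng.random() < p for _ in range(daily_n))
            conv_t += sum(rng.random() < p for _ in range(daily_n))
            n += daily_n
            pool = (conv_c + conv_t) / (2 * n)             # pooled rate under H0
            se = sqrt(pool * (1 - pool) * 2 / n)           # SE of the difference
            if se > 0 and abs(conv_t - conv_c) / n / se > 1.96:
                false_positives += 1                       # "significant" on an A/A test
                break
    return false_positives / runs

print(peeking_false_positive_rate())   # well above the nominal 0.05
```

Fourteen daily looks at a 5% threshold typically yield a cumulative false-positive rate around 20%, which is why the analysis plan allows looks only at day 14 and day 28.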

Guardrails (Must Not Break)

Defined BEFORE test launches:

1. Bounce rate: must not increase >5% on treatment

2. Time-on-page: must not decrease >10%

3. Subsequent trial-to-paid conversion: must not decrease >3%

4. Technical errors: must not increase

5. Brand complaints: monitored

If any guardrail violated → stop test, investigate, iterate.

Example trigger: orange button attracts attention but users then bounce (bad UX) → stop test even if trial signup higher (net-negative).

Common Pitfalls Specific To This Test

Pitfall 1: Insufficient Sample Size
  • Button color tests often have small effects
  • 2% relative lift needs massive sample
  • Solution: set realistic MDE (10%+) OR accept long duration
Pitfall 2: Seasonality
  • Pricing page visits vary by day of week
  • 14-day minimum captures patterns
Pitfall 3: Contamination
  • Users see different variants across sessions (caching issues)
  • Solution: user-level randomization + server-side rendering
Pitfall 4: Confounding Changes
  • Other changes to pricing page during test invalidate results
  • Solution: freeze pricing page during test, log any other changes
Pitfall 5: Multiple Comparison
  • Looking at too many secondary metrics
  • Solution: one primary metric, secondary metrics are context only
Pitfall 6: Segment-Level Claims
  • Treatment works for mobile but not desktop, or vice versa
  • If not pre-specified, don't make segment claims
  • Future test to explore if patterns emerge
Pitfall 7: Color Doesn't Mean Much
  • Button color often has tiny effect
  • 1-2% relative lift typical
  • Manage expectations — don't over-interpret small effects

Decision Framework

If treatment significantly wins (p<0.05, CI above 0):
  • Ship it. Full rollout.
  • Document learnings
  • Consider why color mattered (if at all meaningfully)
If control significantly wins:
  • Revert. Don't ship orange.
  • Investigate why
If no significance (most likely outcome):
  • Conclude no meaningful effect.
  • Keep control (status quo)
  • Document: 'Button color change from blue to orange had no statistically significant effect on trial signup conversion rate.'
  • Don't claim victory on p=0.08
If guardrail violated:
  • Stop test. Investigate.
  • Don't ship + don't continue test.
  • Iterate on design.
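The framework above reduces to a small pre-registered function. This is an illustrative sketch; the guardrail check is assumed to be computed separately from the guardrail metrics.

```python
def decide(p_value, ci_low, ci_high, guardrail_violated, alpha=0.05):
    """Pre-registered decision rule: guardrails trump everything, and a
    non-significant result defaults to keeping control."""
    if guardrail_violated:
        return "stop test, investigate, do not ship"
    if p_value < alpha and ci_low > 0:
        return "ship treatment"
    if p_value < alpha and ci_high < 0:
        return "keep control (treatment is worse)"
    return "keep control (no significant difference)"
```

Writing the rule down as code before launch removes the temptation to reinterpret a p = 0.08 after the fact.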

Key Takeaways

  • Sample size: for a realistic MDE of 20% relative lift, need ~19,600 total visits (~8 days of traffic + 14-day minimum for weekly cycles). If expecting <10% effect, this test may not be worthwhile (sample size requirements huge).
  • Pre-register: one primary metric (trial signup conversion), 5 guardrails, 28-day maximum duration, user-level randomization. Prevents p-hacking + post-hoc claims.
  • No peeking. Day 14 minimum + Day 28 maximum for analysis. Early peeks inflate false-positive rate significantly.
  • Button color tests typically have small effects (1-2% relative lift). Set realistic expectations. Don't over-interpret non-significance as 'need more data.'
  • Most likely outcome: no significant difference. That's OK — keep blue. Small effects aren't worth implementing. Only ship clear winners.

Common use cases

  • Product teams running feature tests
  • Marketing testing campaigns
  • Growth teams optimizing conversion
  • UX teams testing interface changes
  • Pricing experiments

Best AI model for this

Claude Opus 4 or Sonnet 4.5. A/B testing requires statistics + product + behavioral understanding. Top-tier reasoning matters.

Pro tips

  • Sample size calculation FIRST. Without adequate power, conclusions unreliable.
  • Define primary metric + guardrails. Don't p-hack across metrics.
  • No peeking. Check at pre-agreed duration, not daily.
  • Minimum duration captures weekly cyclicality (7 days minimum).
  • Multiple comparison: Bonferroni correction or sequential testing.
  • Simpson's paradox: aggregate vs. segment results differ. Investigate.
  • Effect size matters more than statistical significance.
  • Pre-register analysis. Decide variants + metrics before running.
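The Bonferroni correction mentioned above is one line: with m metrics, test each at α/m. A sketch:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Three metrics: only the first survives the adjusted 0.05/3 ≈ 0.0167 threshold.
print(bonferroni_significant([0.010, 0.030, 0.200]))
```

It is conservative (it controls the family-wise error rate at the cost of power), which is another reason to pre-register a single primary metric instead of testing many.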

Customization tips

  • Use an online sample size calculator (Evan Miller's, Optimizely's). Use more than one + confirm the answers converge.
  • For larger traffic sites, can run more tests with smaller MDE. Smaller sites: only test big hypotheses.
  • Pre-register tests in shared document. Prevents post-hoc rationalization of negative results.
  • Post-test documentation: what you tested, why, result, learning. Build institutional knowledge.
  • Don't run >3 simultaneous tests on same surface. Interaction effects confound results.

Variants

Feature Test

For product feature experiments.

Marketing Campaign

For ad/email/landing page tests.

Pricing Experiment

Higher-stakes pricing tests.

UX/Design Test

Interface + flow changes.

Frequently asked questions

How do I use the A/B Test Design Rigor — Statistical Power + Proper Experimentation prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with A/B Test Design Rigor — Statistical Power + Proper Experimentation?

Claude Opus 4 or Sonnet 4.5. A/B testing requires statistics + product + behavioral understanding. Top-tier reasoning matters.

Can I customize the A/B Test Design Rigor — Statistical Power + Proper Experimentation prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: sample size calculation first (without adequate power, conclusions are unreliable), and a pre-defined primary metric + guardrails (don't p-hack across metrics).

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals