⚡ Promptolis Original · Data & Analytics

🧪 A/B Test Design Rigor — Statistical Power + Proper Experimentation

A structured A/B test design framework covering sample size calculation, statistical significance, common pitfalls (peeking, multiple comparisons, Simpson's paradox), test duration, and the rigor that distinguishes real conclusions from noise-chasing.

⏱️ 2 hours per test design 🤖 ~90 seconds in Claude 🗓️ Updated 2026-04-20

Why this is epic

Most A/B tests are poorly designed — insufficient sample size, premature conclusions, no statistical rigor. Result: false positives, bad decisions, wasted experiment bandwidth. This Original produces rigorous design: hypothesis, power calculation, proper analysis.

Names the 7 A/B test failures (insufficient power, peeking, multiple comparisons, seasonality, Simpson's paradox, selection bias, outcome contamination).

Produces complete test design: hypothesis, metrics, sample size, duration, analysis plan. Based on statistical best practices.

The prompt

Promptolis Original · Copy-ready
<role>
You are an experimentation specialist with 10 years of A/B testing experience. You've designed 500+ tests at SaaS + DTC companies. You understand statistics, product, + common pitfalls. You are direct. You will name when sample size is insufficient, when peeking is happening, when multiple comparisons are not adjusted for, and when conclusions overreach.
</role>

<principles>
1. Sample size calculation FIRST.
2. Pre-registered primary metric.
3. No peeking (sequential testing if needed).
4. Minimum 7 days (capture weekly cycles).
5. Multiple comparison correction.
6. Simpson's paradox check.
7. Effect size matters.
8. Pre-registered analysis plan.
</principles>

<input>
<test-hypothesis>{specific hypothesis}</test-hypothesis>
<variants>{control + treatments}</variants>
<primary-metric>{what you're measuring}</primary-metric>
<secondary-metrics>{additional metrics}</secondary-metrics>
<baseline>{current performance on metric}</baseline>
<expected-effect>{how much improvement you expect}</expected-effect>
<traffic-available>{users/sessions per day}</traffic-available>
<statistical-requirements>{confidence + power}</statistical-requirements>
</input>

<output-format>
# A/B Test Design: [Hypothesis]
## Hypothesis + Metrics
Specific + testable.
## Sample Size Calculation
With assumptions.
## Test Duration
Time needed.
## Randomization Strategy
How users are allocated.
## Analysis Plan
Pre-registered.
## Guardrails
What must not break.
## Common Pitfalls Specific To This Test
Watch for these.
## Decision Framework
How to interpret results.
## Key Takeaways
5 bullets.
</output-format>

<auto-intake>
If input is incomplete: ask for hypothesis, variants, primary metric, secondary metrics, baseline, expected effect, traffic, requirements.
</auto-intake>

Now, design:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<test-hypothesis>Changing the 'Start Free Trial' button from blue to orange on pricing page will increase trial signup conversion.</test-hypothesis>
<variants>Control: blue button (current). Treatment: orange button.</variants>
<primary-metric>Trial signup conversion rate (from pricing page visit).</primary-metric>
<secondary-metrics>Time to trial signup (sec), scroll depth, bounce rate.</secondary-metrics>
<baseline>Current blue button converts 4.2%.</baseline>
<expected-effect>Hope to reach 5%+ conversion (a ~19% relative lift). Minimum meaningful detectable effect: 2% relative lift.</expected-effect>
<traffic-available>2,500 pricing page visits per day.</traffic-available>
<statistical-requirements>95% confidence, 80% power.</statistical-requirements>
🤖 Output

A/B Test Design: Button Color Change — 4.2% → Hopefully 5%+ Conversion

A rigorous A/B test design prevents false conclusions. Your test, a button color change on the pricing page, is a classic, but it needs proper design to avoid common pitfalls. This design covers sample size calculation, proper duration, and a pre-registered analysis plan.

Hypothesis + Metrics

Hypothesis (pre-registered):

H0 (null): Orange button conversion rate = Blue button conversion rate.

H1 (alternative): Orange button conversion rate ≠ Blue button conversion rate.

Two-tailed test (orange could be better OR worse).

Primary Metric

Trial signup conversion rate: trial signups / pricing page visits

  • Numerator: trial signups attributed to pricing page session
  • Denominator: unique visits to pricing page (deduplicated)
  • Attribution window: 24 hours from pricing page view

Secondary Metrics (tracked but not decision-making)
  • Time to trial signup from pricing page (seconds)
  • Bounce rate on pricing page
  • Scroll depth
  • Trial-to-paid conversion (longer-term)

Why only primary for decision: prevents p-hacking across many metrics. One outcome defines success.

Sample Size Calculation

Inputs:

  • Baseline conversion: 4.2%
  • Minimum detectable effect (MDE): 2% relative lift (= 4.28% vs. 4.2%)
  • Significance: α = 0.05 (95% confidence)
  • Power: 1-β = 0.80 (80% power)
  • Test type: two-proportion test, two-tailed

Calculation (via online calculator or formula):

For MDE of 2% relative lift from 4.2% baseline:

  • Required sample size per variant: ~904,000 visits
  • Total for test: ~1.8 million visits

Wait — 2% relative lift is TINY. Re-check meaningful effect:

For MDE of 10% relative lift (4.2% → 4.62%):

  • Required per variant: ~37,500 visits
  • Total: ~75,000 visits

For MDE of 20% relative lift (4.2% → 5.04%):

  • Required per variant: ~9,800 visits
  • Total: ~19,600 visits

Your 'minimum meaningful' of 2% is statistically very small — requires massive sample.

Recommended: set MDE at 20% relative lift (in line with the ~19% lift you're hoping for) = needs ~19,600 total visits. Detecting a 10% lift would take ~75,000 visits, roughly a month of traffic.
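The calculation above can be reproduced in a few lines of Python. This is a sketch using the pooled normal approximation for a two-tailed two-proportion test; a dedicated calculator may differ slightly in the last digits.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-tailed two-proportion test
    (pooled normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)              # rate under the alternative
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2                           # average rate under H1
    return ceil((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)
                / (p2 - p1) ** 2)

for mde in (0.02, 0.10, 0.20):
    n = sample_size_per_variant(0.042, mde)
    print(f"MDE {mde:.0%} relative: ~{n:,} per variant, ~{2 * n:,} total")
```

Note how the required sample grows with the inverse square of the effect size: halving the MDE quadruples the traffic you need.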

Test Duration

At 2,500 pricing page visits per day:

  • For ~19,600 visits (MDE 20%): ~8 days technical minimum
  • For ~75,000 visits (MDE 10%): ~30 days
  • For ~1.8M visits (MDE 2%): ~2 years (infeasible at this traffic)

BUT minimum duration considerations:

Weekly cyclicality: capture full week

  • B2B traffic typically higher Tue-Thu, lower weekends
  • Must include weekend-weekday patterns
  • Minimum 7 days even if sample size reached sooner

Monthly cyclicality: even 14-day tests can miss monthly patterns (e.g., end-of-month budget cycles); accept this as a known limitation

Recommended duration:

  • Minimum: 14 days (captures 2 weekly cycles + comfortably exceeds the ~19,600 visits needed for MDE 20%)
  • Maximum: 28 days (don't overthink)

If test runs 28 days + no significance: conclude no meaningful effect.

Randomization Strategy

Unit of randomization: user (not visit).

Why user, not visit:

  • Same user visiting multiple times should see same variant
  • Prevents contamination (user confusion)
  • Keeps analysis units independent (statistically cleaner)

Randomization:

  • 50/50 split (control vs. treatment)
  • Stable assignment: same user gets same variant on return
  • Use user ID (if logged in) or persistent cookie ID (if not)
  • Server-side randomization preferred over client-side (avoids flash-of-wrong-version)

Exclusion criteria:

  • Internal users (employees)
  • Bot traffic
  • Users from out-of-scope geographies, if the test is localized
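Stable user-level assignment is usually implemented by hashing a user ID. A minimal sketch (the experiment name `button-color` and the ID format are illustrative, not part of any specific library):

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministic 50/50 split: hashing user_id together with the experiment
    name gives each user a stable variant, independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Same user, same variant, on every visit:
assert assign_variant("user-42", "button-color") == assign_variant("user-42", "button-color")
```

Salting the hash with the experiment name matters: without it, the same users would land in "treatment" for every test you run.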

Analysis Plan (Pre-Registered)

Analysis timing:

  • Day 1-13: MONITOR only (check for major issues, bugs, sample imbalance)
  • Day 14+: first analysis allowed
  • Day 28: final analysis + decision

Analysis methodology:

Primary analysis:

  • Two-proportion z-test
  • Treatment vs. control conversion rates
  • 95% confidence interval on relative lift
  • P-value reporting (but not primary decision criterion)

Decision rule:

  • If p < 0.05 AND confidence interval above 0: treatment wins
  • If p < 0.05 AND confidence interval below 0: control wins
  • If p ≥ 0.05: no significant difference — keep control (the status quo is the safe default)

Secondary analysis:

  • Guardrail metrics (bounce rate, time-on-page)
  • No segment deep-dives unless pre-specified
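The primary analysis is straightforward to compute. A sketch of the two-proportion z-test with a confidence interval on the absolute lift (pooled standard error for the test statistic, unpooled for the interval):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-tailed z-test comparing conversion rates; returns the z statistic,
    p-value, and a CI on the absolute difference (treatment - control)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)        # pooled rate under H0
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)  # unpooled, for the CI
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, p_value, ((p_t - p_c) - z_crit * se, (p_t - p_c) + z_crit * se)
```

Per the decision rule: ship only when p < 0.05 and the entire interval sits above zero.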

Peeking Prevention

No daily monitoring of primary metric.

Reasons:

  • Repeated looking inflates false-positive rate
  • Sequential testing methodologies exist but require specific statistical approaches

Weekly check-in: monitor for technical issues only, not outcome.
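The inflation from daily peeking is easy to demonstrate with an A/A simulation: both arms share the same true rate, so every "significant" result is a false positive. This sketch scales the traffic down so it runs quickly; the counts are illustrative, not your real volumes.

```python
import random
from math import sqrt

def peeking_false_positive_rate(days=14, daily_n=500, p=0.042, runs=400, seed=7):
    """Simulate an A/A test with a daily peek, stopping at the first p < 0.05.
    The nominal false-positive rate is 5%; daily peeking pushes it far higher."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        conv_c = conv_t = n = 0
        for _ in range(days):
            conv_c += sum(rng.random() < p for _ in range(daily_n))
            conv_t += sum(rng.random() < p for _ in range(daily_n))
            n += daily_n
            pool = (conv_c + conv_t) / (2 * n)             # pooled rate under H0
            se = sqrt(pool * (1 - pool) * 2 / n)           # SE of the difference
            if se > 0 and abs(conv_t - conv_c) / n / se > 1.96:
                false_positives += 1                       # "significant" on an A/A test
                break
    return false_positives / runs

print(peeking_false_positive_rate())   # well above the nominal 0.05
```

Fourteen daily looks at a 5% threshold typically yield a cumulative false-positive rate around 20%, which is why the analysis plan allows looks only at day 14 and day 28.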

Guardrails (Must Not Break)

Defined BEFORE test launches:

1. Bounce rate: must not increase >5% on treatment

2. Time-on-page: must not decrease >10%

3. Subsequent trial-to-paid conversion: must not decrease >3%

4. Technical errors: must not increase

5. Brand complaints: monitored

If any guardrail violated → stop test, investigate, iterate.

Example trigger: orange button attracts attention but users then bounce (bad UX) → stop test even if trial signup higher (net-negative).

Common Pitfalls Specific To This Test

Pitfall 1: Insufficient Sample Size
  • Button color tests often have small effects
  • 2% relative lift needs massive sample
  • Solution: set realistic MDE (10%+) OR accept long duration
Pitfall 2: Seasonality
  • Pricing page visits vary by day of week
  • 14-day minimum captures patterns
Pitfall 3: Contamination
  • Users see different variants across sessions (caching issues)
  • Solution: user-level randomization + server-side rendering
Pitfall 4: Confounding Changes
  • Other changes to pricing page during test invalidate results
  • Solution: freeze pricing page during test, log any other changes
Pitfall 5: Multiple Comparison
  • Looking at too many secondary metrics
  • Solution: one primary metric, secondary metrics are context only
Pitfall 6: Segment-Level Claims
  • Treatment works for mobile but not desktop, or vice versa
  • If not pre-specified, don't make segment claims
  • Future test to explore if patterns emerge
Pitfall 7: Color Doesn't Mean Much
  • Button color often has tiny effect
  • 1-2% relative lift typical
  • Manage expectations — don't over-interpret small effects

Decision Framework

If treatment significantly wins (p<0.05, CI above 0):
  • Ship it. Full rollout.
  • Document learnings
  • Consider why color mattered (if at all meaningfully)
If control significantly wins:
  • Revert. Don't ship orange.
  • Investigate why
If no significance (most likely outcome):
  • Conclude no meaningful effect.
  • Keep control (status quo)
  • Document: 'Button color change from blue to orange had no statistically significant effect on trial signup conversion rate.'
  • Don't claim victory on p=0.08
If guardrail violated:
  • Stop test. Investigate.
  • Don't ship + don't continue test.
  • Iterate on design.
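The framework above reduces to a small pre-registered function. This is an illustrative sketch; the guardrail check is assumed to be computed separately from the guardrail metrics.

```python
def decide(p_value, ci_low, ci_high, guardrail_violated, alpha=0.05):
    """Pre-registered decision rule: guardrails trump everything, and a
    non-significant result defaults to keeping control."""
    if guardrail_violated:
        return "stop test, investigate, do not ship"
    if p_value < alpha and ci_low > 0:
        return "ship treatment"
    if p_value < alpha and ci_high < 0:
        return "keep control (treatment is worse)"
    return "keep control (no significant difference)"
```

Writing the rule down as code before launch removes the temptation to reinterpret a p = 0.08 after the fact.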

Key Takeaways

  • Sample size: for a realistic MDE of 20% relative lift, need ~19,600 total visits (~8 days of traffic + 14-day minimum for weekly cycles). If expecting <10% effect, this test may not be worthwhile (sample size requirements huge).
  • Pre-register: one primary metric (trial signup conversion), 5 guardrails, 28-day maximum duration, user-level randomization. Prevents p-hacking + post-hoc claims.
  • No peeking. Day 14 minimum + Day 28 maximum for analysis. Early peeks inflate false-positive rate significantly.
  • Button color tests typically have small effects (1-2% relative lift). Set realistic expectations. Don't over-interpret non-significance as 'need more data.'
  • Most likely outcome: no significant difference. That's OK — keep blue. Small effects aren't worth implementing. Only ship clear winners.

Common use cases

  • Product teams running feature tests
  • Marketing testing campaigns
  • Growth teams optimizing conversion
  • UX teams testing interface changes
  • Pricing experiments

Best AI model for this

Claude Opus 4 or Sonnet 4.5. A/B testing requires statistics + product + behavioral understanding. Top-tier reasoning matters.

Pro tips

  • Sample size calculation FIRST. Without adequate power, conclusions unreliable.
  • Define primary metric + guardrails. Don't p-hack across metrics.
  • No peeking. Check at pre-agreed duration, not daily.
  • Minimum duration captures weekly cyclicality (7 days minimum).
  • Multiple comparison: Bonferroni correction or sequential testing.
  • Simpson's paradox: aggregate vs. segment results differ. Investigate.
  • Effect size matters more than statistical significance.
  • Pre-register analysis. Decide variants + metrics before running.
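The Bonferroni correction mentioned above is one line: with m metrics, test each at α/m. A sketch:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Three metrics: only the first survives the adjusted 0.05/3 ≈ 0.0167 threshold.
print(bonferroni_significant([0.010, 0.030, 0.200]))
```

It is conservative (it controls the family-wise error rate at the cost of power), which is another reason to pre-register a single primary metric instead of testing many.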

Customization tips

  • Use an online sample size calculator (Evan Miller's, Optimizely's). Use more than one + confirm the answers converge.
  • For larger traffic sites, can run more tests with smaller MDE. Smaller sites: only test big hypotheses.
  • Pre-register tests in shared document. Prevents post-hoc rationalization of negative results.
  • Post-test documentation: what you tested, why, result, learning. Build institutional knowledge.
  • Don't run >3 simultaneous tests on same surface. Interaction effects confound results.

Variants

Feature Test

For product feature experiments.

Marketing Campaign

For ad/email/landing page tests.

Pricing Experiment

Higher-stakes pricing tests.

UX/Design Test

Interface + flow changes.

Frequently asked questions

How do I use the A/B Test Design Rigor — Statistical Power + Proper Experimentation prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with A/B Test Design Rigor — Statistical Power + Proper Experimentation?

Claude Opus 4 or Sonnet 4.5. A/B testing requires statistics + product + behavioral understanding. Top-tier reasoning matters.

Can I customize the A/B Test Design Rigor — Statistical Power + Proper Experimentation prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: sample size calculation first (without adequate power, conclusions are unreliable), and a pre-defined primary metric + guardrails (don't p-hack across metrics).

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals