
---
name: Experiment Design
description: Comprehensive guide to A/B testing, multivariate testing, statistical significance, and experiment analysis for data-driven product decisions
---

Experiment Design

Types of Experiments

1. A/B Test (Two Variants)

What: Compare two versions (A vs B)

Example:

  • Control (A): Blue "Buy Now" button
  • Treatment (B): Green "Buy Now" button

When to Use:

  • Testing single change
  • Clear hypothesis
  • Binary decision (ship or don't ship)

Pros:

  • Simple to implement
  • Easy to analyze
  • Clear winner

Cons:

  • Only tests one change
  • Can't test interactions

2. Multivariate Test (Multiple Changes)

What: Test multiple changes simultaneously

Example:

  • Variable 1: Button color (Blue, Green, Red)
  • Variable 2: Button text ("Buy Now", "Add to Cart", "Get Started")
  • Variants: 3 × 3 = 9 combinations

When to Use:

  • Testing multiple elements
  • Want to find best combination
  • Have enough traffic

Pros:

  • Test interactions between variables
  • Find optimal combination

Cons:

  • Requires much more traffic
  • Complex analysis
  • Longer test duration

3. Sequential Testing

What: Continuously monitor and stop early if clear winner

Example:

  • Start A/B test
  • Check results daily
  • Stop when statistical significance reached (could be day 3 or day 14)

When to Use:

  • Want to ship winners fast
  • High traffic
  • Using tools that support it (Statsig, GrowthBook)

Pros:

  • Faster results
  • Less opportunity cost

Cons:

  • Requires special statistical methods
  • Peeking at a traditional fixed-horizon test is invalid; sequential methods are required

4. Holdout Groups (Long-Term Effects)

What: Keep small % of users on old experience permanently

Example:

  • 95% of users: New feature
  • 5% of users: Old experience (holdout)

When to Use:

  • Measure long-term effects
  • Detect delayed negative impacts
  • Validate cumulative changes

Pros:

  • Detects long-term issues
  • Measures true impact

Cons:

  • Some users get worse experience
  • Requires ongoing monitoring

When to Experiment

✅ Experiment When:

  1. Significant Features (High Impact)

    • Major redesign
    • New pricing model
    • Core flow changes
  2. Uncertain Outcomes

    • Don't know if it will work
    • Conflicting opinions
    • No clear data
  3. Multiple Solution Options

    • Two different approaches
    • Want to pick the best
  4. Optimization Opportunities

    • Incremental improvements
    • Conversion optimization
    • Engagement optimization

❌ Don't Experiment When:

  1. Obvious Bugs/Fixes

    • Broken functionality
    • Security issues
    • Legal compliance
  2. Very Low Traffic

    • Can't reach statistical significance
    • Would take months
  3. Trivial Changes

    • Copy typo fix
    • Minor styling adjustment
  4. Ethical Issues

    • Manipulative dark patterns
    • Harmful to users

Experiment Design Process

Step 1: Define Hypothesis

Template:

"If we [change], then [metric] will [improve by X%], because [reasoning]."

Example:

"If we change the CTA button from blue to green, then click-through rate will increase by 10%, because green is more attention-grabbing."

Step 2: Choose Metrics

Primary Metric: What you're optimizing

  • Example: Click-through rate

Secondary Metrics: Other important outcomes

  • Example: Conversion rate, revenue per user

Counter Metrics: Watch for negatives

  • Example: Bounce rate, time on page

Step 3: Determine Sample Size

Inputs:

  • Baseline conversion rate: 5%
  • Expected improvement: 10% relative lift (5% → 5.5%)
  • Significance level: 0.05 (95% confidence)
  • Power: 0.80 (80% chance of detecting effect)

Output:

  • Sample size needed: ~31,000 users per variant

Tools: Evan Miller's and Optimizely's sample size calculators (both walked through in "Using Online Calculators" below)

Step 4: Set Test Duration

Factors:

  • Sample size needed
  • Daily traffic
  • Weekly patterns (run at least 1-2 weeks)
  • Business cycles

Example:

  • Sample size: 31,000 per variant (62,000 total)
  • Daily traffic: 5,000
  • Duration: 62,000 / 5,000 = 12.4 days → Run for 2 weeks

Step 5: Design Variants

Control (A): Current experience
Treatment (B): New experience

Best Practices:

  • Change only one thing (for A/B test)
  • Make change meaningful (not trivial)
  • Ensure variants are distinct

Step 6: Launch Test

Checklist:

  • Hypothesis documented
  • Metrics instrumented
  • Sample size calculated
  • Randomization working
  • QA tested both variants
  • Monitoring dashboard ready

Step 7: Analyze Results

Check:

  • Statistical significance (p < 0.05)
  • Practical significance (is improvement meaningful?)
  • Secondary metrics (any red flags?)
  • Segment analysis (works for everyone?)

Step 8: Decide (Ship, Iterate, Kill)

Ship if:

  • Positive, significant, no red flags

Iterate if:

  • Mixed results, some segments good

Kill if:

  • Negative, not significant, opportunity cost too high

Choosing Metrics

Primary Metric (What We're Optimizing)

Characteristics:

  • Directly tied to hypothesis
  • Sensitive to change
  • Measurable in test duration

Examples:

  • Click-through rate (CTR)
  • Conversion rate
  • Sign-up completion rate
  • Time to first action

Bad Primary Metrics:

  • Revenue (too noisy, delayed)
  • Retention (takes too long to measure)
  • NPS (survey-based, low sample)

Secondary Metrics (Guardrails, Side Effects)

Purpose: Ensure we're not breaking other things

Examples:

  • Revenue per user
  • Engagement (sessions per user)
  • Feature adoption
  • Customer satisfaction

Counter Metrics (Watch for Negatives)

Purpose: Detect unintended negative consequences

Examples:

  • Bounce rate (users leaving immediately)
  • Error rate (technical issues)
  • Support tickets (confusion)
  • Churn rate (users leaving)

Example: Checkout Flow Test

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%."

Metrics:

  • Primary: Checkout conversion rate
  • Secondary: Average order value, time to complete checkout
  • Counter: Cart abandonment rate, error rate, support tickets

Statistical Significance

P-Value < 0.05 (95% Confidence)

What it Means:

  • If there were truly no difference, a result at least this extreme would occur less than 5% of the time
  • Conventionally read as strong evidence that the effect is real

Example:

  • Control: 5.0% conversion
  • Treatment: 5.5% conversion
  • P-value: 0.03 ✅ (< 0.05, statistically significant)

Interpretation:

"We're 95% confident that the treatment is better than control."

Statistical Power (80%+)

What it Means:

  • 80% chance of detecting an effect if it exists
  • Reduces false negatives

Example:

  • Power: 80%
  • Means: 20% chance of missing a real effect

Minimum Detectable Effect (MDE)

What it Means:

  • Smallest effect size you can reliably detect
  • Depends on sample size

Example:

  • Baseline: 5% conversion
  • Sample size: ~31,000 per variant
  • MDE: 0.5% absolute (10% relative)
  • Can detect: 5.0% → 5.5% or larger

Trade-off:

  • Larger sample size → Smaller MDE (detect smaller effects)
  • Smaller sample size → Larger MDE (only detect big effects)
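
The trade-off can be computed directly. A small Python sketch (an approximation for a two-sided test on proportions, using the baseline variance for both variants) that estimates the MDE for a given per-variant sample size:

import math

from scipy.stats import norm

def mde_absolute(p1: float, n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate smallest absolute lift detectable with n users per variant."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # 1.96 + 0.84 at 95% / 80%
    return z * math.sqrt(2 * p1 * (1 - p1) / n)

print(f"{mde_absolute(0.05, 31000):.2%}")  # ≈ 0.49% absolute at a 5% baseline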

Sample Size Calculation

Formula (Simplified)

n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²

Where:
- n = sample size per variant
- Z_α/2 = 1.96 (for 95% confidence)
- Z_β = 0.84 (for 80% power)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate

Example Calculation

Inputs:

  • Baseline conversion rate (p₁): 5% = 0.05
  • Expected improvement: 10% relative lift
  • New conversion rate (p₂): 5.5% = 0.055
  • Significance level (α): 0.05
  • Power (1-β): 0.80

Calculation:

n = (1.96 + 0.84)² × (0.05×0.95 + 0.055×0.945) / (0.05 - 0.055)²
n = 7.84 × (0.0475 + 0.052) / 0.000025
n = 7.84 × 0.0995 / 0.000025
n ≈ 31,200 per variant

Total sample size: 62,400 users
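
The same calculation translates directly to code. A minimal Python sketch, using SciPy for the z-values instead of hard-coding 1.96 and 0.84:

import math

from scipy.stats import norm

def sample_size_per_variant(p1: float, p2: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users per variant needed to detect a p1 -> p2 shift (formula above)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(sample_size_per_variant(0.05, 0.055))  # ≈ 31,200 per variant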

Using Online Calculators

Evan Miller's Calculator:

  1. Go to https://www.evanmiller.org/ab-testing/sample-size.html
  2. Enter baseline conversion rate: 5%
  3. Enter minimum detectable effect: 10% (relative)
  4. Get sample size: ~31,000 per variant

Optimizely Calculator:

  1. Go to Optimizely sample size calculator
  2. Enter baseline: 5%
  3. Enter minimum detectable effect: 0.5% (absolute)
  4. Get sample size: ~31,000 per variant

Test Duration

Minimum Duration: 1-2 Weeks

Why:

  • Capture weekly patterns (weekday vs weekend)
  • Avoid day-of-week bias
  • Account for user behavior cycles

Example:

  • Don't run Monday-Wednesday only
  • Run at least Monday-Sunday (1 full week)

Full Business Cycles

Examples:

  • E-commerce: Include payday (1st and 15th of month)
  • B2B SaaS: Include full week (avoid Friday-only)
  • Seasonal: Avoid holidays (unless testing holiday-specific)

Enough Data for Significance

Formula:

Duration = Sample Size Needed / Daily Traffic

Example:

  • Sample size: 62,000 total
  • Daily traffic: 5,000
  • Duration: 62,000 / 5,000 = 12.4 days
  • Run for: 2 weeks (14 days)
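
The arithmetic folds into a one-line helper; this sketch also bakes in the two-week floor from the guidance above:

import math

def test_duration_days(total_sample: int, daily_traffic: int, min_days: int = 14) -> int:
    """Days to reach the required sample size, with a two-full-weeks floor."""
    return max(math.ceil(total_sample / daily_traffic), min_days)

print(test_duration_days(62000, 5000))  # 12.4 days of traffic -> run 14 days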

Not Too Long (Opportunity Cost)

Trade-off:

  • Longer test = More confidence
  • Longer test = Delayed learnings, slower iteration

Guideline:

  • Most tests: 1-4 weeks
  • High-traffic sites: 1-2 weeks
  • Low-traffic sites: 2-4 weeks
  • Don't run > 1 month (diminishing returns)

Experiment Variants

Control (Current Experience)

What: The existing experience

Example:

  • Current checkout flow (5 steps)
  • Current button color (blue)
  • Current pricing page

Purpose: Baseline for comparison

Treatment (New Experience)

What: The proposed change

Example:

  • New checkout flow (3 steps)
  • New button color (green)
  • New pricing page

Purpose: Test hypothesis

Multiple Treatments (If Testing Different Approaches)

Example:

  • Control: 5-step checkout
  • Treatment A: 3-step checkout (combine steps)
  • Treatment B: 1-page checkout (all on one page)

Traffic Split:

  • Control: 33%
  • Treatment A: 33%
  • Treatment B: 34%

Analysis:

  • Compare each treatment to control
  • Compare treatments to each other

Randomization

User-Level Randomization (Consistent Experience)

What: Each user always sees same variant

How:

const variant = hashUserId(userId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

  • Logged-in users
  • Want consistent experience
  • Testing flows (multi-step)

Pros:

  • Consistent experience
  • No confusion

Cons:

  • Requires user ID
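
The hashUserId call above is pseudocode; a concrete Python sketch of the same idea follows. A stable cryptographic hash (rather than a language's built-in hash(), which can differ across processes) keeps assignment deterministic everywhere; for session-level randomization (next section), hash the session ID instead:

import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministic bucketing: same user + same experiment -> same variant."""
    # Salt with the experiment name so different experiments split independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Stable across calls, processes, and servers; no assignment table needed
assert assign_variant("user-42", "new-checkout-flow") == \
       assign_variant("user-42", "new-checkout-flow")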

Session-Level (For Anonymous Users)

What: Each session sees same variant (but different sessions can differ)

How:

const variant = hashSessionId(sessionId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

  • Anonymous users
  • Single-page tests

Pros:

  • Works for anonymous users

Cons:

  • Same user can see different variants across sessions

Stratified Sampling (For Segments)

What: Ensure even distribution across segments

Example:

  • Segment 1: Free users (50% control, 50% treatment)
  • Segment 2: Paid users (50% control, 50% treatment)

Why:

  • Avoid imbalanced segments
  • Enable segment analysis

Common Pitfalls

1. Peeking (Stopping Test Early When "Winning")

Problem:

Day 3: Treatment is winning! (p = 0.04) → Ship it!
Day 7: Treatment is losing... (p = 0.12) → Oops.

Why It's Bad:

  • Increases false positive rate
  • P-value fluctuates during test

Solution:

  • Decide sample size upfront
  • Don't look until test completes
  • Or use sequential testing (proper method)

2. Sample Ratio Mismatch (Uneven Splits)

Problem:

Expected: 50% control, 50% treatment
Actual: 48% control, 52% treatment

Why It's Bad:

  • Indicates randomization bug
  • Results may be invalid

Solution:

  • Check sample ratio before analyzing
  • Investigate if mismatch > 1%
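
A quick implementation of that check is a chi-square goodness-of-fit test against the expected split. A minimal sketch; a strict threshold like 0.001 is a common convention, since this check runs on every experiment:

from scipy.stats import chisquare

def srm_detected(control_n: int, treatment_n: int, alpha: float = 0.001) -> bool:
    """Flag a suspicious deviation from the expected 50/50 split."""
    total = control_n + treatment_n
    _, p = chisquare([control_n, treatment_n], f_exp=[total / 2, total / 2])
    return p < alpha  # True -> debug randomization before analyzing results

print(srm_detected(4800, 5200))  # the 48/52 split above on 10,000 users -> True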

3. Novelty Effect (Users Trying New Thing)

Problem:

Week 1: Treatment is winning! (+20%)
Week 4: Treatment is same as control (0%)

Why It's Bad:

  • Users try new thing out of curiosity
  • Effect fades over time

Solution:

  • Run test longer (2-4 weeks)
  • Use holdout group for long-term measurement
  • Segment by new vs returning users

4. Seasonality (Testing During Holidays)

Problem:

Test during Black Friday: +50% conversion
Test during normal week: +5% conversion

Why It's Bad:

  • Holiday behavior is different
  • Results don't generalize

Solution:

  • Avoid testing during holidays
  • Or run test across multiple weeks (include holiday + normal)

Sequential Testing

What is Sequential Testing?

Traditional A/B Test:

  • Decide sample size upfront
  • Run until sample size reached
  • Analyze once at end

Sequential Testing:

  • Monitor continuously
  • Stop early if clear winner
  • Adjust significance threshold

How It Works

Algorithm:

  • Use adjusted significance threshold (not 0.05)
  • Account for multiple looks
  • Stop when threshold crossed

Example (Simplified):

Day 1: p = 0.10 → Continue
Day 3: p = 0.03 → Continue
Day 5: p = 0.001 → Stop! (clear winner)
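
As a toy illustration (real platforms compute exact boundaries such as O'Brien-Fleming, or use always-valid methods like mSPRT), here is the idea with the Pocock bound for five planned looks. Note that day 3's p = 0.03 correctly does not stop the test, even though it is below 0.05:

POCOCK_BOUND = 0.0158  # per-look threshold for 5 planned looks at overall alpha = 0.05

def sequential_decision(interim_p_values: list) -> str:
    """Stop at the first look whose p-value crosses the adjusted boundary."""
    for look, p in enumerate(interim_p_values, start=1):
        if p < POCOCK_BOUND:
            return f"Stop at look {look}: significant (p = {p})"
    return "Continue: no boundary crossed yet"

print(sequential_decision([0.10, 0.03, 0.001]))  # days 1, 3, 5 above -> stops at look 3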

Tools That Support Sequential Testing

  • Statsig: Built-in sequential testing
  • GrowthBook: Bayesian statistics
  • Optimizely: Stats Engine (sequential)

Benefits

  • Faster results (stop early if clear winner)
  • Less opportunity cost
  • Detect large effects quickly

Drawbacks

  • Requires special tools
  • Can't use traditional p-value
  • More complex

Holdout Groups

What is a Holdout Group?

Definition: Small % of users kept on old experience permanently

Example:

  • 95% of users: New feature
  • 5% of users: Old experience (holdout)

Why Use Holdout Groups?

Measure Long-Term Effects:

  • A/B test shows +10% conversion in 2 weeks
  • Holdout shows +5% conversion after 6 months
  • Learning: Effect diminishes over time

Detect Delayed Negative Impacts:

  • A/B test shows +15% signups
  • Holdout shows +10% churn after 3 months
  • Learning: Feature attracts wrong users

How Long to Keep Holdout?

Guideline:

  • 1-3 months for most features
  • 6-12 months for major changes
  • Permanent for critical features

When to Remove Holdout?

Remove if:

  • No long-term differences detected
  • Opportunity cost too high (5% of users on worse experience)
  • Feature is critical (everyone should have it)

Experiment Analysis

Step 1: Compare Primary Metric

Example:

  • Control: 5.0% conversion
  • Treatment: 5.5% conversion
  • Lift: +10% relative
  • P-value: 0.03 ✅

Decision: Treatment is statistically significantly better.
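
This comparison can be scripted with statsmodels. A sketch with hypothetical raw counts (~20,000 users per variant) chosen to match the rates above:

from statsmodels.stats.proportion import (confint_proportions_2indep,
                                          proportions_ztest)

# Hypothetical counts behind the 5.0% vs 5.5% rates above
ctrl_conv, ctrl_n = 1000, 20000      # 5.0%
treat_conv, treat_n = 1100, 20000    # 5.5%

z_stat, p_value = proportions_ztest([treat_conv, ctrl_conv], [treat_n, ctrl_n])
print(f"p = {p_value:.3f}")  # ≈ 0.025 with these counts

# 95% confidence interval on the absolute difference in conversion rates
low, high = confint_proportions_2indep(treat_conv, treat_n, ctrl_conv, ctrl_n)
print(f"95% CI for the absolute lift: [{low:+.2%}, {high:+.2%}]")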

Step 2: Check Secondary Metrics

Example:

  • Revenue per user: $10.50 (control) vs $11.20 (treatment) ✅
  • Time to checkout: 3.2 min (control) vs 2.8 min (treatment) ✅

Decision: Secondary metrics also improved.

Step 3: Check Counter Metrics

Example:

  • Bounce rate: 30% (control) vs 32% (treatment) ⚠️
  • Error rate: 0.5% (control) vs 0.5% (treatment) ✅

Decision: Slight increase in bounce rate, investigate.

Step 4: Segment Analysis

Did it work for everyone?

| Segment    | Control | Treatment | Lift    |
|------------|---------|-----------|---------|
| Mobile     | 4.5%    | 5.2%      | +15% ✅ |
| Desktop    | 5.5%    | 5.8%      | +5% ✅  |
| Free users | 3.0%    | 3.6%      | +20% ✅ |
| Paid users | 7.0%    | 7.1%      | +1% ⚠️  |

Learning: Works great for mobile and free users, minimal impact on paid users.
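
The same two-proportion test can be run per segment. A sketch with hypothetical per-segment counts (mobile and desktop only):

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts roughly matching the table above
segments = {
    "Mobile":  {"ctrl": (450, 10000), "treat": (518, 10000)},    # 4.5% vs ~5.2%
    "Desktop": {"ctrl": (2750, 50000), "treat": (2900, 50000)},  # 5.5% vs 5.8%
}

for name, counts in segments.items():
    (c_conv, c_n), (t_conv, t_n) = counts["ctrl"], counts["treat"]
    _, p = proportions_ztest([t_conv, c_conv], [t_n, c_n])
    lift = (t_conv / t_n) / (c_conv / c_n) - 1
    print(f"{name}: lift {lift:+.0%}, p = {p:.3f}")

Keep in mind that slicing shrinks each sample: per-segment p-values have less power, and checking many segments inflates false positives, so treat segment-level results as directional unless the test was powered for them.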

Step 5: Statistical Significance

Check:

  • P-value < 0.05 ✅
  • Confidence interval doesn't include 0 ✅

Example:

  • Lift: +10%
  • 95% CI: [+5%, +15%]
  • Interpretation: We're 95% confident the true lift is between 5% and 15%.

Step 6: Practical Significance

Is the improvement meaningful?

Example:

  • Statistically significant: Yes (p = 0.04)
  • Lift: +0.1% relative (5.0% → 5.005%)
  • Decision: Not practically significant (too small to matter)

Guideline:

  • Small lift but high volume → Ship (e.g., +0.1 percentage points across 1M users = 1,000 more conversions)
  • Large lift but low volume → Maybe ship (e.g., +50% on 100 users = 50 more conversions)

Decision Framework

Ship If:

  • ✅ Positive: Treatment is better than control
  • ✅ Significant: P-value < 0.05
  • ✅ No Red Flags: Secondary and counter metrics look good
  • ✅ Works for Key Segments: At least works for the majority

Example:

  • Conversion: +10% (p = 0.03) ✅
  • Revenue: +8% (p = 0.05) ✅
  • Bounce rate: No change ✅
  • Works for mobile and desktop ✅
  • Decision: Ship!

Iterate If:

  • ⚠️ Mixed Results: Some metrics up, some down
  • ⚠️ Works for Some Segments Only: E.g., only mobile, not desktop
  • ⚠️ Close to Significance: P = 0.06 (just missed)

Example:

  • Conversion: +10% (p = 0.03) ✅
  • Revenue: -5% (p = 0.08) ⚠️
  • Decision: Iterate. Conversion is up but revenue is down. Investigate why.

Kill If:

  • ❌ Negative: Treatment is worse than control
  • ❌ Not Significant: P-value > 0.05
  • ❌ Opportunity Cost Too High: Could be working on better ideas

Example:

  • Conversion: +2% (p = 0.15) ❌
  • Took 4 weeks to test
  • Decision: Kill. Not significant, move on to next idea.

Tools

Feature Flags

LaunchDarkly:

  • Feature flag management
  • Gradual rollouts
  • Kill switches

Split.io:

  • Feature flags + experimentation
  • Real-time metrics

Unleash:

  • Open-source feature flags
  • Self-hosted option

Experimentation Platforms

Optimizely:

  • Full-stack experimentation
  • Visual editor for web
  • Stats Engine (sequential testing)

VWO (Visual Website Optimizer):

  • A/B testing for web
  • Heatmaps, session recordings
  • Visual editor

GrowthBook:

  • Open-source experimentation
  • Bayesian statistics
  • Feature flags

Statsig:

  • Modern experimentation platform
  • Sequential testing
  • Free tier

Analytics

Amplitude:

  • Product analytics
  • Funnel analysis
  • Cohort analysis

Mixpanel:

  • Event-based analytics
  • A/B test analysis
  • Retention analysis

PostHog:

  • Open-source product analytics
  • Feature flags
  • Session replay

A/B Testing for Engineers

1. Feature Flag Implementation

Node.js (LaunchDarkly):

const express = require('express');
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const app = express();
const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);

app.get('/checkout', async (req, res) => {
  const user = {
    key: req.user.id,
    email: req.user.email,
    custom: {
      plan: req.user.plan
    }
  };

  // Third argument is the default served if the flag can't be evaluated
  const showNewCheckout = await client.variation('new-checkout-flow', user, false);

  if (showNewCheckout) {
    res.render('checkout-new');
  } else {
    res.render('checkout-old');
  }
});

// Serve traffic only once the SDK has loaded flag data
// (top-level await is not available in CommonJS modules)
client.waitForInitialization().then(() => app.listen(3000));

Python (Statsig):

import os

from flask import Flask, render_template
from flask_login import current_user
from statsig import statsig, StatsigUser

app = Flask(__name__)

statsig.initialize(os.environ['STATSIG_SERVER_KEY'])

@app.route('/checkout')
def checkout():
    # The Statsig SDK expects a StatsigUser object rather than a plain dict
    user = StatsigUser(
        user_id=current_user.id,
        email=current_user.email,
        custom={'plan': current_user.plan},
    )

    show_new_checkout = statsig.check_gate(user, 'new_checkout_flow')

    if show_new_checkout:
        return render_template('checkout_new.html')
    return render_template('checkout_old.html')

2. Metric Instrumentation

Segment (Event Tracking):

const Analytics = require('analytics-node');
const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY);

// Track checkout started
analytics.track({
  userId: user.id,
  event: 'Checkout Started',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    cart_value: cart.total,
    items_count: cart.items.length
  }
});

// Track checkout completed
analytics.track({
  userId: user.id,
  event: 'Checkout Completed',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    order_id: order.id,
    revenue: order.total
  }
});

3. Data Pipeline

Architecture:

Application
    ↓ (events)
Segment
    ↓ (forwards to)
├── Amplitude (analytics)
├── Mixpanel (analytics)
├── Data Warehouse (BigQuery, Snowflake)
└── Statsig (experimentation)

4. Results Dashboard

Grafana Dashboard:

{
  "dashboard": {
    "title": "A/B Test: New Checkout Flow",
    "panels": [
      {
        "title": "Conversion Rate by Variant",
        "targets": [
          {
            "expr": "sum(checkout_completed{variant='control'}) / sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_completed{variant='treatment'}) / sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      },
      {
        "title": "Sample Size",
        "targets": [
          {
            "expr": "sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      }
    ]
  }
}

Real Experiment Examples

Example 1: Button Color Test (Classic)

Hypothesis:

"If we change the CTA button from blue to orange, click-through rate will increase by 10%, because orange is more attention-grabbing."

Test:

  • Control: Blue button
  • Treatment: Orange button
  • Sample size: 10,000 per variant
  • Duration: 1 week

Results:

  • Control: 5.2% CTR
  • Treatment: 5.7% CTR
  • Lift: +9.6%
  • P-value: 0.04 ✅

Decision: Ship orange button.

Example 2: Checkout Flow Optimization

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%, because users abandon due to flow length."

Test:

  • Control: 5-step checkout
  • Treatment: 3-step checkout (combined steps)
  • Sample size: 50,000 per variant
  • Duration: 2 weeks

Results:

  • Control: 8.5% conversion
  • Treatment: 9.8% conversion
  • Lift: +15.3%
  • P-value: 0.001 ✅

Secondary Metrics:

  • Time to checkout: 4.2 min → 3.1 min ✅
  • Error rate: 2.1% → 1.8% ✅

Decision: Ship 3-step checkout.

Example 3: Pricing Page Variants

Hypothesis:

"If we show annual pricing first (instead of monthly), annual plan adoption will increase by 25%, because anchoring effect."

Test:

  • Control: Monthly pricing shown first
  • Treatment: Annual pricing shown first
  • Sample size: 20,000 per variant
  • Duration: 3 weeks

Results:

  • Control: 12% annual adoption
  • Treatment: 18% annual adoption
  • Lift: +50%
  • P-value: 0.001 ✅

Counter Metrics:

  • Overall conversion: 10.5% → 10.2% ⚠️ (slight drop)

Decision: Ship, but monitor overall conversion.

Example 4: Onboarding Flow

Hypothesis:

"If we add an interactive tutorial in onboarding, activation rate will increase by 30%, because users don't know how to get started."

Test:

  • Control: No tutorial
  • Treatment: Interactive tutorial (5 steps)
  • Sample size: 15,000 per variant
  • Duration: 2 weeks

Results:

  • Control: 25% activation rate
  • Treatment: 28% activation rate
  • Lift: +12%
  • P-value: 0.08 ❌ (not significant)

Segment Analysis:

  • New users: +20% (p = 0.03) ✅
  • Returning users: +2% (p = 0.5) ❌

Decision: Iterate. Show tutorial only to new users.


Advanced: Bayesian A/B Testing

Traditional (Frequentist) A/B Testing

Approach:

  • Null hypothesis: No difference between A and B
  • P-value: Probability of seeing this result if null is true
  • Reject null if p < 0.05

Interpretation:

"There's a 95% chance the result is not due to random chance."

Bayesian A/B Testing

Approach:

  • Prior belief: What we believe before test
  • Likelihood: Data from test
  • Posterior belief: Updated belief after test

Interpretation:

"There's a 95% probability that B is better than A."

Benefits of Bayesian

  1. Easier to Interpret:

    • "95% probability B is better" (intuitive)
    • vs "p = 0.03" (confusing)
  2. Can Stop Early:

    • No peeking problem
    • Stop when confident enough
  3. Incorporates Prior Knowledge:

    • Use historical data
    • More accurate with small samples

Tools That Use Bayesian

  • GrowthBook: Bayesian by default
  • VWO: Bayesian engine option
  • Google Optimize: Bayesian (discontinued in 2023)

Example

Test:

  • Control: 5.0% conversion (5,000 users)
  • Treatment: 5.5% conversion (5,000 users)

Frequentist:

  • P-value: 0.26 (not significant)
  • Decision: Can't conclude

Bayesian:

  • Probability B > A: 87%
  • Expected lift: +10%
  • Decision: Likely better, but not confident enough (need 95%)
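
A minimal Monte Carlo sketch of the Bayesian read on this test, assuming uniform Beta(1, 1) priors on each conversion rate:

import numpy as np

rng = np.random.default_rng(0)
draws = 100_000

# Posterior = Beta(1 + conversions, 1 + non-conversions)
control = rng.beta(1 + 250, 1 + 4750, draws)     # 250 / 5,000 = 5.0%
treatment = rng.beta(1 + 275, 1 + 4725, draws)   # 275 / 5,000 = 5.5%

prob_better = (treatment > control).mean()
expected_lift = (treatment / control - 1).mean()
print(f"P(B > A) ≈ {prob_better:.0%}, expected lift ≈ {expected_lift:+.0%}")
# Prints roughly: P(B > A) ≈ 87%, expected lift ≈ +10%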

Summary

Quick Reference

Experiment Types:

  • A/B test: Two variants
  • Multivariate: Multiple changes
  • Sequential: Stop early
  • Holdout: Long-term measurement

When to Experiment:

  • Significant features
  • Uncertain outcomes
  • Multiple options
  • Optimization

Process:

  1. Define hypothesis
  2. Choose metrics
  3. Calculate sample size
  4. Set duration
  5. Design variants
  6. Launch
  7. Analyze
  8. Decide

Metrics:

  • Primary: What we're optimizing
  • Secondary: Guardrails
  • Counter: Watch for negatives

Statistical Significance:

  • P-value < 0.05
  • Power > 80%
  • Minimum detectable effect

Common Pitfalls:

  • Peeking
  • Sample ratio mismatch
  • Novelty effect
  • Seasonality

Decision Framework:

  • Ship: Positive, significant, no red flags
  • Iterate: Mixed results
  • Kill: Negative, not significant

Tools:

  • Feature flags: LaunchDarkly, Split.io
  • Experimentation: Optimizely, Statsig, GrowthBook
  • Analytics: Amplitude, Mixpanel, PostHog
