
---
name: Experiment Design
description: Comprehensive guide to A/B testing, multivariate testing, statistical significance, and experiment analysis for data-driven product decisions
---

Experiment Design

Types of Experiments

1. A/B Test (Two Variants)

What: Compare two versions (A vs B)

Example:

  • Control (A): Blue "Buy Now" button
  • Treatment (B): Green "Buy Now" button

When to Use:

  • Testing single change
  • Clear hypothesis
  • Binary decision (ship or don't ship)

Pros:

  • Simple to implement
  • Easy to analyze
  • Clear winner

Cons:

  • Only tests one change
  • Can't test interactions

2. Multivariate Test (Multiple Changes)

What: Test multiple changes simultaneously

Example:

  • Variable 1: Button color (Blue, Green, Red)
  • Variable 2: Button text ("Buy Now", "Add to Cart", "Get Started")
  • Variants: 3 × 3 = 9 combinations

When to Use:

  • Testing multiple elements
  • Want to find best combination
  • Have enough traffic

Pros:

  • Test interactions between variables
  • Find optimal combination

Cons:

  • Requires much more traffic
  • Complex analysis
  • Longer test duration

3. Sequential Testing

What: Continuously monitor and stop early if clear winner

Example:

  • Start A/B test
  • Check results daily
  • Stop when statistical significance reached (could be day 3 or day 14)

When to Use:

  • Want to ship winners fast
  • High traffic
  • Using tools that support it (Statsig, GrowthBook)

Pros:

  • Faster results
  • Less opportunity cost

Cons:

  • Requires special statistical methods
  • Peeking at a traditional fixed-horizon test is invalid; sequential methods are required

4. Holdout Groups (Long-Term Effects)

What: Keep small % of users on old experience permanently

Example:

  • 95% of users: New feature
  • 5% of users: Old experience (holdout)

When to Use:

  • Measure long-term effects
  • Detect delayed negative impacts
  • Validate cumulative changes

Pros:

  • Detects long-term issues
  • Measures true impact

Cons:

  • Some users get worse experience
  • Requires ongoing monitoring

When to Experiment

✅ Experiment When:

  1. Significant Features (High Impact)

    • Major redesign
    • New pricing model
    • Core flow changes
  2. Uncertain Outcomes

    • Don't know if it will work
    • Conflicting opinions
    • No clear data
  3. Multiple Solution Options

    • Two different approaches
    • Want to pick the best
  4. Optimization Opportunities

    • Incremental improvements
    • Conversion optimization
    • Engagement optimization

❌ Don't Experiment When:

  1. Obvious Bugs/Fixes

    • Broken functionality
    • Security issues
    • Legal compliance
  2. Very Low Traffic

    • Can't reach statistical significance
    • Would take months
  3. Trivial Changes

    • Copy typo fix
    • Minor styling adjustment
  4. Ethical Issues

    • Manipulative dark patterns
    • Harmful to users

Experiment Design Process

Step 1: Define Hypothesis

Template:

"If we [change], then [metric] will [improve by X%], because [reasoning]."

Example:

"If we change the CTA button from blue to green, then click-through rate will increase by 10%, because green is more attention-grabbing."

Step 2: Choose Metrics

Primary Metric: What you're optimizing

  • Example: Click-through rate

Secondary Metrics: Other important outcomes

  • Example: Conversion rate, revenue per user

Counter Metrics: Watch for negatives

  • Example: Bounce rate, time on page

Step 3: Determine Sample Size

Inputs:

  • Baseline conversion rate: 5%
  • Expected improvement: 10% relative lift (5% → 5.5%)
  • Significance level: 0.05 (95% confidence)
  • Power: 0.80 (80% chance of detecting effect)

Output:

  • Sample size needed: ~31,000 users per variant

Tools: Evan Miller's and Optimizely's sample size calculators (both walked through in "Using Online Calculators" below)

Step 4: Set Test Duration

Factors:

  • Sample size needed
  • Daily traffic
  • Weekly patterns (run at least 1-2 weeks)
  • Business cycles

Example:

  • Sample size: 31,000 per variant (62,000 total)
  • Daily traffic: 5,000
  • Duration: 62,000 / 5,000 = 12.4 days → Run for 2 weeks

Step 5: Design Variants

Control (A): Current experience
Treatment (B): New experience

Best Practices:

  • Change only one thing (for A/B test)
  • Make change meaningful (not trivial)
  • Ensure variants are distinct

Step 6: Launch Test

Checklist:

  • Hypothesis documented
  • Metrics instrumented
  • Sample size calculated
  • Randomization working
  • QA tested both variants
  • Monitoring dashboard ready

Step 7: Analyze Results

Check:

  • Statistical significance (p < 0.05)
  • Practical significance (is improvement meaningful?)
  • Secondary metrics (any red flags?)
  • Segment analysis (works for everyone?)

Step 8: Decide (Ship, Iterate, Kill)

Ship if:

  • Positive, significant, no red flags

Iterate if:

  • Mixed results, some segments good

Kill if:

  • Negative, not significant, opportunity cost too high

Choosing Metrics

Primary Metric (What We're Optimizing)

Characteristics:

  • Directly tied to hypothesis
  • Sensitive to change
  • Measurable in test duration

Examples:

  • Click-through rate (CTR)
  • Conversion rate
  • Sign-up completion rate
  • Time to first action

Bad Primary Metrics:

  • Revenue (too noisy, delayed)
  • Retention (takes too long to measure)
  • NPS (survey-based, low sample)

Secondary Metrics (Guardrails, Side Effects)

Purpose: Ensure we're not breaking other things

Examples:

  • Revenue per user
  • Engagement (sessions per user)
  • Feature adoption
  • Customer satisfaction

Counter Metrics (Watch for Negatives)

Purpose: Detect unintended negative consequences

Examples:

  • Bounce rate (users leaving immediately)
  • Error rate (technical issues)
  • Support tickets (confusion)
  • Churn rate (users leaving)

Example: Checkout Flow Test

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%."

Metrics:

  • Primary: Checkout conversion rate
  • Secondary: Average order value, time to complete checkout
  • Counter: Cart abandonment rate, error rate, support tickets

Statistical Significance

P-Value < 0.05 (95% Confidence)

What it Means:

  • If there were truly no difference, a result at least this extreme would occur less than 5% of the time
  • Conventionally read as strong evidence that the effect is real

Example:

  • Control: 5.0% conversion
  • Treatment: 5.5% conversion
  • P-value: 0.03 ✅ (< 0.05, statistically significant)

Interpretation:

"We're 95% confident that the treatment is better than control."

Statistical Power (80%+)

What it Means:

  • 80% chance of detecting an effect if it exists
  • Reduces false negatives

Example:

  • Power: 80%
  • Means: 20% chance of missing a real effect

Minimum Detectable Effect (MDE)

What it Means:

  • Smallest effect size you can reliably detect
  • Depends on sample size

Example:

  • Baseline: 5% conversion
  • Sample size: ~31,000 per variant
  • MDE: 0.5% absolute (10% relative)
  • Can detect: 5.0% → 5.5% or larger

Trade-off:

  • Larger sample size → Smaller MDE (detect smaller effects)
  • Smaller sample size → Larger MDE (only detect big effects)
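
The trade-off can be computed directly. A small Python sketch (an approximation for a two-sided test on proportions, using the baseline variance for both variants) that estimates the MDE for a given per-variant sample size:

import math

from scipy.stats import norm

def mde_absolute(p1: float, n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate smallest absolute lift detectable with n users per variant."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # 1.96 + 0.84 at 95% / 80%
    return z * math.sqrt(2 * p1 * (1 - p1) / n)

print(f"{mde_absolute(0.05, 31000):.2%}")  # ≈ 0.49% absolute at a 5% baseline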

Sample Size Calculation

Formula (Simplified)

n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²

Where:
- n = sample size per variant
- Z_α/2 = 1.96 (for 95% confidence)
- Z_β = 0.84 (for 80% power)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate

Example Calculation

Inputs:

  • Baseline conversion rate (p₁): 5% = 0.05
  • Expected improvement: 10% relative lift
  • New conversion rate (p₂): 5.5% = 0.055
  • Significance level (α): 0.05
  • Power (1-β): 0.80

Calculation:

n = (1.96 + 0.84)² × (0.05×0.95 + 0.055×0.945) / (0.05 - 0.055)²
n = 7.84 × (0.0475 + 0.052) / 0.000025
n = 7.84 × 0.0995 / 0.000025
n ≈ 31,200 per variant

Total sample size: 62,400 users
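
The same calculation translates directly to code. A minimal Python sketch, using SciPy for the z-values instead of hard-coding 1.96 and 0.84:

import math

from scipy.stats import norm

def sample_size_per_variant(p1: float, p2: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users per variant needed to detect a p1 -> p2 shift (formula above)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(sample_size_per_variant(0.05, 0.055))  # ≈ 31,200 per variant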

Using Online Calculators

Evan Miller's Calculator:

  1. Go to https://www.evanmiller.org/ab-testing/sample-size.html
  2. Enter baseline conversion rate: 5%
  3. Enter minimum detectable effect: 10% (relative)
  4. Get sample size: ~31,000 per variant

Optimizely Calculator:

  1. Go to Optimizely sample size calculator
  2. Enter baseline: 5%
  3. Enter minimum detectable effect: 0.5% (absolute)
  4. Get sample size: ~31,000 per variant

Test Duration

Minimum Duration: 1-2 Weeks

Why:

  • Capture weekly patterns (weekday vs weekend)
  • Avoid day-of-week bias
  • Account for user behavior cycles

Example:

  • Don't run Monday-Wednesday only
  • Run at least Monday-Sunday (1 full week)

Full Business Cycles

Examples:

  • E-commerce: Include payday (1st and 15th of month)
  • B2B SaaS: Include full week (avoid Friday-only)
  • Seasonal: Avoid holidays (unless testing holiday-specific)

Enough Data for Significance

Formula:

Duration = Sample Size Needed / Daily Traffic

Example:

  • Sample size: 62,000 total
  • Daily traffic: 5,000
  • Duration: 62,000 / 5,000 = 12.4 days
  • Run for: 2 weeks (14 days)
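
The arithmetic folds into a one-line helper; this sketch also bakes in the two-week floor from the guidance above:

import math

def test_duration_days(total_sample: int, daily_traffic: int, min_days: int = 14) -> int:
    """Days to reach the required sample size, with a two-full-weeks floor."""
    return max(math.ceil(total_sample / daily_traffic), min_days)

print(test_duration_days(62000, 5000))  # 12.4 days of traffic -> run 14 days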

Not Too Long (Opportunity Cost)

Trade-off:

  • Longer test = More confidence
  • Longer test = Delayed learnings, slower iteration

Guideline:

  • Most tests: 1-4 weeks
  • High-traffic sites: 1-2 weeks
  • Low-traffic sites: 2-4 weeks
  • Don't run > 1 month (diminishing returns)

Experiment Variants

Control (Current Experience)

What: The existing experience

Example:

  • Current checkout flow (5 steps)
  • Current button color (blue)
  • Current pricing page

Purpose: Baseline for comparison

Treatment (New Experience)

What: The proposed change

Example:

  • New checkout flow (3 steps)
  • New button color (green)
  • New pricing page

Purpose: Test hypothesis

Multiple Treatments (If Testing Different Approaches)

Example:

  • Control: 5-step checkout
  • Treatment A: 3-step checkout (combine steps)
  • Treatment B: 1-page checkout (all on one page)

Traffic Split:

  • Control: 33%
  • Treatment A: 33%
  • Treatment B: 34%

Analysis:

  • Compare each treatment to control
  • Compare treatments to each other

Randomization

User-Level Randomization (Consistent Experience)

What: Each user always sees same variant

How:

const variant = hashUserId(userId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

  • Logged-in users
  • Want consistent experience
  • Testing flows (multi-step)

Pros:

  • Consistent experience
  • No confusion

Cons:

  • Requires user ID
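
The hashUserId call above is pseudocode; a concrete Python sketch of the same idea follows. A stable cryptographic hash (rather than a language's built-in hash(), which can differ across processes) keeps assignment deterministic everywhere; for session-level randomization (next section), hash the session ID instead:

import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministic bucketing: same user + same experiment -> same variant."""
    # Salt with the experiment name so different experiments split independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Stable across calls, processes, and servers; no assignment table needed
assert assign_variant("user-42", "new-checkout-flow") == \
       assign_variant("user-42", "new-checkout-flow")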

Session-Level (For Anonymous Users)

What: Each session sees same variant (but different sessions can differ)

How:

const variant = hashSessionId(sessionId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

  • Anonymous users
  • Single-page tests

Pros:

  • Works for anonymous users

Cons:

  • Same user can see different variants across sessions

Stratified Sampling (For Segments)

What: Ensure even distribution across segments

Example:

  • Segment 1: Free users (50% control, 50% treatment)
  • Segment 2: Paid users (50% control, 50% treatment)

Why:

  • Avoid imbalanced segments
  • Enable segment analysis

Common Pitfalls

1. Peeking (Stopping Test Early When "Winning")

Problem:

Day 3: Treatment is winning! (p = 0.04) → Ship it!
Day 7: Treatment is losing... (p = 0.12) → Oops.

Why It's Bad:

  • Increases false positive rate
  • P-value fluctuates during test

Solution:

  • Decide sample size upfront
  • Don't look until test completes
  • Or use sequential testing (proper method)

2. Sample Ratio Mismatch (Uneven Splits)

Problem:

Expected: 50% control, 50% treatment
Actual: 48% control, 52% treatment

Why It's Bad:

  • Indicates randomization bug
  • Results may be invalid

Solution:

  • Check sample ratio before analyzing
  • Investigate if mismatch > 1%
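
A quick implementation of that check is a chi-square goodness-of-fit test against the expected split. A minimal sketch; a strict threshold like 0.001 is a common convention, since this check runs on every experiment:

from scipy.stats import chisquare

def srm_detected(control_n: int, treatment_n: int, alpha: float = 0.001) -> bool:
    """Flag a suspicious deviation from the expected 50/50 split."""
    total = control_n + treatment_n
    _, p = chisquare([control_n, treatment_n], f_exp=[total / 2, total / 2])
    return p < alpha  # True -> debug randomization before analyzing results

print(srm_detected(4800, 5200))  # the 48/52 split above on 10,000 users -> True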

3. Novelty Effect (Users Trying New Thing)

Problem:

Week 1: Treatment is winning! (+20%)
Week 4: Treatment is same as control (0%)

Why It's Bad:

  • Users try new thing out of curiosity
  • Effect fades over time

Solution:

  • Run test longer (2-4 weeks)
  • Use holdout group for long-term measurement
  • Segment by new vs returning users

4. Seasonality (Testing During Holidays)

Problem:

Test during Black Friday: +50% conversion
Test during normal week: +5% conversion

Why It's Bad:

  • Holiday behavior is different
  • Results don't generalize

Solution:

  • Avoid testing during holidays
  • Or run test across multiple weeks (include holiday + normal)

Sequential Testing

What is Sequential Testing?

Traditional A/B Test:

  • Decide sample size upfront
  • Run until sample size reached
  • Analyze once at end

Sequential Testing:

  • Monitor continuously
  • Stop early if clear winner
  • Adjust significance threshold

How It Works

Algorithm:

  • Use adjusted significance threshold (not 0.05)
  • Account for multiple looks
  • Stop when threshold crossed

Example (Simplified):

Day 1: p = 0.10 → Continue
Day 3: p = 0.03 → Continue
Day 5: p = 0.001 → Stop! (clear winner)
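
As a toy illustration (real platforms compute exact boundaries such as O'Brien-Fleming, or use always-valid methods like mSPRT), here is the idea with the Pocock bound for five planned looks. Note that day 3's p = 0.03 correctly does not stop the test, even though it is below 0.05:

POCOCK_BOUND = 0.0158  # per-look threshold for 5 planned looks at overall alpha = 0.05

def sequential_decision(interim_p_values: list) -> str:
    """Stop at the first look whose p-value crosses the adjusted boundary."""
    for look, p in enumerate(interim_p_values, start=1):
        if p < POCOCK_BOUND:
            return f"Stop at look {look}: significant (p = {p})"
    return "Continue: no boundary crossed yet"

print(sequential_decision([0.10, 0.03, 0.001]))  # days 1, 3, 5 above -> stops at look 3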

Tools That Support Sequential Testing

  • Statsig: Built-in sequential testing
  • GrowthBook: Bayesian statistics
  • Optimizely: Stats Engine (sequential)

Benefits

  • Faster results (stop early if clear winner)
  • Less opportunity cost
  • Detect large effects quickly

Drawbacks

  • Requires special tools
  • Can't use traditional p-value
  • More complex

Holdout Groups

What is a Holdout Group?

Definition: Small % of users kept on old experience permanently

Example:

  • 95% of users: New feature
  • 5% of users: Old experience (holdout)

Why Use Holdout Groups?

Measure Long-Term Effects:

  • A/B test shows +10% conversion in 2 weeks
  • Holdout shows +5% conversion after 6 months
  • Learning: Effect diminishes over time

Detect Delayed Negative Impacts:

  • A/B test shows +15% signups
  • Holdout shows +10% churn after 3 months
  • Learning: Feature attracts wrong users

How Long to Keep Holdout?

Guideline:

  • 1-3 months for most features
  • 6-12 months for major changes
  • Permanent for critical features

When to Remove Holdout?

Remove if:

  • No long-term differences detected
  • Opportunity cost too high (5% of users on worse experience)
  • Feature is critical (everyone should have it)

Experiment Analysis

Step 1: Compare Primary Metric

Example:

  • Control: 5.0% conversion
  • Treatment: 5.5% conversion
  • Lift: +10% relative
  • P-value: 0.03 ✅

Decision: Treatment is statistically significantly better.
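
This comparison can be scripted with statsmodels. A sketch with hypothetical raw counts (~20,000 users per variant) chosen to match the rates above:

from statsmodels.stats.proportion import (confint_proportions_2indep,
                                          proportions_ztest)

# Hypothetical counts behind the 5.0% vs 5.5% rates above
ctrl_conv, ctrl_n = 1000, 20000      # 5.0%
treat_conv, treat_n = 1100, 20000    # 5.5%

z_stat, p_value = proportions_ztest([treat_conv, ctrl_conv], [treat_n, ctrl_n])
print(f"p = {p_value:.3f}")  # ≈ 0.025 with these counts

# 95% confidence interval on the absolute difference in conversion rates
low, high = confint_proportions_2indep(treat_conv, treat_n, ctrl_conv, ctrl_n)
print(f"95% CI for the absolute lift: [{low:+.2%}, {high:+.2%}]")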

Step 2: Check Secondary Metrics

Example:

  • Revenue per user: $10.50 (control) vs $11.20 (treatment) ✅
  • Time to checkout: 3.2 min (control) vs 2.8 min (treatment) ✅

Decision: Secondary metrics also improved.

Step 3: Check Counter Metrics

Example:

  • Bounce rate: 30% (control) vs 32% (treatment) ⚠️
  • Error rate: 0.5% (control) vs 0.5% (treatment) ✅

Decision: Slight increase in bounce rate, investigate.

Step 4: Segment Analysis

Did it work for everyone?

| Segment    | Control | Treatment | Lift    |
|------------|---------|-----------|---------|
| Mobile     | 4.5%    | 5.2%      | +15% ✅ |
| Desktop    | 5.5%    | 5.8%      | +5% ✅  |
| Free users | 3.0%    | 3.6%      | +20% ✅ |
| Paid users | 7.0%    | 7.1%      | +1% ⚠️  |

Learning: Works great for mobile and free users, minimal impact on paid users.
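
The same two-proportion test can be run per segment. A sketch with hypothetical per-segment counts (mobile and desktop only):

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts roughly matching the table above
segments = {
    "Mobile":  {"ctrl": (450, 10000), "treat": (518, 10000)},    # 4.5% vs ~5.2%
    "Desktop": {"ctrl": (2750, 50000), "treat": (2900, 50000)},  # 5.5% vs 5.8%
}

for name, counts in segments.items():
    (c_conv, c_n), (t_conv, t_n) = counts["ctrl"], counts["treat"]
    _, p = proportions_ztest([t_conv, c_conv], [t_n, c_n])
    lift = (t_conv / t_n) / (c_conv / c_n) - 1
    print(f"{name}: lift {lift:+.0%}, p = {p:.3f}")

Keep in mind that slicing shrinks each sample: per-segment p-values have less power, and checking many segments inflates false positives, so treat segment-level results as directional unless the test was powered for them.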

Step 5: Statistical Significance

Check:

  • P-value < 0.05 ✅
  • Confidence interval doesn't include 0 ✅

Example:

  • Lift: +10%
  • 95% CI: [+5%, +15%]
  • Interpretation: We're 95% confident the true lift is between 5% and 15%.

Step 6: Practical Significance

Is the improvement meaningful?

Example:

  • Statistically significant: Yes (p = 0.04)
  • Lift: +0.1% relative (5.0% → 5.005%)
  • Decision: Not practically significant (too small to matter)

Guideline:

  • Small lift but high volume → Ship (e.g., +0.1 percentage points across 1M users = 1,000 more conversions)
  • Large lift but low volume → Maybe ship (e.g., +50% on 100 users = 50 more conversions)

Decision Framework

Ship If:

  • ✅ Positive: Treatment is better than control
  • ✅ Significant: P-value < 0.05
  • ✅ No Red Flags: Secondary and counter metrics look good
  • ✅ Works for Key Segments: At least works for the majority

Example:

  • Conversion: +10% (p = 0.03) ✅
  • Revenue: +8% (p = 0.05) ✅
  • Bounce rate: No change ✅
  • Works for mobile and desktop ✅
  • Decision: Ship!

Iterate If:

  • ⚠️ Mixed Results: Some metrics up, some down
  • ⚠️ Works for Some Segments Only: E.g., only mobile, not desktop
  • ⚠️ Close to Significance: P = 0.06 (just missed)

Example:

  • Conversion: +10% (p = 0.03) ✅
  • Revenue: -5% (p = 0.08) ⚠️
  • Decision: Iterate. Conversion is up but revenue is down. Investigate why.

Kill If:

  • ❌ Negative: Treatment is worse than control
  • ❌ Not Significant: P-value > 0.05
  • ❌ Opportunity Cost Too High: Could be working on better ideas

Example:

  • Conversion: +2% (p = 0.15) ❌
  • Took 4 weeks to test
  • Decision: Kill. Not significant, move on to next idea.

Tools

Feature Flags

LaunchDarkly:

  • Feature flag management
  • Gradual rollouts
  • Kill switches

Split.io:

  • Feature flags + experimentation
  • Real-time metrics

Unleash:

  • Open-source feature flags
  • Self-hosted option

Experimentation Platforms

Optimizely:

  • Full-stack experimentation
  • Visual editor for web
  • Stats Engine (sequential testing)

VWO (Visual Website Optimizer):

  • A/B testing for web
  • Heatmaps, session recordings
  • Visual editor

GrowthBook:

  • Open-source experimentation
  • Bayesian statistics
  • Feature flags

Statsig:

  • Modern experimentation platform
  • Sequential testing
  • Free tier

Analytics

Amplitude:

  • Product analytics
  • Funnel analysis
  • Cohort analysis

Mixpanel:

  • Event-based analytics
  • A/B test analysis
  • Retention analysis

PostHog:

  • Open-source product analytics
  • Feature flags
  • Session replay

A/B Testing for Engineers

1. Feature Flag Implementation

Node.js (LaunchDarkly):

const express = require('express');
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const app = express();
const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);

app.get('/checkout', async (req, res) => {
  const user = {
    key: req.user.id,
    email: req.user.email,
    custom: {
      plan: req.user.plan
    }
  };

  // Third argument is the default served if the flag can't be evaluated
  const showNewCheckout = await client.variation('new-checkout-flow', user, false);

  if (showNewCheckout) {
    res.render('checkout-new');
  } else {
    res.render('checkout-old');
  }
});

// Serve traffic only once the SDK has loaded flag data
// (top-level await is not available in CommonJS modules)
client.waitForInitialization().then(() => app.listen(3000));

Python (Statsig):

import os

from flask import Flask, render_template
from flask_login import current_user
from statsig import statsig, StatsigUser

app = Flask(__name__)

statsig.initialize(os.environ['STATSIG_SERVER_KEY'])

@app.route('/checkout')
def checkout():
    # The Statsig SDK expects a StatsigUser object rather than a plain dict
    user = StatsigUser(
        user_id=current_user.id,
        email=current_user.email,
        custom={'plan': current_user.plan},
    )

    show_new_checkout = statsig.check_gate(user, 'new_checkout_flow')

    if show_new_checkout:
        return render_template('checkout_new.html')
    return render_template('checkout_old.html')

2. Metric Instrumentation

Segment (Event Tracking):

const Analytics = require('analytics-node');
const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY);

// Track checkout started
analytics.track({
  userId: user.id,
  event: 'Checkout Started',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    cart_value: cart.total,
    items_count: cart.items.length
  }
});

// Track checkout completed
analytics.track({
  userId: user.id,
  event: 'Checkout Completed',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    order_id: order.id,
    revenue: order.total
  }
});

3. Data Pipeline

Architecture:

Application
    ↓ (events)
Segment
    ↓ (forwards to)
├── Amplitude (analytics)
├── Mixpanel (analytics)
├── Data Warehouse (BigQuery, Snowflake)
└── Statsig (experimentation)

4. Results Dashboard

Grafana Dashboard:

{
  "dashboard": {
    "title": "A/B Test: New Checkout Flow",
    "panels": [
      {
        "title": "Conversion Rate by Variant",
        "targets": [
          {
            "expr": "sum(checkout_completed{variant='control'}) / sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_completed{variant='treatment'}) / sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      },
      {
        "title": "Sample Size",
        "targets": [
          {
            "expr": "sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      }
    ]
  }
}

Real Experiment Examples

Example 1: Button Color Test (Classic)

Hypothesis:

"If we change the CTA button from blue to orange, click-through rate will increase by 10%, because orange is more attention-grabbing."

Test:

  • Control: Blue button
  • Treatment: Orange button
  • Sample size: 10,000 per variant
  • Duration: 1 week

Results:

  • Control: 5.2% CTR
  • Treatment: 5.7% CTR
  • Lift: +9.6%
  • P-value: 0.04 ✅

Decision: Ship orange button.

Example 2: Checkout Flow Optimization

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%, because users abandon due to flow length."

Test:

  • Control: 5-step checkout
  • Treatment: 3-step checkout (combined steps)
  • Sample size: 50,000 per variant
  • Duration: 2 weeks

Results:

  • Control: 8.5% conversion
  • Treatment: 9.8% conversion
  • Lift: +15.3%
  • P-value: 0.001 ✅

Secondary Metrics:

  • Time to checkout: 4.2 min → 3.1 min ✅
  • Error rate: 2.1% → 1.8% ✅

Decision: Ship 3-step checkout.

Example 3: Pricing Page Variants

Hypothesis:

"If we show annual pricing first (instead of monthly), annual plan adoption will increase by 25%, because anchoring effect."

Test:

  • Control: Monthly pricing shown first
  • Treatment: Annual pricing shown first
  • Sample size: 20,000 per variant
  • Duration: 3 weeks

Results:

  • Control: 12% annual adoption
  • Treatment: 18% annual adoption
  • Lift: +50%
  • P-value: 0.001 ✅

Counter Metrics:

  • Overall conversion: 10.5% → 10.2% ⚠️ (slight drop)

Decision: Ship, but monitor overall conversion.

Example 4: Onboarding Flow

Hypothesis:

"If we add an interactive tutorial in onboarding, activation rate will increase by 30%, because users don't know how to get started."

Test:

  • Control: No tutorial
  • Treatment: Interactive tutorial (5 steps)
  • Sample size: 15,000 per variant
  • Duration: 2 weeks

Results:

  • Control: 25% activation rate
  • Treatment: 28% activation rate
  • Lift: +12%
  • P-value: 0.08 ❌ (not significant)

Segment Analysis:

  • New users: +20% (p = 0.03) ✅
  • Returning users: +2% (p = 0.5) ❌

Decision: Iterate. Show tutorial only to new users.


Advanced: Bayesian A/B Testing

Traditional (Frequentist) A/B Testing

Approach:

  • Null hypothesis: No difference between A and B
  • P-value: Probability of seeing this result if null is true
  • Reject null if p < 0.05

Interpretation:

"There's a 95% chance the result is not due to random chance."

Bayesian A/B Testing

Approach:

  • Prior belief: What we believe before test
  • Likelihood: Data from test
  • Posterior belief: Updated belief after test

Interpretation:

"There's a 95% probability that B is better than A."

Benefits of Bayesian

  1. Easier to Interpret:

    • "95% probability B is better" (intuitive)
    • vs "p = 0.03" (confusing)
  2. Can Stop Early:

    • No peeking problem
    • Stop when confident enough
  3. Incorporates Prior Knowledge:

    • Use historical data
    • More accurate with small samples

Tools That Use Bayesian

  • GrowthBook: Bayesian by default
  • VWO: Bayesian engine option
  • Google Optimize: Bayesian (discontinued in 2023)

Example

Test:

  • Control: 5.0% conversion (5,000 users)
  • Treatment: 5.5% conversion (5,000 users)

Frequentist:

  • P-value: 0.26 (not significant)
  • Decision: Can't conclude

Bayesian:

  • Probability B > A: 87%
  • Expected lift: +10%
  • Decision: Likely better, but not confident enough (need 95%)
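
A minimal Monte Carlo sketch of the Bayesian read on this test, assuming uniform Beta(1, 1) priors on each conversion rate:

import numpy as np

rng = np.random.default_rng(0)
draws = 100_000

# Posterior = Beta(1 + conversions, 1 + non-conversions)
control = rng.beta(1 + 250, 1 + 4750, draws)     # 250 / 5,000 = 5.0%
treatment = rng.beta(1 + 275, 1 + 4725, draws)   # 275 / 5,000 = 5.5%

prob_better = (treatment > control).mean()
expected_lift = (treatment / control - 1).mean()
print(f"P(B > A) ≈ {prob_better:.0%}, expected lift ≈ {expected_lift:+.0%}")
# Prints roughly: P(B > A) ≈ 87%, expected lift ≈ +10%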

Summary

Quick Reference

Experiment Types:

  • A/B test: Two variants
  • Multivariate: Multiple changes
  • Sequential: Stop early
  • Holdout: Long-term measurement

When to Experiment:

  • Significant features
  • Uncertain outcomes
  • Multiple options
  • Optimization

Process:

  1. Define hypothesis
  2. Choose metrics
  3. Calculate sample size
  4. Set duration
  5. Design variants
  6. Launch
  7. Analyze
  8. Decide

Metrics:

  • Primary: What we're optimizing
  • Secondary: Guardrails
  • Counter: Watch for negatives

Statistical Significance:

  • P-value < 0.05
  • Power > 80%
  • Minimum detectable effect

Common Pitfalls:

  • Peeking
  • Sample ratio mismatch
  • Novelty effect
  • Seasonality

Decision Framework:

  • Ship: Positive, significant, no red flags
  • Iterate: Mixed results
  • Kill: Negative, not significant

Tools:

  • Feature flags: LaunchDarkly, Split.io
  • Experimentation: Optimizely, Statsig, GrowthBook
  • Analytics: Amplitude, Mixpanel, PostHog
