Ab Testing Statistician

by iammarkps

designtesting

Expert in statistical analysis for blind A/B and ABX audio testing. Validates randomization, calculates statistical significance, and ensures proper experimental design. Use when implementing A/B test features or analyzing test results.

Skill Details

Repository Files

4 files in this skill directory


name: ab-testing-statistician description: Expert in statistical analysis for blind A/B and ABX audio testing. Validates randomization, calculates statistical significance, and ensures proper experimental design. Use when implementing A/B test features or analyzing test results.

A/B Testing Statistician

Specialized agent for designing and validating blind audio comparison tests (A/B, Blind AB, ABX) with proper statistical analysis.

Overview of Audio A/B Testing

Test Modes

Mode Description User Knows? Purpose
AB Switch between A and B Yes Quick comparison, training
Blind AB A and B randomly mapped to Options 1 and 2 No Unbiased preference detection
ABX X is secretly either A or B, user guesses No Audibility testing (can you hear the difference?)

Why Blind Testing Matters

Confirmation Bias: Listeners tend to prefer what they expect to be better.

Example:

Non-blind: "This expensive cable sounds clearer!" (placebo effect)
Blind: "I can't tell the difference" (objective reality)

Session Management

Session State (Rust)

#[derive(Clone, Serialize, Deserialize)]
pub struct ABSession {
    pub mode: ABTestMode,           // AB, BlindAB, or ABX
    pub preset_a_name: String,
    pub preset_b_name: String,
    pub trim_db: f32,               // Loudness compensation for B
    pub total_trials: usize,
    pub current_trial: usize,
    pub hidden_mapping: Vec<bool>,  // For BlindAB: true = Option1 is A
    pub x_is_a: Vec<bool>,          // For ABX: true = X is A
    pub answers: Vec<ABAnswer>,     // User responses
}

#[derive(Clone, Serialize, Deserialize)]
pub enum ABTestMode {
    AB,       // Non-blind switching
    BlindAB,  // Blind preference test
    ABX,      // Blind audibility test
}

#[derive(Clone, Serialize, Deserialize)]
pub struct ABAnswer {
    pub trial: usize,
    pub selected_option: String,    // "A", "B", "1", "2", or "X"
    pub timestamp: u64,             // Milliseconds since session start
}

Randomization (Critical!)

BlindAB Mode: Each trial randomly maps A/B to Options 1/2:

pub fn create_blind_ab_session(
    preset_a: String,
    preset_b: String,
    num_trials: usize,
    trim_db: f32,
) -> ABSession {
    use rand::Rng;
    let mut rng = rand::thread_rng();

    // Randomize each trial independently
    let hidden_mapping: Vec<bool> = (0..num_trials)
        .map(|_| rng.gen_bool(0.5)) // 50% chance Option1 = A
        .collect();

    ABSession {
        mode: ABTestMode::BlindAB,
        preset_a_name: preset_a,
        preset_b_name: preset_b,
        trim_db,
        total_trials: num_trials,
        current_trial: 0,
        hidden_mapping,
        x_is_a: vec![],
        answers: vec![],
    }
}

ABX Mode: X is randomly set to A or B for each trial:

pub fn create_abx_session(
    preset_a: String,
    preset_b: String,
    num_trials: usize,
    trim_db: f32,
) -> ABSession {
    use rand::Rng;
    let mut rng = rand::thread_rng();

    // Randomize X for each trial
    let x_is_a: Vec<bool> = (0..num_trials)
        .map(|_| rng.gen_bool(0.5)) // 50% chance X = A
        .collect();

    ABSession {
        mode: ABTestMode::ABX,
        preset_a_name: preset_a,
        preset_b_name: preset_b,
        trim_db,
        total_trials: num_trials,
        current_trial: 0,
        hidden_mapping: vec![],
        x_is_a,
        answers: vec![],
    }
}

Critical Rule: Randomize PER TRIAL, not once for all trials!

❌ Wrong:

let option1_is_a = rng.gen_bool(0.5);
// Use same mapping for all trials

✅ Correct:

let hidden_mapping: Vec<bool> = (0..num_trials)
    .map(|_| rng.gen_bool(0.5))
    .collect();

Loudness Compensation (Trim Parameter)

Problem: Louder = perceived as "better" (Fletcher-Munson curves)

Solution: Level-match presets before testing

Auto-Calculate Trim

pub fn calculate_auto_trim(
    bands_a: &[ParametricBand],
    preamp_a: f32,
    bands_b: &[ParametricBand],
    preamp_b: f32,
) -> f32 {
    use crate::audio_math::calculate_peak_gain;

    let peak_a = calculate_peak_gain(bands_a, preamp_a);
    let peak_b = calculate_peak_gain(bands_b, preamp_b);

    // Adjust B to match A's peak level
    peak_a - peak_b
}

Apply Trim to Preset B

pub fn apply_preset_with_trim(
    bands: &[ParametricBand],
    preamp: f32,
    trim_db: f32,
) -> Result<(), String> {
    let adjusted_preamp = preamp + trim_db;

    // Apply to EqualizerAPO
    write_eapo_config(bands, adjusted_preamp)?;

    Ok(())
}

Example:

Preset A: Peak gain = -2 dB
Preset B: Peak gain = +1 dB
Trim = -2 - (+1) = -3 dB

Apply Preset B with -3 dB trim → Both have -2 dB peak

Statistical Analysis

Preference Analysis (BlindAB)

Count how many times each preset was preferred:

pub struct PreferenceResults {
    pub a_selected: usize,
    pub b_selected: usize,
    pub total_trials: usize,
    pub a_percentage: f64,
    pub b_percentage: f64,
    pub p_value: f64,  // Statistical significance
}

pub fn analyze_blind_ab(session: &ABSession) -> PreferenceResults {
    let mut a_count = 0;
    let mut b_count = 0;

    for (i, answer) in session.answers.iter().enumerate() {
        let option1_is_a = session.hidden_mapping[i];

        let selected_a = match answer.selected_option.as_str() {
            "1" => option1_is_a,
            "2" => !option1_is_a,
            _ => continue,
        };

        if selected_a {
            a_count += 1;
        } else {
            b_count += 1;
        }
    }

    let total = a_count + b_count;
    let a_pct = (a_count as f64 / total as f64) * 100.0;
    let b_pct = (b_count as f64 / total as f64) * 100.0;

    // Binomial test: is this significantly different from 50/50?
    let p_value = binomial_test(a_count, total, 0.5);

    PreferenceResults {
        a_selected: a_count,
        b_selected: b_count,
        total_trials: total,
        a_percentage: a_pct,
        b_percentage: b_pct,
        p_value,
    }
}

ABX Analysis (Audibility Test)

Count correct vs incorrect identifications:

pub struct ABXResults {
    pub correct: usize,
    pub incorrect: usize,
    pub total_trials: usize,
    pub accuracy: f64,
    pub p_value: f64,
}

pub fn analyze_abx(session: &ABSession) -> ABXResults {
    let mut correct = 0;
    let mut incorrect = 0;

    for (i, answer) in session.answers.iter().enumerate() {
        let x_is_a = session.x_is_a[i];

        let guessed_a = match answer.selected_option.as_str() {
            "A" => true,
            "B" => false,
            _ => continue,
        };

        if guessed_a == x_is_a {
            correct += 1;
        } else {
            incorrect += 1;
        }
    }

    let total = correct + incorrect;
    let accuracy = (correct as f64 / total as f64) * 100.0;

    // Binomial test: is this better than 50% guessing?
    let p_value = binomial_test(correct, total, 0.5);

    ABXResults {
        correct,
        incorrect,
        total_trials: total,
        accuracy,
        p_value,
    }
}

Binomial Test (P-Value)

Null Hypothesis: User is guessing randomly (50% chance)

P-Value: Probability of seeing this result (or more extreme) by chance

fn binomial_test(successes: usize, trials: usize, p_null: f64) -> f64 {
    use statrs::distribution::{Binomial, Discrete};

    let dist = Binomial::new(p_null, trials as u64).unwrap();

    // Two-tailed test
    let observed = successes as u64;
    let expected = (trials as f64 * p_null) as u64;

    let p_observed = dist.pmf(observed);
    let mut p_value = p_observed;

    // Add probabilities of more extreme outcomes
    for k in 0..=trials as u64 {
        let p_k = dist.pmf(k);
        if p_k <= p_observed && k != observed {
            p_value += p_k;
        }
    }

    p_value.min(1.0)
}

Interpretation:

  • p < 0.05: Significant - unlikely to be chance (95% confidence)
  • p < 0.01: Highly significant - very unlikely to be chance (99% confidence)
  • p >= 0.05: Not significant - could be random guessing

Example:

ABX Test: 15/20 correct (75% accuracy)
P-value = 0.041

Interpretation: Statistically significant at 95% level.
User can reliably hear the difference.

Sample Size Requirements

How many trials needed for reliable results?

Rule of Thumb:

  • Small effect: 50+ trials
  • Medium effect: 20-30 trials
  • Large effect: 10-15 trials

Formula (ABX test, 80% power):

n = (Z_α/2 + Z_β)² * p(1-p) / (p - 0.5)²

Where:
- Z_α/2 = 1.96 (for α = 0.05, two-tailed)
- Z_β = 0.84 (for 80% power)
- p = expected accuracy

Example:

Expected accuracy: 70%
n = (1.96 + 0.84)² * 0.7 * 0.3 / (0.7 - 0.5)²
n ≈ 41 trials

Recommended Trial Counts

pub fn recommended_trial_count(expected_accuracy: f64) -> usize {
    if expected_accuracy <= 0.55 {
        100 // Very subtle difference
    } else if expected_accuracy <= 0.65 {
        50  // Small difference
    } else if expected_accuracy <= 0.75 {
        25  // Medium difference
    } else {
        15  // Large difference
    }
}

Results Export

CSV Format

pub fn export_to_csv(session: &ABSession) -> String {
    let mut csv = String::from("Trial,Option1,Option2,Selected,Timestamp\n");

    for (i, answer) in session.answers.iter().enumerate() {
        let (opt1, opt2) = if session.mode == ABTestMode::BlindAB {
            if session.hidden_mapping[i] {
                (&session.preset_a_name, &session.preset_b_name)
            } else {
                (&session.preset_b_name, &session.preset_a_name)
            }
        } else {
            ("A", "B")
        };

        csv.push_str(&format!(
            "{},{},{},{},{}\n",
            i + 1,
            opt1,
            opt2,
            answer.selected_option,
            answer.timestamp
        ));
    }

    csv
}

Output:

Trial,Option1,Option2,Selected,Timestamp
1,Flat,Boosted,1,1234
2,Boosted,Flat,2,2456
3,Flat,Boosted,1,3789

JSON Format

pub fn export_to_json(
    session: &ABSession,
    results: &PreferenceResults,
) -> String {
    let export = serde_json::json!({
        "mode": session.mode,
        "presets": {
            "a": session.preset_a_name,
            "b": session.preset_b_name,
        },
        "trim_db": session.trim_db,
        "trials": session.total_trials,
        "results": {
            "a_selected": results.a_selected,
            "b_selected": results.b_selected,
            "a_percentage": results.a_percentage,
            "b_percentage": results.b_percentage,
            "p_value": results.p_value,
            "significant": results.p_value < 0.05,
        },
        "answers": session.answers,
    });

    serde_json::to_string_pretty(&export).unwrap()
}

Experimental Design Best Practices

1. Counterbalancing

Ensure equal distribution of A and B across trials:

pub fn validate_counterbalancing(hidden_mapping: &[bool]) -> f64 {
    let a_count = hidden_mapping.iter().filter(|&&x| x).count();
    let total = hidden_mapping.len();
    let ratio = a_count as f64 / total as f64;

    // Should be close to 0.5
    (ratio - 0.5).abs()
}

Warning threshold:

if validate_counterbalancing(&session.hidden_mapping) > 0.15 {
    println!("Warning: Unbalanced randomization (>15% deviation from 50/50)");
}

2. Trial Independence

Each trial should be independent:

  • ✅ Randomize per trial
  • ❌ Use patterns (ABABAB...)
  • ❌ Fixed order

3. Rest Breaks

Prevent listener fatigue:

if (currentTrial % 10 === 0 && currentTrial !== totalTrials) {
  showRestBreakDialog();
}

4. Reference Switching

Allow listeners to switch between options multiple times before answering:

let switchCount = 0;

function handleSwitch() {
  switchCount++;
  applyOpposite();
}

// Log switch count as quality metric

Common Pitfalls

❌ Volume Mismatch

// WRONG: Apply presets without level matching
applyPresetA();
applyPresetB();

// CORRECT: Apply with trim
applyPreset(presetA, 0);
applyPreset(presetB, trimDb);

❌ Non-Random Patterns

// WRONG: Alternating pattern
let hidden_mapping = vec![true, false, true, false, ...];

// CORRECT: True randomization
let hidden_mapping: Vec<bool> = (0..trials)
    .map(|_| rng.gen_bool(0.5))
    .collect();

❌ Ignoring P-Value

// WRONG: Report raw percentages without significance
"Preset A preferred 55% of the time"

// CORRECT: Include statistical context
"Preset A preferred 55% (p=0.42, not significant)"

❌ Too Few Trials

// WRONG: Only 5 trials
const trials = 5; // Unreliable!

// CORRECT: Adequate sample size
const trials = 20; // Minimum for medium effects

Validation Tests

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_randomization_distribution() {
        let session = create_blind_ab_session("A".into(), "B".into(), 1000, 0.0);

        let a_count = session.hidden_mapping.iter().filter(|&&x| x).count();
        let ratio = a_count as f64 / 1000.0;

        // With 1000 trials, should be very close to 0.5
        assert!((ratio - 0.5).abs() < 0.05, "Randomization biased: {}", ratio);
    }

    #[test]
    fn test_trial_independence() {
        let session = create_blind_ab_session("A".into(), "B".into(), 100, 0.0);

        // Count runs (consecutive same values)
        let mut runs = 1;
        for i in 1..session.hidden_mapping.len() {
            if session.hidden_mapping[i] != session.hidden_mapping[i - 1] {
                runs += 1;
            }
        }

        // Expected runs ≈ n/2 for random data
        let expected_runs = 50.0;
        let deviation = (runs as f64 - expected_runs).abs() / expected_runs;

        assert!(deviation < 0.3, "Trials may not be independent");
    }

    #[test]
    fn test_binomial_test() {
        // 20/20 correct should be highly significant
        let p = binomial_test(20, 20, 0.5);
        assert!(p < 0.001);

        // 10/20 correct should not be significant (random guessing)
        let p = binomial_test(10, 20, 0.5);
        assert!(p > 0.05);
    }
}

Reference Materials

  • references/statistical_tests.md - Detailed statistical methods
  • references/experimental_design.md - Best practices for audio testing
  • references/sample_size_calculator.md - Power analysis formulas

Related Skills

Team Composition Analysis

This skill should be used when the user asks to "plan team structure", "determine hiring needs", "design org chart", "calculate compensation", "plan equity allocation", or requests organizational design and headcount planning for a startup.

artdesign

Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

designdata

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

testingdocumenttool

Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

designdata

Senior Data Scientist

World-class data science skill for statistical modeling, experimentation, causal inference, and advanced analytics. Expertise in Python (NumPy, Pandas, Scikit-learn), R, SQL, statistical methods, A/B testing, time series, and business intelligence. Includes experiment design, feature engineering, model evaluation, and stakeholder communication. Use when designing experiments, building predictive models, performing causal analysis, or driving data-driven decisions.

designtestingdata

Hypogenic

Automated hypothesis generation and testing using large language models. Use this skill when generating scientific hypotheses from datasets, combining literature insights with empirical data, testing hypotheses against observational data, or conducting systematic hypothesis exploration for research discovery in domains like deception detection, AI content detection, mental health analysis, or other empirical research tasks.

testingdata

Mermaid Diagrams

Comprehensive guide for creating software diagrams using Mermaid syntax. Use when users need to create, visualize, or document software through diagrams including class diagrams (domain modeling, object-oriented design), sequence diagrams (application flows, API interactions, code execution), flowcharts (processes, algorithms, user journeys), entity relationship diagrams (database schemas), C4 architecture diagrams (system context, containers, components), state diagrams, git graphs, pie charts,

artdesigncode

Ux Researcher Designer

UX research and design toolkit for Senior UX Designer/Researcher including data-driven persona generation, journey mapping, usability testing frameworks, and research synthesis. Use for user research, persona creation, journey mapping, and design validation.

designtestingtool

Supabase Postgres Best Practices

Postgres performance optimization and best practices from Supabase. Use this skill when writing, reviewing, or optimizing Postgres queries, schema designs, or database configurations.

designdata

Hypogenic

Automated LLM-driven hypothesis generation and testing on tabular datasets. Use when you want to systematically explore hypotheses about patterns in empirical data (e.g., deception detection, content analysis). Combines literature insights with data-driven hypothesis testing. For manual hypothesis formulation use hypothesis-generation; for creative ideation use scientific-brainstorming.

testingdata

Skill Information

Category:Creative
Last Updated:1/5/2026