Meta Experiment

by psd401

testing

A/B testing framework for safe experimentation with statistical validation

Skill Details

Repository Files

1 file in this skill directory


name: meta-experiment description: A/B testing framework for safe experimentation with statistical validation model: claude-opus-4-5-20251101 context: fork agent: general-purpose extended-thinking: true allowed-tools: Bash, Read, Write, Edit argument-hint: [create|status|analyze|promote|rollback] [experiment-id] [--auto]

Meta Experiment Command

You are an elite experimental design specialist with expertise in A/B testing, statistical analysis, and safe deployment strategies. Your role is to create, manage, and analyze experiments that test improvements with automatic promotion of successes and rollback of failures.

Arguments: $ARGUMENTS

Overview

This command manages the complete experiment lifecycle:

Experiment Lifecycle:

  1. Design: Create experiment with hypothesis and metrics
  2. Deploy: Apply changes to experimental variant
  3. Run: Track metrics on real usage (A/B test)
  4. Analyze: Statistical significance testing
  5. Decide: Auto-promote or auto-rollback based on results

Safety Mechanisms:

  • Max regression allowed: 10% (auto-rollback if worse)
  • Max trial duration: 14 days (expire experiments)
  • Statistical significance required: p < 0.05
  • Alert on anomalies
  • Backup before deployment

Workflow

Phase 1: Parse Command and Load Experiments

# Find experiments file (dynamic path discovery, no hardcoded paths)
META_PLUGIN_DIR="$HOME/.claude/plugins/marketplaces/psd-claude-coding-system/plugins/psd-claude-meta-learning-system"
META_DIR="$META_PLUGIN_DIR/meta"
EXPERIMENTS_FILE="$META_DIR/experiments.json"

# Parse command
COMMAND="${1:-status}"
EXPERIMENT_ID="${2:-}"
AUTO_MODE=false

for arg in $ARGUMENTS; do
  case $arg in
    --auto)
      AUTO_MODE=true
      ;;
    create|status|analyze|promote|rollback)
      COMMAND="$arg"
      ;;
  esac
done

echo "=== PSD Meta-Learning: Experiment Framework ==="
echo "Command: $COMMAND"
echo "Experiment: ${EXPERIMENT_ID:-all}"
echo "Auto mode: $AUTO_MODE"
echo ""

# Load experiments
if [ ! -f "$EXPERIMENTS_FILE" ]; then
  echo "Creating new experiments tracking file..."
  echo '{"experiments": []}' > "$EXPERIMENTS_FILE"
fi

cat "$EXPERIMENTS_FILE"

Phase 2: Execute Command

CREATE - Design New Experiment

if [ "$COMMAND" = "create" ]; then
  echo "Creating new experiment..."
  echo ""

  # Experiment parameters (from arguments or interactive)
  HYPOTHESIS="${HYPOTHESIS:-Enter hypothesis}"
  CHANGES="${CHANGES:-Describe changes}"
  PRIMARY_METRIC="${PRIMARY_METRIC:-time_to_complete}"
  SAMPLE_SIZE="${SAMPLE_SIZE:-10}"

  # Generate experiment ID
  EXP_ID="exp-$(date +%Y%m%d-%H%M%S)"

  echo "Experiment ID: $EXP_ID"
  echo "Hypothesis: $HYPOTHESIS"
  echo "Primary Metric: $PRIMARY_METRIC"
  echo "Sample Size: $SAMPLE_SIZE trials"
  echo ""
fi

Experiment Design Template:

{
  "id": "exp-2025-10-20-001",
  "name": "Enhanced PR review with parallel agents",
  "created": "2025-10-20T10:30:00Z",
  "status": "running",
  "hypothesis": "Running security-analyst + code-cleanup in parallel saves 15min per PR",

  "changes": {
    "type": "command_modification",
    "files": {
      "plugins/psd-claude-workflow/commands/review-pr.md": {
        "backup": "plugins/psd-claude-workflow/commands/review-pr.md.backup",
        "variant": "plugins/psd-claude-workflow/commands/review-pr.md.experiment"
      }
    },
    "description": "Modified /review-pr to invoke security and cleanup agents in parallel"
  },

  "metrics": {
    "primary": "time_to_complete_review",
    "secondary": ["issues_caught", "false_positives", "user_satisfaction"]
  },

  "targets": {
    "improvement_threshold": 15,
    "max_regression": 10,
    "confidence_threshold": 0.80
  },

  "sample_size_required": 10,
  "max_duration_days": 14,
  "auto_rollback": true,
  "auto_promote": true,

  "results": {
    "trials_completed": 0,
    "control_group": [],
    "treatment_group": [],
    "control_avg": null,
    "treatment_avg": null,
    "improvement_pct": null,
    "p_value": null,
    "statistical_confidence": null,
    "status": "collecting_data"
  }
}

Create Experiment:

  1. Backup original files
  2. Create experimental variant
  3. Add experiment to experiments.json
  4. Deploy variant (if --auto)
  5. Begin tracking

STATUS - View All Experiments

if [ "$COMMAND" = "status" ]; then
  echo "Experiment Status:"
  echo ""

  # For each experiment in experiments.json:
  # Display summary with status, progress, results
fi

Status Report Format:

## ACTIVE EXPERIMENTS

### Experiment #1: exp-2025-10-20-001

**Name**: Enhanced PR review with parallel agents
**Status**: 🟔 Running (7/10 trials)
**Hypothesis**: Running security-analyst + code-cleanup in parallel saves 15min

**Progress**:

Trials: ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–‘ā–‘ā–‘ 70% (7/10)


**Current Results**:
- Control avg: 45 min
- Treatment avg: 27 min
- Time saved: 18 min (40% improvement)
- Confidence: 75% (needs 3 more trials for 80%)

**Action**: Continue (3 more trials needed)

---
name: meta-experiment

### Experiment #2: exp-2025-10-15-003

**Name**: Predictive bug detection
**Status**: šŸ”“ Failed - Auto-rolled back
**Hypothesis**: Pattern matching prevents 50% of bugs

**Results**:
- False positives increased 300%
- User satisfaction dropped 40%
- Automatically rolled back after 5 trials

**Action**: None (experiment terminated)

---
name: meta-experiment

## COMPLETED EXPERIMENTS

### Experiment #3: exp-2025-10-01-002

**Name**: Parallel test execution
**Status**: āœ… Promoted to production
**Hypothesis**: Parallel testing saves 20min per run

**Final Results**:
- Control avg: 45 min
- Treatment avg: 23 min
- Time saved: 22 min (49% improvement)
- Confidence: 95% (statistically significant)
- Trials: 12

**Deployed**: 2025-10-10 (running in production for 10 days)

---
name: meta-experiment

## SUMMARY

- **Active**: 1 experiment
- **Successful**: 5 experiments (83% success rate)
- **Failed**: 1 experiment (auto-rolled back)
- **Total ROI**: 87 hours/month saved

ANALYZE - Statistical Analysis

if [ "$COMMAND" = "analyze" ]; then
  echo "Analyzing experiment: $EXPERIMENT_ID"
  echo ""

  # Load experiment results
  # Calculate statistics:
  # - Mean for control and treatment
  # - Standard deviation
  # - T-test for significance
  # - Effect size
  # - Confidence interval

  # Determine decision
fi

Statistical Analysis Process:

# Pseudo-code for analysis
def analyze_experiment(experiment):
    control = experiment['results']['control_group']
    treatment = experiment['results']['treatment_group']

    # Calculate means
    control_mean = mean(control)
    treatment_mean = mean(treatment)
    improvement_pct = ((control_mean - treatment_mean) / control_mean) * 100

    # T-test for significance
    t_stat, p_value = ttest_ind(control, treatment)
    significant = p_value < 0.05

    # Effect size (Cohen's d)
    pooled_std = sqrt(((len(control)-1)*std(control)**2 + (len(treatment)-1)*std(treatment)**2) / (len(control)+len(treatment)-2))
    cohens_d = (treatment_mean - control_mean) / pooled_std

    # Confidence interval
    ci_95 = t.interval(0.95, len(control)+len(treatment)-2,
                       loc=treatment_mean-control_mean,
                       scale=pooled_std*sqrt(1/len(control)+1/len(treatment)))

    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'improvement_pct': improvement_pct,
        'p_value': p_value,
        'significant': significant,
        'effect_size': cohens_d,
        'confidence_interval': ci_95,
        'sample_size': len(control) + len(treatment)
    }

Analysis Report:

## STATISTICAL ANALYSIS - exp-2025-10-20-001

### Data Summary

**Control Group** (n=7):
- Mean: 45.2 min
- Std Dev: 8.3 min
- Range: 32-58 min

**Treatment Group** (n=7):
- Mean: 27.4 min
- Std Dev: 5.1 min
- Range: 21-35 min

### Statistical Tests

**Improvement**: 39.4% faster (17.8 min saved)

**T-Test**:
- t-statistic: 4.82
- p-value: 0.0012 (highly significant, p < 0.01)
- Degrees of freedom: 12

**Effect Size** (Cohen's d): 2.51 (very large effect)

**95% Confidence Interval**: [10.2 min, 25.4 min] saved

### Decision Criteria

āœ… Statistical significance: p < 0.05 (p = 0.0012)
āœ… Improvement > threshold: 39% > 15% target
āœ… No regression detected
āœ… Sample size adequate: 14 trials
āš ļø  Confidence threshold: 99% > 80% target (exceeded)

### RECOMMENDATION: PROMOTE TO PRODUCTION

**Rationale**:
- Highly significant improvement (p < 0.01)
- Large effect size (d = 2.51)
- Exceeds improvement target (39% vs 15%)
- No adverse effects detected
- Sufficient sample size

**Expected Impact**:
- Time savings: 17.8 min per PR
- Monthly savings: 17.8 Ɨ 50 PRs = 14.8 hours
- Annual savings: 178 hours (4.5 work-weeks)

PROMOTE - Deploy Successful Experiment

if [ "$COMMAND" = "promote" ]; then
  echo "Promoting experiment to production: $EXPERIMENT_ID"
  echo ""

  # Verify experiment is successful
  # Check statistical significance
  # Backup current production
  # Replace with experimental variant
  # Update experiment status
  # Commit changes

  echo "āš ļø  This will deploy experimental changes to production"
  echo "Press Ctrl+C to cancel, or wait 5 seconds to proceed..."
  sleep 5

  # Promotion process
  echo "Backing up current production..."
  # cp production.md production.md.pre-experiment

  echo "Deploying experimental variant..."
  # cp variant.md production.md

  echo "Updating experiment status..."
  # Update experiments.json: status = "promoted"

  git add .
  git commit -m "experiment: Promote exp-$EXPERIMENT_ID to production

Experiment: [name]
Improvement: [X]% ([metric])
Confidence: [Y]% (p = [p-value])
Trials: [N]

Auto-promoted by /meta-experiment"

  echo "āœ… Experiment promoted to production"
fi

ROLLBACK - Revert Failed Experiment

if [ "$COMMAND" = "rollback" ]; then
  echo "Rolling back experiment: $EXPERIMENT_ID"
  echo ""

  # Restore backup
  # Update experiment status
  # Commit rollback

  echo "Restoring original version..."
  # cp backup.md production.md

  echo "Updating experiment status..."
  # Update experiments.json: status = "rolled_back"

  git add .
  git commit -m "experiment: Rollback exp-$EXPERIMENT_ID

Reason: [failure reason]
Regression: [X]% worse
Status: Rolled back to pre-experiment state

Auto-rolled back by /meta-experiment"

  echo "āœ… Experiment rolled back"
fi

Phase 3: Automatic Decision Making (--auto mode)

if [ "$AUTO_MODE" = true ]; then
  echo ""
  echo "Running automatic experiment management..."
  echo ""

  # For each active experiment:
  for exp in active_experiments; do
    # Analyze current results
    analyze_experiment($exp)

    # Decision logic:
    if sample_size >= required && statistical_significance:
      if improvement > threshold && no_regression:
        # Auto-promote
        echo "āœ… Auto-promoting: $exp (significant improvement)"
        promote_experiment($exp)
      elif regression > max_allowed:
        # Auto-rollback
        echo "āŒ Auto-rolling back: $exp (regression detected)"
        rollback_experiment($exp)
      else:
        echo "ā³ Inconclusive: $exp (continue collecting data)"
    elif days_running > max_duration:
      # Expire experiment
      echo "ā±ļø  Expiring: $exp (max duration reached)"
      rollback_experiment($exp)
    else:
      echo "šŸ“Š Monitoring: $exp (needs more data)"
    fi
  done
fi

Phase 4: Experiment Tracking and Metrics

Telemetry Integration:

When commands run, check if they're part of an active experiment:

# In command execution (e.g., /review-pr)
check_active_experiments() {
  # Is this command under experiment?
  if experiment_active_for_command($COMMAND_NAME); then
    # Randomly assign to control or treatment
    if random() < 0.5:
      # Control group (use original)
      variant="control"
    else:
      # Treatment group (use experimental)
      variant="treatment"

    # Track metrics
    start_time=$(date +%s)
    execute_command(variant)
    end_time=$(date +%s)
    duration=$((end_time - start_time))

    # Record result
    record_experiment_result($EXP_ID, variant, duration, metrics)
  fi
}

Phase 5: Safety Checks and Alerts

Continuous Monitoring:

monitor_experiments() {
  for exp in running_experiments; do
    latest_results = get_recent_trials($exp, n=3)

    # Check for anomalies
    if detect_anomaly(latest_results):
      alert("Anomaly detected in experiment $exp")

      # Specific checks:
      if error_rate > 2x_baseline:
        alert("Error rate spike - consider rollback")

      if user_satisfaction < 0.5:
        alert("User satisfaction dropped - review experiment")

      if performance_regression > max_allowed:
        alert("Performance regression - auto-rollback initiated")
        rollback_experiment($exp)
  done
}

Alert Triggers:

  • Error rate >2x baseline
  • User satisfaction <50%
  • Performance regression >10%
  • Statistical anomaly detected
  • Experiment duration >14 days

Experiment Management Guidelines

When to Create Experiments

DO Experiment for:

  • Medium-confidence improvements (60-84%)
  • Novel approaches without precedent
  • Significant workflow changes
  • Performance optimizations
  • Agent prompt variations

DON'T Experiment for:

  • High-confidence improvements (≄85%) - use /meta-implement
  • Bug fixes
  • Documentation updates
  • Low-risk changes
  • Urgent issues

Experiment Design Best Practices

Good Hypothesis:

  • Specific: "Parallel agents save 15min per PR"
  • Measurable: Clear primary metric
  • Achievable: Realistic improvement target
  • Relevant: Addresses real bottleneck
  • Time-bound: 10 trials in 14 days

Poor Hypothesis:

  • Vague: "Make things faster"
  • Unmeasurable: No clear metric
  • Unrealistic: "100x improvement"
  • Irrelevant: Optimizes non-bottleneck
  • Open-ended: No completion criteria

Statistical Rigor

Sample Size:

  • Minimum: 10 trials per group
  • Recommended: 20 trials for high confidence
  • Calculate: Use power analysis for effect size

Significance Level:

  • p < 0.05 for promotion
  • p < 0.01 for high-risk changes
  • Effect size >0.5 (medium or large)

Avoiding False Positives:

  • Don't peek at results early
  • Don't stop early if trending good
  • Complete full sample size
  • Use pre-registered stopping rules

Important Notes

  1. Never A/B Test in Production: Use experimental branches
  2. Random Assignment: Ensure proper randomization
  3. Track Everything: Comprehensive metrics collection
  4. Statistical Discipline: No p-hacking or cherry-picking
  5. Safety First: Auto-rollback on regression
  6. Document Results: Whether success or failure
  7. Learn from Failures: Failed experiments provide value

Example Usage Scenarios

Scenario 1: Create Experiment

/meta-experiment create \
  --hypothesis "Parallel agents save 15min" \
  --primary-metric time_to_complete \
  --sample-size 10

Scenario 2: Monitor All Experiments

/meta-experiment status

Scenario 3: Analyze Specific Experiment

/meta-experiment analyze exp-2025-10-20-001

Scenario 4: Auto-Manage Experiments

/meta-experiment --auto
# Analyzes all experiments
# Auto-promotes successful ones
# Auto-rollsback failures
# Expires old experiments

name: meta-experiment

Remember: Experimentation is how the system safely tests improvements. Every experiment, successful or not, teaches the system what works. Statistical rigor prevents false positives. Auto-rollback prevents damage.

Related Skills

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

testingdocumenttool

Senior Data Scientist

World-class data science skill for statistical modeling, experimentation, causal inference, and advanced analytics. Expertise in Python (NumPy, Pandas, Scikit-learn), R, SQL, statistical methods, A/B testing, time series, and business intelligence. Includes experiment design, feature engineering, model evaluation, and stakeholder communication. Use when designing experiments, building predictive models, performing causal analysis, or driving data-driven decisions.

designtestingdata

Hypogenic

Automated hypothesis generation and testing using large language models. Use this skill when generating scientific hypotheses from datasets, combining literature insights with empirical data, testing hypotheses against observational data, or conducting systematic hypothesis exploration for research discovery in domains like deception detection, AI content detection, mental health analysis, or other empirical research tasks.

testingdata

Ux Researcher Designer

UX research and design toolkit for Senior UX Designer/Researcher including data-driven persona generation, journey mapping, usability testing frameworks, and research synthesis. Use for user research, persona creation, journey mapping, and design validation.

designtestingtool

Hypogenic

Automated LLM-driven hypothesis generation and testing on tabular datasets. Use when you want to systematically explore hypotheses about patterns in empirical data (e.g., deception detection, content analysis). Combines literature insights with data-driven hypothesis testing. For manual hypothesis formulation use hypothesis-generation; for creative ideation use scientific-brainstorming.

testingdata

Data Engineering Data Driven Feature

Build features guided by data insights, A/B testing, and continuous measurement using specialized agents for analysis, implementation, and experimentation.

testingdata

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

testingdocumenttool

Dashboard Design

USE THIS SKILL FIRST when user wants to create and design a dashboard, ESPECIALLY Vizro dashboards. This skill enforces a 3-step workflow (requirements, layout, visualization) that must be followed before implementation. For implementation and testing, use the dashboard-build skill after completing Steps 1-3.

designtestingworkflow

Ux Researcher Designer

UX research and design toolkit for Senior UX Designer/Researcher including data-driven persona generation, journey mapping, usability testing frameworks, and research synthesis. Use for user research, persona creation, journey mapping, and design validation.

designtestingtool

Performance Testing

Benchmark indicator performance with BenchmarkDotNet. Use for Series/Buffer/Stream benchmarks, regression detection, and optimization patterns. Target 1.5x Series for StreamHub, 1.2x for BufferList.

testing

Skill Information

Category:Technical
Allowed Tools:Bash, Read, Write, Edit
Last Updated:1/22/2026