Statistical Analysis

by benchflow-ai


Probability, distributions, hypothesis testing, and statistical inference. Use for A/B testing, experimental design, or statistical validation.



name: statistical-analysis
description: Probability, distributions, hypothesis testing, and statistical inference. Use for A/B testing, experimental design, or statistical validation.
sasmp_version: "1.3.0"
bonded_agent: 02-mathematics-statistics
bond_type: PRIMARY_BOND

Statistical Analysis

Apply statistical methods to understand data and validate findings.

Quick Start

from scipy import stats
import numpy as np

# Descriptive statistics
data = np.array([1, 2, 3, 4, 5])
print(f"Mean: {np.mean(data)}")
print(f"Std: {np.std(data, ddof=1)}")  # ddof=1 gives the sample std; np.std defaults to population std

# Hypothesis testing
group1 = [23, 25, 27, 29, 31]
group2 = [20, 22, 24, 26, 28]
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"P-value: {p_value}")

Core Tests

T-Test (Compare Means)

# One-sample: Compare to population mean
stats.ttest_1samp(data, 100)

# Two-sample: Compare two groups
stats.ttest_ind(group1, group2)

# Paired: Before/after comparison
stats.ttest_rel(before, after)

Chi-Square (Categorical Data)

from scipy.stats import chi2_contingency

observed = np.array([[10, 20], [15, 25]])
chi2, p_value, dof, expected = chi2_contingency(observed)

ANOVA (Multiple Groups)

f_stat, p_value = stats.f_oneway(group1, group2, group3)
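A runnable sketch with three groups (`group3` here is an assumed, illustrative sample, not defined earlier):

```python
from scipy import stats

# Three illustrative samples (made-up numbers)
group1 = [23, 25, 27, 29, 31]
group2 = [20, 22, 24, 26, 28]
group3 = [30, 32, 34, 36, 38]

# One-way ANOVA: do the group means differ?
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Note that a significant ANOVA only says at least one mean differs; a post-hoc test (e.g., Tukey's HSD) is needed to identify which pairs.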

Confidence Intervals

from scipy import stats

confidence_level = 0.95
mean = np.mean(data)
se = stats.sem(data)
ci = stats.t.interval(confidence_level, len(data)-1, mean, se)

print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")

Correlation

# Pearson (linear)
r, p_value = stats.pearsonr(x, y)

# Spearman (rank-based)
rho, p_value = stats.spearmanr(x, y)
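A quick sketch of the difference between the two, using illustrative data that is monotonic but not linear:

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # y = x^2: monotonic but not linear

r, _ = stats.pearsonr(x, y)     # < 1: the linear fit is imperfect
rho, _ = stats.spearmanr(x, y)  # exactly 1: the ranks agree perfectly
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Spearman is the better choice when you expect a monotonic but possibly nonlinear relationship, or when the data contain outliers.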

Distributions

# Normal
x = np.linspace(-3, 3, 100)
pdf = stats.norm.pdf(x, loc=0, scale=1)

# Sampling
samples = np.random.normal(0, 1, 1000)

# Test normality
stat, p_value = stats.shapiro(data)

A/B Testing Framework

def ab_test(control, treatment, alpha=0.05):
    """
    Run an A/B test for statistical significance.

    Returns: dict with 'significant' (bool), 'p_value' (float),
    and 'improvement' (str, percent change vs. control)
    """
    t_stat, p_value = stats.ttest_ind(control, treatment)

    significant = p_value < alpha
    improvement = (np.mean(treatment) - np.mean(control)) / np.mean(control) * 100

    return {
        'significant': significant,
        'p_value': p_value,
        'improvement': f"{improvement:.2f}%"
    }

Interpretation

P-value < 0.05: Reject the null hypothesis (statistically significant at α = 0.05)

P-value >= 0.05: Fail to reject the null hypothesis (not significant)

Common Pitfalls

  • Multiple testing without correction
  • Small sample sizes
  • Ignoring assumptions (normality, independence)
  • Confusing correlation with causation
  • p-hacking (searching for significance)
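The first pitfall is easy to quantify: for independent tests at α = 0.05, the family-wise chance of at least one false positive grows quickly with the number of tests.

```python
# Family-wise error rate for m independent tests at alpha = 0.05
alpha = 0.05
for m in (1, 5, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests -> P(at least one false positive) = {fwer:.2f}")
    # -> 0.05, 0.23, 0.64
```

This is why the Bonferroni (or similar) correction in the Troubleshooting section matters whenever you run many tests.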

Troubleshooting

Common Issues

Problem: Non-normal data for t-test

# Check normality first (Shapiro-Wilk, shown here on group1; check each group)
stat, p = stats.shapiro(group1)
if p < 0.05:
    # Use a non-parametric alternative
    stat, p = stats.mannwhitneyu(group1, group2)  # instead of ttest_ind

Problem: Multiple comparisons inflating false positives

from statsmodels.stats.multitest import multipletests

# Apply Bonferroni correction
p_values = [0.01, 0.03, 0.04, 0.02, 0.06]
rejected, p_adjusted, _, _ = multipletests(p_values, method='bonferroni')

Problem: Underpowered study (sample too small)

from statsmodels.stats.power import TTestIndPower

# Calculate required sample size
power_analysis = TTestIndPower()
sample_size = power_analysis.solve_power(
    effect_size=0.5,  # Medium effect (Cohen's d)
    power=0.8,        # 80% power
    alpha=0.05        # 5% significance
)
print(f"Required n per group: {sample_size:.0f}")

Problem: Heterogeneous variances

# Check with Levene's test
stat, p = stats.levene(group1, group2)
if p < 0.05:
    # Use Welch's t-test (scipy assumes equal variances by default)
    t, p = stats.ttest_ind(group1, group2, equal_var=False)

Problem: Outliers affecting results

from scipy.stats import zscore

# Detect outliers (|z| > 3)
z_scores = np.abs(zscore(data))
clean_data = data[z_scores < 3]

# Or use robust statistics
median = np.median(data)
mad = np.median(np.abs(data - median))  # Median Absolute Deviation
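The MAD can also drive outlier detection directly, via the modified z-score (the 0.6745 constant rescales MAD to estimate σ under normality; the 3.5 cutoff is the common Iglewicz-Hoaglin rule). The data values below are illustrative:

```python
import numpy as np

data = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 8.0])  # 8.0 is an obvious outlier

median = np.median(data)
mad = np.median(np.abs(data - median))
robust_z = 0.6745 * (data - median) / mad  # modified z-score
outliers = data[np.abs(robust_z) > 3.5]    # common cutoff
print(outliers)  # [8.]
```

Unlike mean/std-based z-scores, the median and MAD are barely moved by the outlier itself, so extreme points cannot mask their own detection.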

Debug Checklist

  • Check sample size adequacy (power analysis)
  • Test normality assumption (Shapiro-Wilk)
  • Test homogeneity of variance (Levene's)
  • Check for outliers (z-scores, IQR)
  • Apply multiple testing correction if needed
  • Report effect sizes, not just p-values
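The last checklist item, effect size, can be sketched as Cohen's d with the pooled-standard-deviation formula (using the sample groups from Quick Start):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

group1 = [23, 25, 27, 29, 31]
group2 = [20, 22, 24, 26, 28]
print(f"Cohen's d = {cohens_d(group1, group2):.2f}")  # 0.95
```

Rules of thumb (Cohen, 1988): d ≈ 0.2 is small, 0.5 medium, 0.8 large. A tiny p-value with a negligible d usually means the sample was large, not that the effect matters.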

Related Skills

Team Composition Analysis

This skill should be used when the user asks to "plan team structure", "determine hiring needs", "design org chart", "calculate compensation", "plan equity allocation", or requests organizational design and headcount planning for a startup.


KPI Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.


dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.


SQL Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.


Senior Data Scientist

World-class data science skill for statistical modeling, experimentation, causal inference, and advanced analytics. Expertise in Python (NumPy, Pandas, Scikit-learn), R, SQL, statistical methods, A/B testing, time series, and business intelligence. Includes experiment design, feature engineering, model evaluation, and stakeholder communication. Use when designing experiments, building predictive models, performing causal analysis, or driving data-driven decisions.


Hypogenic

Automated hypothesis generation and testing using large language models. Use this skill when generating scientific hypotheses from datasets, combining literature insights with empirical data, testing hypotheses against observational data, or conducting systematic hypothesis exploration for research discovery in domains like deception detection, AI content detection, mental health analysis, or other empirical research tasks.


Mermaid Diagrams

Comprehensive guide for creating software diagrams using Mermaid syntax. Use when users need to create, visualize, or document software through diagrams including class diagrams (domain modeling, object-oriented design), sequence diagrams (application flows, API interactions, code execution), flowcharts (processes, algorithms, user journeys), entity relationship diagrams (database schemas), C4 architecture diagrams (system context, containers, components), state diagrams, git graphs, pie charts,


UX Researcher Designer

UX research and design toolkit for Senior UX Designer/Researcher including data-driven persona generation, journey mapping, usability testing frameworks, and research synthesis. Use for user research, persona creation, journey mapping, and design validation.


Supabase Postgres Best Practices

Postgres performance optimization and best practices from Supabase. Use this skill when writing, reviewing, or optimizing Postgres queries, schema designs, or database configurations.



Skill Information

Category: Creative
Last Updated: 1/22/2026