Data Science

by eyadsibai

name: data-science
description: Use when "statistical modeling", "A/B testing", "experiment design", "causal inference", "predictive modeling", or asking about "hypothesis testing", "feature engineering", "data analysis", "pandas", "scikit-learn"
version: 1.0.0

Data Science Guide

Statistical modeling, experimentation, and advanced analytics.

When to Use

  • Designing A/B tests and experiments
  • Building predictive models
  • Performing causal analysis
  • Feature engineering
  • Statistical hypothesis testing

Tech Stack

  • Languages: Python, SQL, R
  • Analysis: NumPy, Pandas, SciPy
  • ML: Scikit-learn, XGBoost, LightGBM
  • Visualization: Matplotlib, Seaborn, Plotly
  • Statistics: Statsmodels, PyMC
  • Notebooks: Jupyter, VS Code

Experiment Design

A/B Test Framework

import numpy as np
import scipy.stats as stats
from statsmodels.stats.power import TTestIndPower

def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """Calculate required sample size per group for an A/B test.

    mde is the minimum detectable effect in absolute terms
    (e.g. 0.005 for a 0.5-point lift on a 5% baseline).
    """
    # Standardized effect size for a proportion metric
    effect_size = mde / np.sqrt(baseline_rate * (1 - baseline_rate))
    analysis = TTestIndPower()  # power analysis lives in statsmodels, not scipy.stats
    return int(np.ceil(analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )))

# Example: 5% baseline, 10% relative lift
n = calculate_sample_size(0.05, 0.005)
print(f"Required sample size per group: {n}")

Statistical Significance

def analyze_ab_test(control, treatment):
    """Analyze A/B test results (control/treatment are 0/1 outcome arrays)."""
    # Two-proportion z-test with a pooled standard error
    n1, n2 = len(control), len(treatment)
    p1, p2 = control.mean(), treatment.mean()
    p_pool = (control.sum() + treatment.sum()) / (n1 + n2)

    se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    return {
        'control_rate': p1,
        'treatment_rate': p2,
        'lift': (p2 - p1) / p1,
        'p_value': p_value,
        'significant': p_value < 0.05
    }
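For example, with simulated binary outcomes (a sketch with hypothetical rates and sample sizes; the function expects array-like 0/1 outcomes such as NumPy arrays or pandas Series):

# Hypothetical example: 2.0% vs 2.4% conversion, 20k users per group
rng = np.random.default_rng(42)
control = rng.binomial(1, 0.020, size=20_000)
treatment = rng.binomial(1, 0.024, size=20_000)
print(analyze_ab_test(control, treatment))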

Feature Engineering

Common Patterns

import pandas as pd

def engineer_features(df):
    """Feature engineering pipeline."""
    # Temporal features
    df['hour'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6])

    # Aggregations
    df['user_avg_spend'] = df.groupby('user_id')['amount'].transform('mean')
    df['user_transaction_count'] = df.groupby('user_id')['amount'].transform('count')

    # Ratios
    df['spend_vs_avg'] = df['amount'] / df['user_avg_spend']

    return df
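A minimal usage sketch (the timestamp, user_id, and amount columns are the ones the pipeline above assumes):

transactions = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-06 09:30', '2024-01-08 14:00', '2024-01-06 20:15']),
    'user_id': [1, 1, 2],
    'amount': [20.0, 40.0, 15.0],
})
print(engineer_features(transactions)[['hour', 'is_weekend', 'spend_vs_avg']])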

Feature Selection

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_features(X, y, k=10):
    """Select top k features by mutual information (expects a DataFrame)."""
    mi_scores = mutual_info_classif(X, y, random_state=42)
    top_k = np.argsort(mi_scores)[-k:][::-1]  # highest-scoring feature first
    return X.columns[top_k].tolist()
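Because the function reads X.columns, it assumes X is a pandas DataFrame. A quick sketch on a bundled dataset:

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
print(select_features(data.data, data.target, k=5))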

Model Evaluation

Cross-Validation

from sklearn.model_selection import cross_validate, StratifiedKFold

def evaluate_model(model, X, y):
    """Robust model evaluation with stratified cross-validation."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # cross_validate fits each fold once and scores all metrics together
    metrics = ['accuracy', 'precision', 'recall', 'roc_auc']
    results = cross_validate(model, X, y, cv=cv, scoring=metrics)

    return {
        m: f"{results[f'test_{m}'].mean():.3f} (+/- {results[f'test_{m}'].std() * 2:.3f})"
        for m in metrics
    }
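A usage sketch on synthetic data (the make_classification parameters here are arbitrary):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
print(evaluate_model(LogisticRegression(max_iter=1000), X, y))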

Causal Inference

Propensity Score Matching

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_matching(df, treatment_col, features):
    """Match treated to control units on propensity scores.

    Uses 1-nearest-neighbor matching with replacement, so a control
    unit may be matched to more than one treated unit.
    """
    # Estimate propensity scores with logistic regression
    ps_model = LogisticRegression(max_iter=1000)
    ps_model.fit(df[features], df[treatment_col])
    df['propensity'] = ps_model.predict_proba(df[features])[:, 1]

    # Match nearest neighbors
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]

    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(control[['propensity']])
    distances, indices = nn.kneighbors(treated[['propensity']])

    return treated, control.iloc[indices.flatten()]
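With matched groups in hand, a simple effect estimate is the difference in mean outcomes between treated units and their matched controls (a sketch; outcome_col is a hypothetical column name):

def estimate_att(df, treatment_col, features, outcome_col):
    """Average treatment effect on the treated (ATT) from matched pairs."""
    treated, matched_control = propensity_matching(df, treatment_col, features)
    return treated[outcome_col].mean() - matched_control[outcome_col].mean()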

Best Practices

Analysis Workflow

  1. Define hypothesis clearly
  2. Calculate required sample size
  3. Design experiment (randomization)
  4. Collect data with quality checks
  5. Analyze with appropriate tests
  6. Report with confidence intervals (see the sketch after this list)
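For step 6, a normal-approximation confidence interval for the difference in conversion rates (a sketch reusing the numpy and scipy imports from the A/B test section; note it uses the unpooled standard error, unlike the pooled one in the z-test above):

def diff_ci(control, treatment, alpha=0.05):
    """Normal-approximation CI for the difference in conversion rates."""
    n1, n2 = len(control), len(treatment)
    p1, p2 = control.mean(), treatment.mean()
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = stats.norm.ppf(1 - alpha / 2)
    diff = p2 - p1
    return diff - z * se, diff + z * se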

Common Pitfalls

  • Multiple comparisons without correction (see the sketch after this list)
  • Peeking at results before sample size reached
  • Simpson's paradox in aggregations
  • Survivorship bias in cohort analysis
  • Correlation vs causation confusion
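For the first pitfall, statsmodels ships a standard correction helper; a sketch with hypothetical p-values using Benjamini-Hochberg FDR control:

from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.20]  # hypothetical per-metric p-values
reject, adjusted, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
for p, p_adj, sig in zip(p_values, adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={sig}")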

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) creating new spreadsheets with formulas and formatting, (2) reading or analyzing data, (3) modifying existing spreadsheets while preserving formulas, (4) data analysis and visualization in spreadsheets, or (5) recalculating formulas.

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis.

Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

Team Composition Analysis

This skill should be used when the user asks to "plan team structure", "determine hiring needs", "design org chart", "calculate compensation", "plan equity allocation", or requests organizational design and headcount planning for a startup.

Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

Anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.

Skill Information

Category: Creative
Version: 1.0.0
Last Updated: 1/15/2026