name: data-scientist description: Expert data science covering machine learning, statistical modeling, experimentation, predictive analytics, and advanced analytics. version: 1.0.0 author: Claude Skills category: data-analytics tags: [data-science, machine-learning, statistics, modeling, analytics]

Data Scientist

Expert-level data science for business impact.

Core Competencies

Machine learning
Statistical modeling
Experimentation design
Predictive analytics
Feature engineering
Model evaluation
Data storytelling
Production ML

Machine Learning Workflow

PROBLEM DEFINITION → DATA → FEATURES → MODEL → EVALUATION → DEPLOYMENT

1. Problem Definition
   - Business objective
   - Success metrics
   - Constraints

2. Data Collection
   - Data sources
   - Data quality
   - Sample size

3. Feature Engineering
   - Feature creation
   - Feature selection
   - Transformation

4. Model Development
   - Algorithm selection
   - Training
   - Tuning

5. Evaluation
   - Metrics
   - Validation
   - Business impact

6. Deployment
   - Production pipeline
   - Monitoring
   - Iteration

Model Selection

Algorithm Comparison

Algorithm	Use Case	Pros	Cons
Linear Regression	Continuous prediction	Interpretable, fast	Linear relationships only
Logistic Regression	Binary classification	Interpretable, probabilistic	Linear boundaries
Random Forest	Classification/Regression	Handles non-linearity	Less interpretable
XGBoost	Classification/Regression	High accuracy	Overfitting risk
Neural Networks	Complex patterns	Flexible	Requires lots of data

Model Selection Framework

def select_model(problem_type, data_size, interpretability_need, accuracy_need):
    """
    problem_type: 'classification' or 'regression'
    data_size: 'small' (<10K), 'medium' (10K-1M), 'large' (>1M)
    interpretability_need: 'high', 'medium', 'low'
    accuracy_need: 'high', 'medium', 'low'
    """

    if interpretability_need == 'high':
        if problem_type == 'classification':
            return 'Logistic Regression'
        else:
            return 'Linear Regression'

    if data_size == 'small':
        return 'Random Forest'

    if accuracy_need == 'high':
        if data_size == 'large':
            return 'Neural Network'
        else:
            return 'XGBoost'

    return 'Random Forest'

Feature Engineering

Feature Types

# Numerical Features
def engineer_numerical(df, col):
    features = {
        f'{col}_log': np.log1p(df[col]),
        f'{col}_sqrt': np.sqrt(df[col]),
        f'{col}_squared': df[col] ** 2,
        f'{col}_binned': pd.cut(df[col], bins=5, labels=False)
    }
    return pd.DataFrame(features)

# Categorical Features
def engineer_categorical(df, col):
    # One-hot encoding
    dummies = pd.get_dummies(df[col], prefix=col)

    # Target encoding
    target_mean = df.groupby(col)['target'].mean()
    target_encoded = df[col].map(target_mean)

    # Frequency encoding
    freq = df[col].value_counts(normalize=True)
    freq_encoded = df[col].map(freq)

    return dummies, target_encoded, freq_encoded

# Time Features
def engineer_time(df, col):
    df[col] = pd.to_datetime(df[col])
    features = {
        f'{col}_hour': df[col].dt.hour,
        f'{col}_day': df[col].dt.day,
        f'{col}_dayofweek': df[col].dt.dayofweek,
        f'{col}_month': df[col].dt.month,
        f'{col}_is_weekend': df[col].dt.dayofweek.isin([5, 6]).astype(int),
        f'{col}_hour_sin': np.sin(2 * np.pi * df[col].dt.hour / 24),
        f'{col}_hour_cos': np.cos(2 * np.pi * df[col].dt.hour / 24)
    }
    return pd.DataFrame(features)

Feature Selection

from sklearn.feature_selection import mutual_info_classif, RFE
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, method='importance', n_features=20):
    if method == 'importance':
        model = RandomForestClassifier(n_estimators=100)
        model.fit(X, y)
        importance = pd.Series(model.feature_importances_, index=X.columns)
        return importance.nlargest(n_features).index.tolist()

    elif method == 'mutual_info':
        mi_scores = mutual_info_classif(X, y)
        mi_series = pd.Series(mi_scores, index=X.columns)
        return mi_series.nlargest(n_features).index.tolist()

    elif method == 'rfe':
        model = RandomForestClassifier(n_estimators=100)
        rfe = RFE(model, n_features_to_select=n_features)
        rfe.fit(X, y)
        return X.columns[rfe.support_].tolist()

Model Evaluation

Classification Metrics

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
    classification_report
)

def evaluate_classification(y_true, y_pred, y_proba=None):
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
    }

    if y_proba is not None:
        metrics['auc_roc'] = roc_auc_score(y_true, y_proba)

    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))

    return metrics

Regression Metrics

from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    r2_score, mean_absolute_percentage_error
)

def evaluate_regression(y_true, y_pred):
    metrics = {
        'mae': mean_absolute_error(y_true, y_pred),
        'mse': mean_squared_error(y_true, y_pred),
        'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
        'r2': r2_score(y_true, y_pred),
        'mape': mean_absolute_percentage_error(y_true, y_pred)
    }
    return metrics

Experimentation

A/B Test Design

from scipy import stats

def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """
    Calculate required sample size per variant

    baseline_rate: Current conversion rate (e.g., 0.05)
    mde: Minimum detectable effect (e.g., 0.1 for 10% lift)
    alpha: Significance level
    power: Statistical power
    """
    effect_size = baseline_rate * mde
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    p = baseline_rate
    p_new = p + effect_size

    n = (2 * p * (1 - p) * (z_alpha + z_beta) ** 2) / (effect_size ** 2)

    return int(np.ceil(n))


def analyze_ab_test(control, treatment, alpha=0.05):
    """
    Analyze A/B test results

    control: array of 0/1 outcomes for control
    treatment: array of 0/1 outcomes for treatment
    """
    n_control = len(control)
    n_treatment = len(treatment)

    p_control = control.mean()
    p_treatment = treatment.mean()

    # Pooled proportion
    p_pool = (control.sum() + treatment.sum()) / (n_control + n_treatment)

    # Standard error
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n_control + 1/n_treatment))

    # Z-statistic
    z = (p_treatment - p_control) / se

    # P-value (two-tailed)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # Confidence interval
    ci_low = (p_treatment - p_control) - 1.96 * se
    ci_high = (p_treatment - p_control) + 1.96 * se

    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'lift': (p_treatment - p_control) / p_control,
        'z_statistic': z,
        'p_value': p_value,
        'significant': p_value < alpha,
        'confidence_interval': (ci_low, ci_high)
    }

Statistical Analysis

Hypothesis Testing

from scipy import stats

# T-test
def compare_means(group1, group2):
    stat, p_value = stats.ttest_ind(group1, group2)
    effect_size = (group1.mean() - group2.mean()) / np.sqrt(
        (group1.std()**2 + group2.std()**2) / 2
    )
    return {'t_statistic': stat, 'p_value': p_value, 'cohens_d': effect_size}

# Chi-square
def test_independence(contingency_table):
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
    return {'chi2': chi2, 'p_value': p_value, 'degrees_of_freedom': dof}

# Correlation
def analyze_correlation(x, y):
    pearson_r, pearson_p = stats.pearsonr(x, y)
    spearman_r, spearman_p = stats.spearmanr(x, y)
    return {
        'pearson': {'r': pearson_r, 'p_value': pearson_p},
        'spearman': {'r': spearman_r, 'p_value': spearman_p}
    }

Project Template

# Data Science Project: [Name]

## Business Objective
[What business problem are we solving?]

## Success Metrics
- Primary: [Metric]
- Secondary: [Metric]

## Data
- Sources: [List]
- Size: [Rows/Features]
- Time period: [Dates]

## Methodology
1. [Step 1]
2. [Step 2]

## Results

### Model Performance
| Metric | Value |
|--------|-------|
| [Metric] | [Value] |

### Business Impact
- [Impact 1]
- [Impact 2]

## Recommendations
1. [Recommendation]

## Next Steps
- [Next step]

## Appendix
[Technical details]

Reference Materials

references/ml_algorithms.md - Algorithm deep dives
references/feature_engineering.md - Feature engineering patterns
references/experimentation.md - A/B testing guide
references/statistics.md - Statistical methods

Scripts

# Model trainer
python scripts/train_model.py --config model_config.yaml

# Feature importance analyzer
python scripts/feature_importance.py --model model.pkl --data test.csv

# A/B test analyzer
python scripts/ab_analyzer.py --control control.csv --treatment treatment.csv

# Model evaluator
python scripts/evaluate_model.py --model model.pkl --test test.csv

Data Scientist

Skill Details

Repository Files

name: data-scientist description: Expert data science covering machine learning, statistical modeling, experimentation, predictive analytics, and advanced analytics. version: 1.0.0 author: Claude Skills category: data-analytics tags: [data-science, machine-learning, statistics, modeling, analytics]

Data Scientist

Core Competencies

Machine Learning Workflow

Model Selection

Algorithm Comparison

Model Selection Framework

Feature Engineering

Feature Types

Feature Selection

Model Evaluation

Classification Metrics

Regression Metrics

Experimentation

A/B Test Design

Statistical Analysis

Hypothesis Testing

Project Template

Reference Materials

Scripts

Related Skills

Xlsx

Clickhouse Io

Clickhouse Io

Analyzing Financial Statements

Data Storytelling

Kpi Dashboard Design

Dbt Transformation Patterns

Sql Optimization Patterns

Anndata

Xlsx

Skill Information