Nixtla Benchmark Reporter

by intent-solutions-io


Generate comprehensive markdown benchmark reports from forecast accuracy metrics with model comparisons, statistical analysis, and regression detection. Use when analyzing baseline performance, comparing forecast models, or validating model quality. Trigger with 'generate benchmark report', 'analyze forecast metrics', or 'create performance summary'.


---
name: nixtla-benchmark-reporter
description: Generate comprehensive markdown benchmark reports from forecast accuracy metrics with model comparisons, statistical analysis, and regression detection. Use when analyzing baseline performance, comparing forecast models, or validating model quality. Trigger with 'generate benchmark report', 'analyze forecast metrics', or 'create performance summary'.
allowed-tools: "Read,Write,Glob,Bash(python:*)"
version: "1.0.0"
author: "Jeremy Longshore jeremy@intentsolutions.io"
license: MIT
---

Nixtla Benchmark Reporter

Purpose

Generate production-ready benchmark reports from forecasting accuracy metrics, enabling systematic model comparison and regression detection for Nixtla forecasting workflows.

Overview

This skill transforms raw forecast metrics (sMAPE, MASE, MAE, RMSE) into actionable insights. It:

  • Parses benchmark results CSV files from statsforecast/TimeGPT experiments
  • Calculates summary statistics (mean, median, std dev, percentiles)
  • Generates model comparison tables with winners highlighted
  • Creates regression detection reports comparing current vs. baseline results
  • Produces GitHub issue templates for performance degradations
  • Generates markdown reports with embedded charts and recommendations

Key Benefits:

  • Automates tedious manual benchmarking analysis (2-3 hours → 2 minutes)
  • Provides consistent reporting format across all forecasting experiments
  • Detects performance regressions automatically
  • Generates shareable, version-controlled markdown reports

Prerequisites

  • Benchmark results CSV files with metrics per series and model
  • CSV format: columns series_id, model, sMAPE, MASE (minimum)
  • Optional: Baseline results CSV for regression comparison
  • Python 3.8+ with pandas, numpy installed

Expected CSV Structure:

series_id,model,sMAPE,MASE,MAE,RMSE
D1,SeasonalNaive,15.23,1.05,12.5,18.3
D1,AutoETS,13.45,0.92,10.2,15.1
D1,AutoTheta,12.34,0.87,9.8,14.5
D2,SeasonalNaive,18.67,1.23,15.1,22.4
...
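The sMAPE and MASE columns are assumed to follow the standard definitions. As a reference, here is a minimal NumPy sketch; the script's exact implementation (e.g. its handling of zero denominators) may differ:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent: mean of 2|y - f| / (|y| + |f|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(2.0 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))

def mase(y_true, y_pred, y_insample, season_length=1):
    """Forecast MAE scaled by the in-sample seasonal-naive MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_insample = np.asarray(y_insample, dtype=float)
    scale = np.mean(np.abs(y_insample[season_length:] - y_insample[:-season_length]))
    return np.mean(np.abs(y_true - y_pred)) / scale

print(round(smape([100.0], [110.0]), 2))  # 9.52
print(round(mase([5.0, 6.0], [5.0, 5.0], [1.0, 2.0, 3.0, 4.0]), 2))  # 0.5
```

A MASE below 1.0 means the model beat a naive forecast on the training data's own scale, which is why it complements the percentage-based sMAPE.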

Instructions

Step 1: Parse Benchmark Results

The script automatically:

  1. Reads benchmark CSV file(s)
  2. Validates CSV structure (required columns present)
  3. Extracts unique models and series
  4. Groups metrics by model

Usage:

python {baseDir}/scripts/generate_benchmark_report.py \
    --results /path/to/benchmark_results.csv \
    --output /path/to/report.md
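The parse-and-validate step can be sketched with pandas; the helper below is illustrative, not the script's actual internals:

```python
import io
import pandas as pd

REQUIRED_COLUMNS = ["series_id", "model", "sMAPE", "MASE"]

def load_results(csv_source):
    """Read a benchmark CSV and fail fast if required columns are missing."""
    df = pd.read_csv(csv_source)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Required columns missing: {', '.join(missing)}")
    if df.empty:
        print("Warning: No metrics found in CSV file")
    return df

sample = io.StringIO(
    "series_id,model,sMAPE,MASE\n"
    "D1,SeasonalNaive,15.23,1.05\n"
    "D1,AutoETS,13.45,0.92\n"
)
df = load_results(sample)
print(sorted(df["model"].unique()))  # ['AutoETS', 'SeasonalNaive']
```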

Step 2: Calculate Summary Statistics

For each model, the script calculates:

  • Mean: Average metric across all series
  • Median: Middle value (less sensitive to outliers)
  • Std Dev: Measure of consistency
  • Min/Max: Best and worst performance
  • Percentiles: 25th, 50th, 75th, 95th percentiles
  • Win Rate: Percentage of series where model performed best
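The aggregation can be sketched with a pandas groupby; the toy DataFrame below is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "series_id": ["D1", "D1", "D2", "D2"],
    "model":     ["AutoETS", "SeasonalNaive", "AutoETS", "SeasonalNaive"],
    "sMAPE":     [13.45, 15.23, 14.10, 18.67],
})

# Per-model location and spread statistics on the primary metric.
stats = df.groupby("model")["sMAPE"].agg(["mean", "median", "std", "min", "max"])
# Selected percentiles (25th/50th/75th/95th), one column per quantile level.
percentiles = df.groupby("model")["sMAPE"].quantile([0.25, 0.50, 0.75, 0.95]).unstack()
print(stats.round(2))
```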

Step 3: Generate Comparison Table

Creates a markdown table comparing all models:

## Model Comparison (sMAPE)

| Model | Mean | Median | Std Dev | Min | Max | Wins |
|-------|------|--------|---------|-----|-----|------|
| AutoTheta | 12.3% | 11.8% | 4.2% | 5.1% | 28.9% | 32/50 (64%) |
| AutoETS | 13.5% | 12.9% | 5.1% | 6.2% | 31.2% | 18/50 (36%) |
| SeasonalNaive | 15.2% | 14.5% | 6.3% | 7.8% | 35.4% | 0/50 (0%) |
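The Wins column counts, for each series, which model achieved the lowest sMAPE. A sketch of that computation on toy data (not the script's actual code):

```python
import pandas as pd

df = pd.DataFrame({
    "series_id": ["D1", "D1", "D2", "D2"],
    "model":     ["AutoETS", "AutoTheta", "AutoETS", "AutoTheta"],
    "sMAPE":     [13.45, 12.34, 14.10, 15.00],
})

# Winner per series: the row with the minimum sMAPE within each series group.
winners = df.loc[df.groupby("series_id")["sMAPE"].idxmin(), "model"]
win_rate = winners.value_counts() / df["series_id"].nunique()
print(win_rate)  # AutoTheta wins D1, AutoETS wins D2 -> 0.5 each
```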

Step 4: Identify Winner and Recommendations

Determines the overall best model based on:

  1. Primary metric: Lowest mean sMAPE/MASE
  2. Consistency: Lowest standard deviation
  3. Win rate: Most series won

Generates recommendations:

  • Production baseline model selection
  • When to use alternatives (e.g., AutoETS for seasonal data)
  • Failure case analysis (series where all models struggle)
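One plausible way to encode the ranking criteria above: sort by mean of the primary metric and break ties on standard deviation. The numbers below are the example values from the comparison table, not live results:

```python
import pandas as pd

stats = pd.DataFrame({
    "model":      ["AutoTheta", "AutoETS", "SeasonalNaive"],
    "mean_smape": [12.3, 13.5, 15.2],
    "std_smape":  [4.2, 5.1, 6.3],
}).set_index("model")

# Primary sort key: lowest mean sMAPE; secondary: lowest std dev (consistency).
ranked = stats.sort_values(["mean_smape", "std_smape"])
winner = ranked.index[0]
print(winner)  # AutoTheta
```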

Step 5: Regression Detection (Optional)

If baseline results are provided, the script compares current vs. baseline:

python {baseDir}/scripts/generate_benchmark_report.py \
    --results current_results.csv \
    --baseline baseline_results.csv \
    --output regression_report.md \
    --threshold 5.0  # Alert if sMAPE degrades >5%

Regression Report Includes:

  • Models with performance degradation
  • Severity of regression (% change)
  • Affected series
  • GitHub issue template for regressions
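The threshold check itself is straightforward: a model regresses when its relative sMAPE change exceeds the threshold. The per-model means below are illustrative values, not real run output:

```python
import pandas as pd

# Per-model mean sMAPE for the baseline and the current run (illustrative).
baseline = pd.Series({"AutoTheta": 12.3, "AutoETS": 13.5})
current = pd.Series({"AutoTheta": 12.7, "AutoETS": 14.8})
threshold = 3.0  # flag models whose sMAPE worsened by more than 3% (relative)

pct_change = 100.0 * (current - baseline) / baseline
regressions = pct_change[pct_change > threshold].sort_values(ascending=False)
for model, delta in regressions.items():
    print(f"REGRESSION: {model} sMAPE {baseline[model]}% -> {current[model]}% (+{delta:.1f}%)")
```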

Step 6: Customize Report Format

Supports multiple output formats:

Standard Report (default):

python {baseDir}/scripts/generate_benchmark_report.py --results metrics.csv

Executive Summary (1-page):

python {baseDir}/scripts/generate_benchmark_report.py \
    --results metrics.csv \
    --format executive \
    --output summary.md

GitHub Issue Template:

python {baseDir}/scripts/generate_benchmark_report.py \
    --results metrics.csv \
    --format github \
    --output .github/ISSUE_TEMPLATE/regression.md

Output

The script generates:

Standard Report (report.md):

  1. Executive Summary (1-2 paragraphs)
  2. Model Comparison Table (all metrics)
  3. Statistical Analysis (means, std devs, percentiles)
  4. Winner Declaration with justification
  5. Per-Series Breakdown (optional)
  6. Recommendations for production use
  7. Failure Case Analysis (series with sMAPE > 30%)

Regression Report (if baseline provided):

  1. Regression Summary (models degraded)
  2. Severity Analysis (% change per model)
  3. Affected Series List
  4. GitHub Issue Template

GitHub Issue Template:

---
title: "Performance Regression Detected: {model_name}"
labels: ["regression", "performance"]
assignees: ["team-lead"]
---

## Regression Summary
Model: {model_name}
Metric: sMAPE degraded by {X}%
Baseline: {baseline_value}%
Current: {current_value}%

## Affected Series
- {series_1}: {baseline}% → {current}% ({delta}%)
- {series_2}: {baseline}% → {current}% ({delta}%)
...

## Acceptance Criteria
- [ ] Investigate root cause
- [ ] Restore performance to within 2% of baseline
- [ ] Add regression test to CI/CD

Error Handling

Missing Metrics File:

Error: Benchmark results not found at /path/to/results.csv
Solution: Verify path and ensure CSV file exists

Invalid CSV Structure:

Error: Required columns missing: series_id, model, sMAPE
Solution: Ensure CSV has minimum required columns

Empty Results:

Warning: No metrics found in CSV file
Solution: Verify CSV has data rows (not just headers)

Regression Threshold Exceeded:

🚨 REGRESSION DETECTED: AutoTheta sMAPE degraded by 12.5%
  Baseline: 12.3%
  Current: 13.8%
  Threshold: 5.0%
Solution: Review recent model changes, check data quality

Examples

Example 1: Generate Standard Benchmark Report

python {baseDir}/scripts/generate_benchmark_report.py \
    --results nixtla_baseline_m4/results_M4_Daily_h14.csv \
    --output reports/m4_daily_baseline.md \
    --verbose

Output:

✓ Loaded 150 results (50 series × 3 models)
✓ Calculated summary statistics
✓ Identified winner: AutoTheta (mean sMAPE: 12.3%)
✓ Generated report: reports/m4_daily_baseline.md (1,245 words)

Example 2: Detect Regressions vs. Baseline

python {baseDir}/scripts/generate_benchmark_report.py \
    --results current_run/results.csv \
    --baseline baseline/v1.0_results.csv \
    --output regression_report.md \
    --threshold 3.0

Output:

⚠️  REGRESSION DETECTED in 2/3 models:
  - AutoETS: sMAPE 13.5% → 14.8% (+9.6%)
  - AutoTheta: sMAPE 12.3% → 12.7% (+3.3%)
✓ Generated regression report with GitHub issue template

Example 3: Generate Executive Summary

python {baseDir}/scripts/generate_benchmark_report.py \
    --results quarterly_benchmark.csv \
    --format executive \
    --output Q1_summary.md

Output:

# Q1 2025 Forecast Baseline Report

**Winner**: AutoTheta with 12.3% sMAPE (vs. 13.5% AutoETS, 15.2% Naive)

**Key Findings**:
- AutoTheta won 64% of series (32/50)
- Most consistent performance (std dev 4.2%)
- Recommended for production baseline

**Action Items**:
- Deploy AutoTheta as default model
- Use AutoETS for highly seasonal data (criteria: seasonal_strength > 0.8)
- Investigate 3 failure cases (sMAPE > 30%)

Example 4: Custom Metric Focus

python {baseDir}/scripts/generate_benchmark_report.py \
    --results results.csv \
    --primary-metric MASE \
    --output mase_focused_report.md

Best Practices

  1. Version Control Reports: Commit generated reports to track performance over time
  2. Automate in CI/CD: Generate reports automatically on every benchmark run
  3. Set Regression Thresholds: Use --threshold to catch regressions early (recommend 3-5%)
  4. Include Timestamps: Reports automatically include generation date/time
  5. Document Assumptions: Reports include metadata about benchmark setup
  6. Share with Stakeholders: Markdown reports render nicely on GitHub/GitLab
  7. Archive Baselines: Keep historical baseline CSVs for regression comparison

Resources

Related Skills

Attack Tree Construction

Build comprehensive attack trees to visualize threat paths. Use when mapping attack scenarios, identifying defense gaps, or communicating security risks to stakeholders.

Grafana Dashboards

Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.

Matplotlib

Foundational plotting library. Create line plots, scatter, bar, histograms, heatmaps, 3D, subplots, export PNG/PDF/SVG, for scientific visualization and publication figures.

Scientific Visualization

Create publication figures with matplotlib/seaborn/plotly. Multi-panel layouts, error bars, significance markers, colorblind-safe, export PDF/EPS/TIFF, for journal-ready scientific plots.

Seaborn

Statistical visualization. Scatter, box, violin, heatmaps, pair plots, regression, correlation matrices, KDE, faceted plots, for exploratory analysis and publication figures.

Shap

Model interpretability and explainability using SHAP (SHapley Additive exPlanations). Use this skill when explaining machine learning model predictions, computing feature importance, generating SHAP plots (waterfall, beeswarm, bar, scatter, force, heatmap), debugging models, analyzing model bias or fairness, comparing models, or implementing explainable AI. Works with tree-based models (XGBoost, LightGBM, Random Forest), deep learning (TensorFlow, PyTorch), linear models, and any black-box model.

Pydeseq2

Differential gene expression analysis (Python DESeq2). Identify DE genes from bulk RNA-seq counts, Wald tests, FDR correction, volcano/MA plots, for RNA-seq analysis.

Query Writing

For writing and executing SQL queries, from simple single-table queries to complex multi-table JOINs and aggregations.

Skill Information

Category: Skill
License: MIT
Version: 1.0.0
Allowed Tools: Read,Write,Glob,Bash(python:*)
Last Updated: 12/23/2025