Nixtla Baseline Review
by intent-solutions-io
name: nixtla-baseline-review
description: Analyze Nixtla baseline forecasting results (sMAPE/MASE on M4 or other benchmark datasets). Use when the user asks about baseline performance, model comparisons, or metric interpretation for Nixtla time-series experiments. Trigger with "baseline review", "interpret sMAPE/MASE", or "compare AutoETS vs AutoTheta".
allowed-tools: Read,Grep,Bash(ls:*)
version: 1.0.0
author: Jeremy Longshore jeremy@intentsolutions.io
license: MIT
Nixtla Baseline Review Skill
Overview
Analyze baseline forecasting results from the nixtla-baseline-m4 workflow. Interpret metrics, compare models, surface patterns, and recommend next steps.
When to Use This Skill
Activate this skill when the user:
- Asks "Which baseline model performed best?"
- Requests interpretation of sMAPE or MASE metrics
- Wants to compare AutoETS vs AutoTheta vs SeasonalNaive
- Says "Explain these baseline results"
- Needs guidance on model selection based on baseline performance
Prerequisites
- Baseline results must exist in the nixtla_baseline_m4/ directory
- At minimum, a results_*.csv file must be present
- CSV format: columns series_id, model, sMAPE, MASE
Instructions
Step 1: Locate Results Files
Use the Bash tool to find baseline results:
# Check for results directory (use Bash tool)
ls -la nixtla_baseline_m4/
# Identify most recent results file
ls -t nixtla_baseline_m4/results_*.csv | head -1
Expected files:
- results_M4_Daily_h{horizon}.csv - Full metrics table
- summary_M4_Daily_h{horizon}.txt - Text summary (optional)
If files are missing, inform the user they need to run /nixtla-baseline-m4 first.
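For readers who prefer to script this step outside the skill's allowed tools, the lookup above can be sketched in Python. This is a minimal sketch, not part of the skill itself; the helper name `latest_results_file` is hypothetical, and only the directory name and the `results_*.csv` pattern come from this document.

```python
from pathlib import Path
from typing import Optional


def latest_results_file(results_dir: str = "nixtla_baseline_m4") -> Optional[Path]:
    """Return the most recently modified results_*.csv, or None if there is none.

    Mirrors `ls -t nixtla_baseline_m4/results_*.csv | head -1`.
    """
    directory = Path(results_dir)
    if not directory.is_dir():
        return None  # directory missing: the baseline has not been run yet
    candidates = list(directory.glob("results_*.csv"))
    if not candidates:
        return None  # directory exists but holds no results file
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


path = latest_results_file()
if path is None:
    print("No baseline results found; run /nixtla-baseline-m4 first.")
else:
    print(f"Using {path}")
```

Returning `None` rather than raising keeps the "files are missing" case easy to turn into the user-facing message described above.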
Step 2: Load and Parse Metrics
Read the metrics CSV file:
# View first few rows to confirm format
head -10 nixtla_baseline_m4/results_M4_Daily_h*.csv
# Or use Read tool to load the full file
Expected CSV structure:
series_id,model,sMAPE,MASE
D1,SeasonalNaive,15.23,1.05
D1,AutoETS,13.45,0.92
D1,AutoTheta,12.34,0.87
D2,SeasonalNaive,18.67,1.23
...
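As an illustration of the structure above, the rows can be parsed with Python's standard `csv` module. This is a sketch under the assumption that the file has exactly the four columns shown; the function name `parse_results` is hypothetical, and the sample text reuses the rows listed above.

```python
import csv
import io

# The sample rows shown above, including the header line.
SAMPLE = """series_id,model,sMAPE,MASE
D1,SeasonalNaive,15.23,1.05
D1,AutoETS,13.45,0.92
D1,AutoTheta,12.34,0.87
D2,SeasonalNaive,18.67,1.23
"""


def parse_results(text):
    """Parse results CSV text into a list of dicts with numeric metrics."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "series_id": record["series_id"],
                "model": record["model"],
                "sMAPE": float(record["sMAPE"]),
                "MASE": float(record["MASE"]),
            }
        )
    return rows


rows = parse_results(SAMPLE)
series = sorted({row["series_id"] for row in rows})
print(f"{len(rows)} rows covering {len(series)} series: {series}")
```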
Calculate summary statistics manually or with bash:
- Count total series (the header row adds one to the result): cut -d, -f1 results_*.csv | sort -u | wc -l
- Extract model metrics: grep "AutoTheta" results_*.csv
Step 3: Calculate Summary Statistics
For each model (SeasonalNaive, AutoETS, AutoTheta), calculate:
- Mean sMAPE: Average across all series
- Median sMAPE: Middle value (less sensitive to outliers)
- Standard Deviation: Measure of consistency
- Series Won: Count how many series each model performed best on
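The four statistics above can be sketched with Python's standard library. This is an illustrative sketch, not the workflow's own code: the function name `summarize` is hypothetical, ties for "Series Won" go to whichever model appears first, and the standard deviation is the sample standard deviation (the document does not say which variant is intended). The D2 values for AutoETS and AutoTheta in the demo data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def summarize(rows):
    """Summarize (series_id, model, smape) tuples per model."""
    by_model = defaultdict(list)
    best = {}  # series_id -> (lowest sMAPE seen so far, model that achieved it)
    for series_id, model, smape in rows:
        by_model[model].append(smape)
        if series_id not in best or smape < best[series_id][0]:
            best[series_id] = (smape, model)

    wins = defaultdict(int)
    for _, model in best.values():
        wins[model] += 1

    return {
        model: {
            "mean": mean(values),
            "median": median(values),
            # Sample standard deviation needs at least two observations.
            "std": stdev(values) if len(values) > 1 else 0.0,
            "wins": wins[model],
        }
        for model, values in by_model.items()
    }


demo_rows = [
    ("D1", "SeasonalNaive", 15.23),
    ("D1", "AutoETS", 13.45),
    ("D1", "AutoTheta", 12.34),
    ("D2", "SeasonalNaive", 18.67),
    ("D2", "AutoETS", 16.00),    # hypothetical value
    ("D2", "AutoTheta", 17.00),  # hypothetical value
]
summary = summarize(demo_rows)
for model, stats in summary.items():
    print(model, stats)
```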
Step 4: Interpret Metrics
sMAPE (Symmetric Mean Absolute Percentage Error):
- Range: 0% (perfect) to 200% (worst)
- Good: < 10%, Acceptable: 10-20%, Poor: > 20%
MASE (Mean Absolute Scaled Error):
- < 1.0: Better than seasonal naive baseline
- 1.0: Same as seasonal naive
- > 1.0: Worse than seasonal naive
- Interpretation: "MASE of 0.85 means this model's mean absolute error is 15% lower than the seasonal naive baseline's"
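To make the two ranges concrete, here is a sketch of the commonly used definitions of both metrics. It assumes the usual formulas (sMAPE as the mean of 2|y - f| / (|y| + |f|) in percent, MASE as forecast MAE divided by the in-sample MAE of a seasonal naive forecast); the exact implementation used by the baseline workflow is not shown in this document, and the `season_length` parameter is a name chosen for illustration.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent: 0 is perfect, 200 is the upper bound."""
    terms = []
    for y, f in zip(actual, forecast):
        denom = abs(y) + abs(f)
        # Treat a 0/0 term (both values zero) as a perfect forecast.
        terms.append(0.0 if denom == 0 else 2.0 * abs(y - f) / denom)
    return 100.0 * sum(terms) / len(terms)


def mase(actual, forecast, train, season_length):
    """Forecast MAE scaled by the in-sample seasonal naive MAE."""
    naive_errors = [
        abs(train[t] - train[t - season_length])
        for t in range(season_length, len(train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)
    return mae / scale


print(smape([100.0], [100.0]))  # perfect forecast
print(smape([100.0], [0.0]))    # worst case
print(mase([6.0, 7.0], [6.5, 7.5], train=[1.0, 2.0, 3.0, 4.0, 5.0], season_length=1))
```

A MASE below 1.0 here means the forecast's average error is smaller than the error the seasonal naive method made on the training data.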
Model Characteristics:
- SeasonalNaive: Simplest baseline, repeats seasonal pattern. Good for stable series.
- AutoETS: Exponential smoothing with automatic parameter selection. Good for trend + seasonality.
- AutoTheta: Theta method with optimization. Often wins on M4 benchmarks, handles trends well.
Step 5: Identify Patterns
Look for:
- Dominant model: Which model wins most often?
- Consistency: Which model has lowest standard deviation in sMAPE?
- Series-specific patterns: Are there series types where certain models excel?
- Failure cases: Where do ALL models struggle? (sMAPE > 30%, MASE > 1.5)
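The failure-case check above can be sketched as follows. The document lists both thresholds without saying how they combine, so this sketch assumes a model "struggles" on a series when either sMAPE > 30 or MASE > 1.5, and a series is a failure case only when every model struggles on it. The function name and the demo rows are hypothetical.

```python
from collections import defaultdict


def failure_cases(rows, smape_limit=30.0, mase_limit=1.5):
    """Return series IDs on which every model exceeds a threshold.

    rows: iterable of (series_id, model, smape, mase) tuples.
    Assumption: exceeding EITHER limit counts as struggling.
    """
    struggles = defaultdict(list)
    for series_id, _model, smape, mase in rows:
        struggles[series_id].append(smape > smape_limit or mase > mase_limit)
    return sorted(sid for sid, flags in struggles.items() if all(flags))


# Made-up rows for illustration only.
demo_rows = [
    ("D1", "AutoETS", 35.0, 1.2),    # struggles on sMAPE
    ("D1", "AutoTheta", 10.0, 1.6),  # struggles on MASE
    ("D2", "AutoETS", 35.0, 1.2),    # struggles on sMAPE
    ("D2", "AutoTheta", 10.0, 0.9),  # fine, so D2 is not a failure case
]
print(failure_cases(demo_rows))
```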
Step 6: Generate Structured Explanation
Provide analysis in this format:
## Baseline Performance Analysis
Based on {N} series from M4-Daily with horizon={H}:
**Overall Winner**: {ModelName}
- Mean sMAPE: {X.XX}% (vs {Y.YY}% for second place)
- Wins on {N} out of {Total} series ({Percentage}%)
- Consistent performance (std dev {X.X}%)
**Key Insights**:
1. {Insight about winner's strengths}
2. {Insight about other models or patterns}
3. {Insight about failure cases or opportunities}
**Recommendation**: Use {ModelName} as production baseline. {Additional context about when to consider alternatives}
**Next Steps**:
- {Actionable recommendation 1}
- {Actionable recommendation 2}
Output
- A structured markdown analysis with a clear winner, key insights, and recommended next steps.
Examples
Example 1: Simple Query
User: "Which model was best?"
Response:
Based on the baseline results (50 series, horizon=14):
**AutoTheta won** with average sMAPE of 12.3%
- Outperformed AutoETS (13.5%) and SeasonalNaive (15.2%)
- Most consistent with std dev of 4.2%
- Recommended for production use
Example 2: Detailed Comparison
User: "Compare AutoETS vs AutoTheta performance"
Response:
## AutoETS vs AutoTheta Comparison
| Metric | AutoETS | AutoTheta | Winner |
|---------------|---------|-----------|----------|
| Mean sMAPE | 13.5% | 12.3% | Theta ✓ |
| Median sMAPE | 12.9% | 11.8% | Theta ✓ |
| Std Dev | 5.1% | 4.2% | Theta ✓ |
| Series Won | 18/50 | 32/50 | Theta ✓ |
**Key Differences**:
- Theta handles trends better (exponential smoothing with drift)
- ETS excels on strong seasonality (state space framework)
- Theta more consistent across diverse patterns
**Recommendation**: Use AutoTheta as default, fallback to AutoETS for highly seasonal data.
Example 3: Metric Interpretation
User: "What does sMAPE of 12.3% mean?"
Response:
sMAPE of 12.3% is **acceptable performance** for forecasting:
- On average, forecasts deviate from actual values by about 12%, measured relative to the average magnitude of the forecast and the actual value
- This is in the "acceptable" range (10-20%), close to the "good" threshold of under 10%
Context: If you're forecasting daily sales of 100 units, a 12.3% sMAPE means your forecast will typically be within ±12 units of the true value.
MASE provides additional context - if MASE < 1.0, this forecast beats a simple "repeat the last seasonal cycle" baseline.
Error Handling
If results files are missing:
I don't see baseline results in nixtla_baseline_m4/.
Please run the baseline command first:
Run: `/nixtla-baseline-m4 horizon=14 series_limit=50`
This will generate the metrics files I need to analyze.
If CSV is malformed:
The results file exists but appears malformed. Expected columns:
- series_id, model, sMAPE, MASE
Please re-run /nixtla-baseline-m4 to regenerate clean results.
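A malformed file can be detected before any analysis with a check along these lines. This is a minimal sketch assuming the four-column format stated in the prerequisites; the function name `validate_results` and the wording of its messages are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = ["series_id", "model", "sMAPE", "MASE"]


def validate_results(text):
    """Return a list of problems found in results CSV text (empty if clean)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        return [f"unexpected header: {header}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(f"line {line_no}: expected 4 fields, got {len(row)}")
            continue
        try:
            float(row[2])
            float(row[3])
        except ValueError:
            problems.append(f"line {line_no}: non-numeric metric value")
    return problems


good = "series_id,model,sMAPE,MASE\nD1,AutoETS,13.45,0.92\n"
bad = "series_id,model,sMAPE,MASE\nD1,AutoETS,n/a,0.92\nD2,AutoTheta,12.0\n"
print(validate_results(good))
print(validate_results(bad))
```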
Resources
For complete technical details, see:
- Architecture: 000-docs/6767-a-OD-ARCH-nixtla-claude-plugin-poc-baseline-lab.md
- Planning: 000-docs/6767-b-PP-PLAN-nixtla-claude-plugin-poc-baseline-lab.md
- Baseline Lab Overview: 000-docs/6767-d-OD-OVRV-nixtla-baseline-lab-overview.md
