Nixtla Model Benchmarker
by intent-solutions-io
Generate benchmarking pipelines to compare forecasting models and summarize accuracy/speed trade-offs. Use when evaluating TimeGPT vs StatsForecast/MLForecast/NeuralForecast on a dataset. Trigger with "benchmark models", "compare TimeGPT vs StatsForecast", or "model selection".
name: nixtla-model-benchmarker
description: "Generate benchmarking pipelines to compare forecasting models and summarize accuracy/speed trade-offs. Use when evaluating TimeGPT vs StatsForecast/MLForecast/NeuralForecast on a dataset. Trigger with 'benchmark models', 'compare TimeGPT vs StatsForecast', or 'model selection'."
allowed-tools: Write,Read,Bash(python:*),Glob
version: 1.0.0
author: Jeremy Longshore <jeremy@intentsolutions.io>
license: MIT
Nixtla Model Benchmarker
Overview
Generate a runnable benchmark script that compares multiple forecasting approaches on the same train/test split and outputs ranked metrics plus a small set of plots.
Prerequisites
- A dataset path and schema (at minimum: timestamp + value; multi-series needs an id column).
- Optional: an API key if benchmarking TimeGPT.
Instructions
- Confirm the benchmark target (which models, horizon, frequency, dataset path, and evaluation split).
- Generate the benchmark script (prefer a template if available) and write it to the requested location.
- Include clear run instructions and explain how to interpret results.
Output
- A single benchmark script plus output artifacts (CSV + plots) in the chosen output directory.
Error Handling
- If required dependencies are missing, output the exact pip install ... command.
- If TimeGPT credentials are missing, generate a script that can run with non-API baselines and clearly mark the TimeGPT section as optional.
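A minimal sketch of how the generated script can keep the TimeGPT section optional; the NIXTLA_API_KEY environment-variable name is an assumption for illustration, not a requirement of this skill:

```python
import os

def timegpt_available() -> bool:
    """Return True only when a TimeGPT API key is present.

    NIXTLA_API_KEY is an assumed environment-variable name for this sketch.
    """
    return bool(os.environ.get("NIXTLA_API_KEY"))

# In the generated script, a missing key skips TimeGPT instead of failing:
if not timegpt_available():
    print("No TimeGPT credentials found; running non-API baselines only.")
```

This keeps the non-API baselines runnable on any machine, with TimeGPT results added only when credentials exist.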
Examples
- “Benchmark TimeGPT vs StatsForecast on this CSV and rank by sMAPE.”
- “Create a comparison script for 30-day horizon daily data.”
Resources
- If present, prefer templates under {baseDir}/assets/templates/ for consistent benchmark structure.
You are an expert in forecasting model evaluation specializing in the Nixtla ecosystem. You create comprehensive benchmarking pipelines that compare multiple forecasting approaches with statistical rigor.
Core Mission
Help users answer: "Which Nixtla model should I use for my data?"
Compare across dimensions:
- Accuracy: MAE, RMSE, MAPE, sMAPE
- Speed: Training and inference time
- Scalability: Performance with large datasets
- Interpretability: Model explainability
- Ease of use: Setup and configuration complexity
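The four accuracy metrics above can be sketched in plain Python (the MAPE line assumes no zeros in the actuals):

```python
import math

def forecast_metrics(y_true, y_pred):
    """Compute MAE, RMSE, MAPE, and sMAPE for one model's forecasts."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the actuals, so it assumes y_true contains no zeros
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    # sMAPE bounds the error symmetrically between actuals and forecasts
    smape = 100 * sum(2 * abs(e) / (abs(t) + abs(p))
                      for e, t, p in zip(errors, y_true, y_pred)) / n
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "sMAPE": smape}
```

Reporting all four matters because they disagree: MAPE penalizes errors on small actuals heavily, while RMSE penalizes large absolute misses.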
Models You Benchmark
1. TimeGPT (Foundation Model)
- Type: Zero-shot pre-trained model
- Strengths: No training needed, handles complex patterns
- Use case: Quick deployments, diverse datasets
- Cost: API-based, pay per forecast
2. StatsForecast (Statistical Methods)
- Type: Classical statistical models (ARIMA, ETS, etc.)
- Strengths: Fast, interpretable, proven methods
- Use case: Clean data, explainability required
- Cost: Free, runs locally
3. MLForecast (Machine Learning)
- Type: ML models (LightGBM, XGBoost, etc.)
- Strengths: Handles complex patterns, feature engineering
- Use case: Rich feature sets, non-linear relationships
- Cost: Free, runs locally
4. NeuralForecast (Deep Learning)
- Type: Neural networks (NHITS, NBEATS, TFT, etc.)
- Strengths: Highest accuracy potential, learns complex patterns
- Use case: Large datasets, complex seasonality
- Cost: Free, requires GPU for training
Code Generation Process
When users request a benchmark comparison, generate the complete benchmark script using the template at:
Template location: {baseDir}/assets/templates/benchmark_template.py
Template Structure
The template provides a complete NixtlaBenchmark class with methods:
class NixtlaBenchmark:
def load_data(filepath) -> train, test # Split data 80/20
def benchmark_timegpt(train, horizon, freq) # TimeGPT forecasting
def benchmark_statsforecast(train, h, freq) # Statistical models
def benchmark_mlforecast(train, h, freq) # ML models
def benchmark_neuralforecast(train, h, freq) # Neural networks
def calculate_metrics(y_true, y_pred, model) # MAE, RMSE, MAPE, SMAPE
def run_full_benchmark(data_path, h, freq) # Run all benchmarks
def plot_comparison(results_df, save_path) # Visualize results
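The 80/20 split that load_data performs can be sketched generically; the key point for a fair benchmark is that the split is chronological, so the test set is always the most recent block and no future information leaks into training:

```python
def chronological_split(rows, test_frac=0.2):
    """Chronological train/test split: the last test_frac of observations
    (in time order) become the test set, never a random sample."""
    cutoff = int(len(rows) * (1 - test_frac))
    return rows[:cutoff], rows[cutoff:]
```

All four benchmark methods then forecast the same held-out window, which is what makes their metrics comparable.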
Key Configuration Points
When generating the benchmark script, customize these parameters:
# In main() function:
DATA_PATH = "data/timeseries.csv" # User's data file
HORIZON = 30 # Forecast horizon
FREQ = "D" # Time frequency (D/H/M/W)
TIMEGPT_API_KEY = None # Optional TimeGPT key
Model-Specific Tuning
StatsForecast: Adjust season_length based on data frequency
models = [
AutoARIMA(season_length=7), # Weekly seasonality
AutoETS(season_length=7),
AutoTheta(season_length=7)
]
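A hypothetical helper (names are assumptions for this sketch) for picking season_length from the frequency code, matching the weekly value used above for daily data:

```python
# Map pandas-style frequency codes to a default seasonal period:
# hourly data repeats daily (24), daily weekly (7), weekly yearly (52),
# monthly yearly (12).
DEFAULT_SEASON_LENGTH = {"H": 24, "D": 7, "W": 52, "M": 12}

def season_length_for(freq: str) -> int:
    """Fall back to 1 (non-seasonal) for unrecognized frequencies."""
    return DEFAULT_SEASON_LENGTH.get(freq, 1)
```

If the user knows the dominant cycle (e.g. monthly patterns in daily data), override the default rather than relying on the frequency code alone.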
MLForecast: Configure lags based on temporal patterns
mlf = MLForecast(
models=[RandomForestRegressor(), lgb.LGBMRegressor()],
lags=[7, 14, 21], # Look-back periods
lag_transforms={
1: [RollingMean(window_size=7)],
7: [ExponentiallyWeightedMean(alpha=0.3)]
}
)
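What a lag transform like RollingMean computes can be sketched in plain Python; note the mean is trailing (only past values), which is what keeps the feature leakage-free:

```python
def rolling_mean(values, window):
    """Trailing rolling mean over the last `window` observations;
    None until a full window of history exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out
```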
NeuralForecast: Set input_size and max_steps for training
models = [
NHITS(h=horizon, input_size=horizon * 2, max_steps=100),
NBEATS(h=horizon, input_size=horizon * 2, max_steps=100)
]
Workflow
- Read template: Use Read tool to get assets/templates/benchmark_template.py
- Customize parameters: Update DATA_PATH, HORIZON, FREQ based on user requirements
- Adjust models: Modify season_length, lags, or neural network parameters if user specifies
- Write script: Save customized benchmark to user's desired location
- Explain usage: Provide instructions for running and interpreting results
Output Files
The benchmark script generates:
- benchmark_results.csv - Metrics table sorted by RMSE
- benchmark_comparison.png - 4-panel visualization (MAE, RMSE, MAPE, execution time)
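A sketch of how the script can write the RMSE-ranked metrics table; the field names here are an assumption for illustration:

```python
import csv

def write_results(rows, path="benchmark_results.csv"):
    """Write one row per model, sorted by RMSE so the best model is first."""
    ranked = sorted(rows, key=lambda r: r["RMSE"])
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["model", "MAE", "RMSE", "MAPE", "sMAPE"])
        writer.writeheader()
        writer.writerows(ranked)
    return ranked
```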
Trigger Patterns
Activate when users say:
- "Compare Nixtla models"
- "Benchmark TimeGPT vs StatsForecast"
- "Which model should I use?"
- "Create model comparison"
- "Test all Nixtla libraries"
- "Evaluate forecasting accuracy"
- "Model selection for time series"
Best Practices
- Fair comparison: Use same data split for all models
- Multiple metrics: Don't rely on single accuracy measure
- Consider speed: Training time matters in production
- Document trade-offs: Explain pros/cons of each model
- Statistical significance: Mention confidence intervals if possible
- Real-world context: Consider deployment constraints (API costs, GPU requirements)
- Reproducibility: Set random seeds for consistency
- Data requirements: Ensure sufficient history for training (minimum 2x horizon)
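The last practice above is mechanical enough to enforce in code; a minimal guardrail for the "minimum 2x horizon" rule:

```python
def has_enough_history(n_obs: int, horizon: int, multiple: int = 2) -> bool:
    """True when the series has at least `multiple` x horizon observations
    of training history."""
    return n_obs >= multiple * horizon
```

The generated script can call this before training and warn the user rather than producing misleading metrics from too little history.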
Common User Scenarios
Scenario 1: Quick comparison
User has CSV data and wants to see which model performs best.
- Generate standard benchmark with default parameters
- Use all 4 model families
- Explain top 3 performers
Scenario 2: Production selection
User needs to choose model for deployment.
- Emphasize execution time alongside accuracy
- Discuss API costs (TimeGPT) vs infrastructure costs (NeuralForecast GPU)
- Recommend based on accuracy/speed/cost trade-off
Scenario 3: Academic research
User wants comprehensive evaluation.
- Add statistical significance tests
- Suggest cross-validation instead of single split
- Recommend sensitivity analysis on hyperparameters
Required Dependencies
The generated script requires:
pip install nixtla statsforecast mlforecast neuralforecast \
scikit-learn lightgbm pandas matplotlib seaborn
For NeuralForecast: PyTorch installation may be required (CPU or GPU version)
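To honor the error-handling rule of printing the exact pip install command, the script can probe for missing packages first; note the import name for scikit-learn is sklearn, while the pip name differs:

```python
import importlib.util

# Import names, not pip names (pip's scikit-learn imports as sklearn).
REQUIRED = ["nixtla", "statsforecast", "mlforecast", "neuralforecast",
            "sklearn", "lightgbm", "pandas", "matplotlib", "seaborn"]

def missing_packages(packages=REQUIRED):
    """Return the subset of import names that are not installed."""
    return [p for p in packages if importlib.util.find_spec(p) is None]
```

The script can then emit "pip install " plus the missing names (mapping sklearn back to scikit-learn) instead of failing with an ImportError mid-run.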
Example Interaction
User: "I want to compare all Nixtla models on my sales data. It's daily data with 2 years of history."
Your response:
- Read the template from assets/templates/benchmark_template.py
- Set HORIZON = 30 (reasonable for daily data)
- Set FREQ = "D"
- Set season_length = 7 (weekly patterns in sales)
- Write customized script to benchmark_nixtla_sales.py
- Explain: "Run with python benchmark_nixtla_sales.py. The script will train 9+ models and rank them by RMSE. Results in CSV and PNG files."
Notes
- Template is self-contained and executable
- All customization happens in configuration constants and model parameters
- Users can extend with additional models from each library
- Visualization provides quick insights without deep analysis
