---
name: sqlmesh-expert
description: Use when working with SQLMesh projects - creating models, reviewing plans, validating changes, debugging failures, or deploying to production environments.
---
SQLMesh Expert Skill
This skill enables an agent to understand SQLMesh architecture, implement models correctly, validate projects, review changes safely, and deploy with minimal risk.
When to use this skill
Use this skill when the task involves:
- Implementation: Creating or modifying SQLMesh models, audits, tests, macros, or configuration
- Review: Reviewing a PR/diff in a SQLMesh repo
- Validation: Checking that a SQLMesh project is correctly configured and runnable
- Planning: Producing or interpreting a `sqlmesh plan` (backfills, restatements, forward-only)
- Comparison: Comparing dev results to prod via `sqlmesh table_diff`
- Debugging: Investigating missing intervals, data gaps, audit failures, or test failures
- Optimization: Improving model performance, incremental strategy selection, or backfill efficiency
Non-negotiable safety rules
- Explain-first: Always run `sqlmesh plan --explain` before applying anything.
- Never run destructive commands without explicit user approval:
  - `sqlmesh destroy` (removes warehouse objects + state)
  - `sqlmesh migrate` (impacts shared state; affects all users)
  - `sqlmesh rollback` (global impact)
  - `sqlmesh invalidate` (removes environments)
- Do not leak secrets: Never print connection strings, tokens, passwords, or credentials from configs or env vars.
- Prefer dev environments for iteration: Treat prod like a museum—look, measure, update safely, don't smash.
- Assess blast radius: Before applying, understand how many models will backfill and for what time range.
Part 1: Architectural Foundations
Understanding SQLMesh's architecture is essential for effective debugging, optimization, and implementation decisions.
1.1 Declarative vs Imperative
SQLMesh operates on a semantic understanding of data transformation logic. Unlike traditional orchestrators (Airflow, scripts) that execute tasks in order, SQLMesh:
- Parses the Abstract Syntax Tree (AST) of SQL to build dependency graphs automatically
- Treats the warehouse as a function of the code repository
- Uses SQLGlot for cross-dialect transpilation and semantic analysis
Key insight: A textual change (adding a comment) does not alter the model fingerprint; a semantic change (modifying a WHERE clause) does.
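For illustration (hypothetical model query), only the second edit below alters the AST, so only it changes the fingerprint and triggers a backfill:

```sql
-- Textual change: adding a comment leaves the fingerprint intact
SELECT id, amount  -- amounts are net of tax
FROM raw.orders;

-- Semantic change: a new WHERE clause alters the AST and the fingerprint
SELECT id, amount
FROM raw.orders
WHERE amount > 0;
```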
1.2 Virtual Data Environments (VDEs)
VDEs enable zero-downtime deployments and isolated development sandboxes.
How it works:
- Fingerprinting: Each model version gets a cryptographic hash derived from:
  - Query logic (AST)
  - Configuration
  - Fingerprints of all upstream dependencies (Merkle tree structure)
- Physical vs Virtual layers:
| Layer | Naming Convention | Lifecycle |
|---|---|---|
| Virtual View | `target_schema.table_name` | Mutable, updated on apply |
| Physical Table | `sqlmesh__schema.table_<fingerprint>` | Immutable, persistent |
| Dev View | `target_schema__dev.table_name` | Ephemeral, per-user |
| State Tables | `sqlmesh._snapshots` | Persistent metadata |
- Pointer Swap Protocol:
  - SQLMesh creates or reuses the physical table matching the fingerprint
  - On apply, it executes an atomic `CREATE OR REPLACE VIEW` to point at the correct physical table
  - Dev environments are just different views pointing to the same physical tables
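Conceptually, an apply boils down to DDL like the following (fingerprint suffix hypothetical), which is why promotion is atomic and near-instant:

```sql
-- Prod promotion: repoint the virtual view at the new physical table
CREATE OR REPLACE VIEW target_schema.table_name AS
SELECT * FROM sqlmesh__schema.table_a1b2c3d4;

-- A dev environment is just another view over the same physical layer
CREATE OR REPLACE VIEW target_schema__dev.table_name AS
SELECT * FROM sqlmesh__schema.table_a1b2c3d4;
```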
Agent validation rules:
- Virtual views must always resolve to valid physical tables
- Never manually drop physical tables
- State tables are essential for lineage; do not corrupt them
1.3 State Management and Snapshots
SQLMesh maintains internal state in a relational database, recording every model version as Snapshots.
State Store considerations:
- Modern implementations use normalized schemas (not monolithic JSON)
- For teams: Prefer Postgres/MySQL over SQLite/DuckDB (concurrency)
- The Janitor process handles garbage collection of expired snapshots and orphaned tables
Diagnosing state bloat:
- Symptom: Slow plan generation
- Solution: Run `sqlmesh janitor` or adjust `snapshot_ttl`
- Ensure the janitor runs periodically (often via CI/CD)
Part 2: Project Configuration
2.1 Project Anatomy
A typical SQLMesh project includes:
```
project/
├── config.yaml (or config.py)
├── models/       # SQL and Python model definitions
├── audits/       # Shared audits (can also be inline)
├── tests/        # Unit tests
├── macros/       # Reusable SQL macros
└── seeds/        # Static CSV data
```
2.2 Configuration: YAML vs Python
YAML (config.yaml):
- Static, declarative
- Simpler, less prone to code execution risks
- Suitable for small, rigid projects
Python (config.py):
- Dynamic, enterprise standard
- Allows environment variable injection, conditional logic
- Security risks require monitoring
Security protocol for config.py:
```python
import os

# ANTI-PATTERN: hardcoded secret
password = "super_secret_password"

# BEST PRACTICE: read the secret from the environment
password = os.environ.get("DB_PASSWORD")
```
Validation checklist:
- No hardcoded credentials in config.py
- `.env` files are listed in `.gitignore`
- Sensitive values come from environment variables or secret managers
2.3 Gateway Architecture
Gateways define how SQLMesh connects to compute engines and state stores.
| Parameter | Function | Agent Check |
|---|---|---|
| `connection` | Compute engine (Snowflake, DuckDB, etc.) | Matches target infrastructure |
| `state_connection` | Metadata storage | Distinct for isolated systems; Postgres for teams |
| `state_schema` | SQLMesh internal tables | Default `sqlmesh`; verify write permissions |
| `scheduler` | Execution backend | Compatible with orchestration layer |
Isolated Systems (Air-Gapped):
- Separate `state_connection` for dev and prod gateways
- A plan in dev cannot leverage prod data for backfill forecasting
- Synchronization happens at the "promote to prod" step
2.4 Physical Schema Configuration
```yaml
physical_schema_mapping:
  analytics: analytics_physical
physical_table_naming_convention: schema_and_table  # or hash_md5
```
Warning: Changing physical_table_naming_convention invalidates all fingerprints, forcing complete warehouse rebuild.
Part 3: The Semantic Modeling Layer
3.1 SQL Models
Models are defined in .sql files with a metadata header:
```sql
MODEL (
  name schema.model_name,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  ),
  cron '@daily',
  start '2023-01-01',
  audits (not_null(columns := (id)))
);

SELECT
  id::INT,
  event_date::DATE,
  amount::DECIMAL(10, 2)
FROM upstream.events
WHERE event_date BETWEEN @start_ds AND @end_ds
```
Transpilation: SQLMesh can execute SQL written in one dialect on another engine (e.g., DuckDB → Databricks).
Best practices:
- Explicitly cast columns in the final SELECT (`::INT`, `::TEXT`)
- Verify the `dialect` setting in `model_defaults`
- Run `sqlmesh plan` to validate transpilation compatibility
3.2 Python Models
For logic exceeding SQL expressiveness:
```python
import pandas as pd
from sqlmesh import ExecutionContext, model


@model(
    "schema.python_model",
    columns={
        "id": "INT",
        "score": "FLOAT",
    },
    depends_on=["schema.upstream_model"],
)
def execute(context: ExecutionContext, start, end, execution_time, **kwargs) -> pd.DataFrame:
    # context.table() resolves the model name to its physical table name;
    # fetch the data explicitly rather than treating the name as a DataFrame.
    table = context.table("schema.upstream_model")
    df = context.fetchdf(f"SELECT * FROM {table}")
    # Process data... use execution_time, never datetime.now(), for idempotency.
    return df
```
Critical rules:
- Must return a DataFrame (Pandas or Spark)
- Dependencies require explicit declaration (`depends_on` or `context.table()`)
- Idempotency: Never use `datetime.now()`; use `context.execution_time`
3.3 Macros and Jinja
Macros enable reusable logic (defined in macros/ directory):
```sql
{% macro calculate_tax(amount, rate) %}
    {{ amount }} * {{ rate }}
{% endmacro %}
```
Agent checks:
- Verify macros are "pure" (return SQL strings, no side effects)
- Ensure global variables (`@VAR`) are defined in config
- Undefined variables cause compile-time errors
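To make the purity check concrete, here is a sketch of invoking the Jinja macro above from a SQL model (model and column names hypothetical); note that SQLMesh expects Jinja queries to be wrapped in `JINJA_QUERY_BEGIN`/`JINJA_END` markers:

```sql
MODEL (
  name schema.orders_with_tax
);

JINJA_QUERY_BEGIN;
SELECT
  order_id,
  {{ calculate_tax('amount', 0.2) }} AS tax  -- renders to: amount * 0.2
FROM schema.orders;
JINJA_END;
```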
3.4 Seeds and External Models
Seeds: Static CSV files committed to repo
- Keep small (<100MB)
- Verify UTF-8 encoding
- Large datasets should use ETL tools + External Models
External Models: Data not managed by SQLMesh (raw ingestion tables)
- Act as boundary layer
- Include audits (e.g., `not_null`) to validate incoming data quality
- Create "Data Contracts" at the ingestion point
Part 4: Incremental Strategies
4.1 Strategy Selection Matrix
| Data Characteristic | Recommended Kind | Agent Check |
|---|---|---|
| Immutable event stream | `INCREMENTAL_BY_TIME_RANGE` | Check `time_column` & WHERE filter |
| Mutable dimensions (updates) | `INCREMENTAL_BY_UNIQUE_KEY` | Check `unique_key` |
| Small static lookup | `FULL` or `SEED` | Check row count (< 1M rows) |
| Complex history tracking | `SCD_TYPE_2` | Verify `valid_from`/`valid_to` logic |
| Massive table schema change | `INCREMENTAL` + `forward_only` | Warn about historical inconsistency |
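SCD_TYPE_2 has no example elsewhere in this skill, so here is a minimal sketch of SQLMesh's time-based variant (table and column names hypothetical). It tracks changes via an `updated_at` column and manages the `valid_from`/`valid_to` columns itself:

```sql
MODEL (
  name schema.customer_history,
  kind SCD_TYPE_2_BY_TIME (
    unique_key customer_id
  ),
  cron '@daily'
);

SELECT
  customer_id::INT,
  status::TEXT,
  updated_at::TIMESTAMP  -- drives valid_from/valid_to maintenance
FROM raw.customers
```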
4.2 INCREMENTAL_BY_TIME_RANGE
Most efficient for large-scale event processing:
```sql
MODEL (
  name schema.events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  ),
  cron '@daily',
  start '2023-01-01'
);

SELECT * FROM raw.events
WHERE event_date BETWEEN @start_ds AND @end_ds -- REQUIRED
```
Validation rule: If kind is INCREMENTAL_BY_TIME_RANGE, the WHERE clause MUST filter by the time column. Without this, full table scans occur on every run.
Data gaps: SQLMesh tracks specific intervals. If Tuesday fails but Wednesday succeeds, Tuesday is marked missing and will be filled on next run.
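For example, if only Tuesday 2023-01-03 is missing for a daily model, the next run renders the time filter to roughly:

```sql
-- Only the missing interval is recomputed
SELECT * FROM raw.events
WHERE event_date BETWEEN '2023-01-03' AND '2023-01-03'
```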
4.3 INCREMENTAL_BY_UNIQUE_KEY
For datasets with record updates (not just appends):
```sql
MODEL (
  name schema.users,
  kind INCREMENTAL_BY_UNIQUE_KEY (
    unique_key user_id
  )
);
```
Uses MERGE (upsert) operations.
Validation: Verify unique_key is truly unique. Duplicate keys cause non-deterministic behavior. Add unique_values audit on key columns.
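A sketch combining the kind with the recommended key audits:

```sql
MODEL (
  name schema.users,
  kind INCREMENTAL_BY_UNIQUE_KEY (
    unique_key user_id
  ),
  audits (
    not_null(columns := (user_id)),
    unique_values(columns := (user_id))  -- catches duplicate keys early
  )
);
```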
4.4 Forward-Only Models
For massive tables where full backfill is impractical:
```sql
MODEL (
  name schema.huge_table,
  kind INCREMENTAL_BY_TIME_RANGE (
    ...,
    forward_only true,
    on_destructive_change allow
  )
);
```
Critical warnings:
- Breaks semantic consistency (data is not a pure function of code)
- New columns are NULL for historical data
- Dropped columns cause query failures for old data
- Requires explicit `on_destructive_change allow`
- Downstream models must handle potential schema mismatches (see the sketch below)
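For instance, a downstream model can guard against the NULL history introduced by a forward-only column addition (column names hypothetical):

```sql
SELECT
  id,
  -- new_col is NULL for intervals loaded before the forward-only deploy
  COALESCE(new_col, 'unknown') AS new_col_clean
FROM schema.huge_table
```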
Part 5: Quality Assurance
5.1 Unit Tests
Tests are defined in tests/ directory as YAML:
```yaml
test_order_total:
  model: schema.orders
  inputs:
    raw.order_items:
      - {order_id: 1, quantity: 2, price: 10.00}
      - {order_id: 1, quantity: 1, price: 5.00}
  outputs:
    query:
      - {order_id: 1, total: 25.00}
```
Best practices:
- Every model with complex logic should have tests
- Use static dates (e.g., `'2023-01-01'`), not `CURRENT_DATE()`
- Test individual CTEs for large models
- Run tests: `sqlmesh test` or `sqlmesh test -k <pattern>`
5.2 Audits
Audits validate data contracts. A query returning zero rows = pass.
```sql
MODEL (
  name schema.orders,
  audits (
    not_null(columns := (order_id, customer_id)),
    unique_values(columns := (order_id)),
    accepted_values(column := status, is_in := ('pending', 'shipped', 'delivered'))
  )
);
```
Types:
- Blocking (default): Failed audit stops the pipeline
- Non-blocking: Warnings only (useful for statistical anomalies)
Standard audits to enforce:
- `not_null` on critical keys
- `unique_values` on primary keys
- `accepted_values` for enum fields
- Domain-specific assertions (see the custom audit sketch below)
Run explicitly: `sqlmesh audit` or `sqlmesh audit --model <schema.model>`
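Domain-specific assertions can be written as custom audits in the audits/ directory; any returned row is a violation. A minimal sketch (audit name and column hypothetical):

```sql
AUDIT (
  name assert_non_negative_amounts
);

-- Rows returned here are failures; zero rows means the audit passes
SELECT *
FROM @this_model
WHERE amount < 0
```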
5.3 Table Diff
Compare candidate model against production:
```bash
sqlmesh table_diff prod:dev schema.model_name
```
Options:
```bash
# Specify join keys if grain is not defined
sqlmesh table_diff prod:dev schema.model -o key1 -o key2

# Show sample rows (can be wide)
sqlmesh table_diff prod:dev schema.model --show-sample

# Control decimal precision for floats
sqlmesh table_diff prod:dev schema.model --decimals 3

# Skip expected-to-differ columns
sqlmesh table_diff prod:dev schema.model -s updated_at -s ingested_at
```
Interpreting results:
- Zero diff: Ideal for refactoring (code change, same logic)
- Expected diff: Logic changed; diff should be explainable
- Unexpected diff: Large variance without clear reason = red flag
Part 6: Core Workflow (Dev → Validate → Prod)
Step 1: Preflight Validation
```bash
# Project info and connection check
sqlmesh info
# or without warehouse:
sqlmesh info --skip-connection

# Lint all models
sqlmesh lint

# Run unit tests
sqlmesh test

# Optional: formatting check
sqlmesh format --check

# Optional: DAG visualization
sqlmesh dag dag.html
```
Step 2: Generate and Review Dev Plan
```bash
sqlmesh plan dev --explain
```
Extract and summarize:
- Which models changed (direct + downstream impact)
- Change category (breaking/non-breaking/forward-only)
- Backfill scope (models, time range, compute cost)
- Restatement presence (`--restate-model`)
Key flags:
- `--no-gaps`: Ensure no interval gaps
- `--skip-backfill` / `--dry-run`: Create the plan without computing
- `--diff-rendered`: Diff rendered SQL
- `--no-prompts`: CI mode
Step 3: Apply to Dev (only when explicitly requested)
```bash
sqlmesh plan dev
```
Audits run automatically. Failed audits = stop-the-line signal.
Step 4: Validate Outputs
```bash
# Run audits explicitly
sqlmesh audit

# Run tests
sqlmesh test

# Diff dev vs prod (most important safety check)
sqlmesh table_diff prod:dev schema.model_name
```
Step 5: Promote to Prod
```bash
# Explain first
sqlmesh plan --explain
# or:
sqlmesh plan prod --explain

# Apply after review
sqlmesh plan
# or:
sqlmesh plan prod

# Optional: run latest intervals as part of the plan
sqlmesh plan --run
```
Part 7: Special Scenarios
7.1 Restatements
Recompute historical intervals for existing models:
```bash
# Explain first
sqlmesh plan prod --restate-model schema.model --explain

# Apply
sqlmesh plan prod --restate-model schema.model
```
Gotchas:
- Restatement cascades downstream (expect bigger blast radius)
- Cannot restate a model new to the environment; it must already exist
7.2 Forward-Only Deployments
Trade historical backfill for deploy practicality:
```bash
# Explain first
sqlmesh plan prod --forward-only --explain

# Apply if approved
sqlmesh plan prod --forward-only
```
Notes:
- Avoids backfills
- History won't be recomputed—validate with targeted diffs and audits
- May use temporary tables/shallow clones depending on engine
7.3 Backfill Optimization
For large datasets:
```sql
MODEL (
  name schema.large_model,
  kind INCREMENTAL_BY_TIME_RANGE (
    ...,
    batch_size 30  -- process intervals in monthly chunks
  )
);
```
Guidance:
- Set `batch_size` to prevent OOM errors
- SQLMesh runs backfills in parallel where dependencies allow
- Check dependency chains to maximize parallelism
Part 8: Operational Commands
Generally Safe
```bash
# List environments
sqlmesh environments

# Check missing intervals
sqlmesh check_intervals prod

# Run missing intervals
sqlmesh run prod
```
High-Risk (require explicit approval)
```bash
# Invalidate environment (objects are cleaned up later by the janitor)
sqlmesh invalidate <env>

# Janitor cleanup
sqlmesh janitor

# Global state changes
sqlmesh migrate
sqlmesh rollback

# Nuclear option
sqlmesh destroy
```
Part 9: Troubleshooting and Anti-Patterns
9.1 The "Cron Inference" Trap
Symptom: Model fails or backfills from the beginning of time unexpectedly.
Cause: The incremental model has a cron but no explicit start date.
Fix: Give every incremental model an explicit `start` date, as sketched below.
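A minimal sketch of the corrected header (names reused from earlier examples):

```sql
MODEL (
  name schema.events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  ),
  cron '@daily',
  start '2023-01-01'  -- explicit start bounds the backfill window
);
```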
9.2 DuckDB Concurrency Locks
Symptom: "Database lock" errors during local dev or CI. Cause: Multiple processes writing to local DuckDB state file (single-writer). Fix: Use server-based state store (Postgres/MySQL) for teams; close other connections locally.
9.3 Infinite Loops in Recursive CTEs
Symptom: Plan generation hangs indefinitely.
Cause: A WITH RECURSIVE block without a proper termination condition.
Fix: Statically analyze recursive blocks; ensure bounded recursion depth.
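A bounded-recursion pattern to check for (table names hypothetical):

```sql
WITH RECURSIVE org_chart AS (
  SELECT id, manager_id, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.manager_id, c.depth + 1
  FROM employees e
  JOIN org_chart c ON e.manager_id = c.id
  WHERE c.depth < 20  -- explicit depth bound guards against cycles
)
SELECT * FROM org_chart
```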
9.4 Mutable External Inputs
Symptom: Data quality issues despite no code changes.
Cause: Model queries a raw table directly without defining it as an EXTERNAL model.
Fix: Define all external inputs as EXTERNAL models so their schemas are captured in the fingerprint.
9.5 SELECT * in Production
Symptom: Performance degradation, fragile dependencies.
Cause: SELECT * creates implicit dependencies on all columns.
Fix: Enforce explicit column selection—reduces I/O, makes dependency graph precise.
9.6 Destructive Changes
Symptom: Deployment fails or downstream reports break.
Cause: Column dropped or renamed.
Protocol:
- Check the `on_destructive_change` setting
- If set to `error` (default), the plan is blocked
- For intentional drops: explicit acknowledgment or a forward-only plan
- Recommend view models to alias columns for deprecation (see the sketch below)
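A sketch of the deprecation-alias pattern (model and column names hypothetical), assuming `order_id` was renamed to `order_ref` upstream:

```sql
MODEL (
  name schema.orders_compat,
  kind VIEW
);

SELECT
  order_ref,
  order_ref AS order_id  -- keep the renamed column alive for consumers
FROM schema.orders
```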
9.7 Python Model Non-Idempotency
Symptom: Re-running model for past date produces different results.
Cause: Python model uses datetime.now() instead of execution context.
Fix: Replace with context.execution_time or use execution-provided timestamps.
Part 10: Decision Matrices
10.1 Plan Validation Matrix
| Change Type | Detection Logic | Action Required | Risk Level |
|---|---|---|---|
| Logic Change | AST difference in query | Run `table_diff`, check backfill cost | Medium |
| Breaking Change | Affects downstream result | Alert user to cascade effect | High |
| Non-Breaking | Additive (e.g., new column) | Verify `forward_only` eligibility | Low |
| Destructive | Drop column / change type | Block unless `on_destructive_change: allow` | Critical |
| Config Change | `cron` or `owner` change | Update metadata (Virtual Update) | Negligible |
10.2 Model Optimization Heuristics
| Observation | Diagnosis | Recommendation |
|---|---|---|
| INCREMENTAL model lacks WHERE clause | Full table scan (inefficient) | Add `WHERE time_col BETWEEN @start_ds AND @end_ds` |
| `batch_size` missing for large table | Potential OOM risk | Set `batch_size` (e.g., 30) |
| `unique_key` uses high-cardinality columns | Slow merge performance | Check clustering/partitioning keys |
| Python model calls `datetime.now()` | Non-idempotent | Replace with `context.execution_time` |
| Raw table in FROM clause | Unmanaged dependency | Define as EXTERNAL model |
Part 11: Report Templates
A) Project Validation Report
```markdown
## Project Validation Report

### What I Checked
- [ ] Config present and valid
- [ ] Required directories exist (models/, etc.)
- [ ] `sqlmesh info` results (models count, connection status)
- [ ] Lint status
- [ ] Test status

### Findings
- Key risks:
- Missing elements:

### Recommendations (priority order)
1. ...
2. ...
```
B) Plan Review Report
```markdown
## Plan Review Report

### Plan Target
- Environment: dev / prod

### Change Summary
- Direct model edits:
- Downstream impact:
- Change categorization: breaking / non-breaking / forward-only

### Backfill/Restatement
- Models:
- Time range:
- Expected compute cost:

### Validation Evidence
- Audits: pass / fail
- Tests: pass / fail
- table_diff highlights:

### Go/No-Go Decision
- Decision:
- Justification:
- Risks:
```