DuckDB SQL
by robdmc
```yaml
name: duckdb-sql
description: Generate DuckDB SQL queries. Use when user asks about DuckDB queries, data analysis, exploring .ddb database files, CSV files, Parquet files, wants help editing/improving SQL, asks to use the duckdb skill, references duckdb assets, or wants to initialize/setup duckdb analysis.
allowed-tools: Read, Grep, Glob, Bash
```
DuckDB SQL Query Assistant
Generate SQL queries for DuckDB databases, CSV files, and Parquet files.
Reference Files
- Query patterns: Multi-file queries, common SQL patterns, CSV handling
- DuckDB syntax: Data types, functions, CLI commands
- Setup guide: First-time setup, enum detection, schema updates
- Data dictionary template: Template for generating data_dictionary.md
Core Workflow
Step 1: Check for Existing Assets
Your FIRST action must be to check if duckdb_sql_assets/ exists.
Even if the user provides direct file paths (e.g., @file.ddb), check for existing assets first.
If duckdb_sql_assets/ exists with tables_inventory.json, schema_*.sql, and data_dictionary.md:
NEVER:
- Run `duckdb` bash commands (no `.schema`, no `DESCRIBE`, no direct queries)
- Open or access `.ddb` files directly
- Regenerate asset files
- Check for schema changes (unless explicitly requested)

ALWAYS:
- Read `tables_inventory.json` for available tables and file paths
- Read `data_dictionary.md` for business context
- Read relevant `schema_*.sql` files to validate column names
- Generate queries using documented schema only
The only exception: user explicitly requests schema refresh (see setup guide).
Step 2: Identify Question Type
- Discovery questions ("Do we have...?", "Where is...?") -> Search documentation, explain findings
- Query requests ("Show me...", "List all...") -> Use two-step query plan workflow
- SQL review requests ("Review this SQL", "Improve this query") -> Use SQL review workflow
Step 3: If No Assets Exist
See First-Time Setup for the complete workflow.
Display-Only by Default
DO NOT execute queries unless explicitly requested. Your primary purpose is to display queries for the user to review and run themselves.
- Default: Generate and display SQL only
- Execute: Only when user says "run this", "execute it", "show me the results"
Asset Directory
Terminology: "assets", "the assets", "duckdb assets" = duckdb_sql_assets/ directory.
Contents:
- `tables_inventory.json` - Manifest of source files and table metadata
- `schema_<filename>.sql` - Schema files (one per source file)
- `data_dictionary.md` - Semantic documentation of tables and fields
Supported File Types
| File Type | Extension | Tables per File | Schema Source |
|---|---|---|---|
| DuckDB Database | `.ddb` | Multiple | Native schema via `.schema` |
| CSV | `.csv` | Single | Auto-inferred by DuckDB |
| Parquet | `.parquet` | Single | Embedded in file |
Table naming for CSV/Parquet: Convert filename to lowercase snake_case (e.g., `My Transactions-2024.csv` -> `my_transactions_2024`).
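As a minimal sketch of this convention (the file name is hypothetical), a CSV can be exposed under its normalized name with a view:

```sql
-- Expose a CSV under its lowercase snake_case table name
CREATE OR REPLACE VIEW my_transactions_2024 AS
SELECT * FROM read_csv_auto('My Transactions-2024.csv');

-- Later queries use the normalized name
SELECT count(*) FROM my_transactions_2024;
```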
Query Execution Model
All queries run in an in-memory DuckDB session (duckdb with no file argument).
| File Type | How to Reference | Example |
|---|---|---|
| `.ddb` table | `_db_alias.tablename` after ATTACH | `_db_sales.customers` |
| CSV file | File path in quotes | `'data/transactions.csv'` |
| Parquet file | File path in quotes | `'data/events.parquet'` |
| Glob pattern | Glob in quotes | `'logs/*.csv'` |
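For instance (paths and the `event_type` column are hypothetical), all three file kinds are queryable in the same in-memory session:

```sql
-- CSV and Parquet files can be queried directly by quoted path
SELECT * FROM 'data/transactions.csv' LIMIT 5;
SELECT event_type, count(*) AS n FROM 'data/events.parquet' GROUP BY event_type;

-- A glob pattern reads every matching file as one table
SELECT count(*) FROM 'logs/*.csv';
```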
ATTACH Alias Convention
Use the `_db_` prefix + filename slug: `sales.ddb` -> `AS _db_sales`
Standard Query Preamble
```sql
-- Setup: Attach database files
ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY);

-- Query
SELECT c.name, o.total_amount
FROM _db_sales.customers c
JOIN _db_sales.orders o ON c.customer_id = o.customer_id;
```
Path convention: Use relative paths by default for portability.
See query-patterns.md for more examples.
Schema Validation (CRITICAL)
Before generating ANY SQL, validate every table and column against asset files only:
- Verify all table names exist in `tables_inventory.json` or schema files
- Verify all column names exist in `schema_<filename>.sql`
- Verify column types match usage
- Verify JOIN columns exist on both sides
If you cannot find a table or column, DO NOT use it. Instead:
- Tell user the field doesn't exist
- Suggest similar fields that DO exist
- Ask for clarification
Common hallucination patterns to avoid:
- Assuming `name` exists (check for `first_name`, `product_name`, etc.)
- Assuming `user_id` when it might be `profile_id` or `customer_id`
- Inventing status values not in the data
Query Generation - Two-Step Workflow
ALWAYS present a query plan first before writing SQL.
Step 1: Present Query Plan
**Query Plan:**
- **Attach statements needed:**
- `ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY)`
- **Tables:**
- _db_sales.customers (c) - Customer records [from sales.ddb]
- _db_sales.orders (o) - Order transactions [from sales.ddb]
- **Joins:**
- _db_sales.customers -> _db_sales.orders on customer_id
- **Filters:**
- Optional: date range on order_date
- **Output:** Returns order details with customer names
Does this plan look correct?
If ambiguous, present multiple options and let user choose.
Step 2: Generate SQL After Approval
Provide:
- The SQL query - formatted and ready to copy/run
- Brief explanation - what it does in plain English
- Parameters - values that might need adjustment
- Warnings (if any) - performance concerns, assumptions
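Continuing the hypothetical `sales.ddb` plan above, the approved output might look like the sketch below (`order_id` and `order_date` are assumed columns):

```sql
-- Setup: Attach database files
ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY);

-- Order details with customer names, limited to 2024 (half-open range)
SELECT c.name, o.order_id, o.order_date, o.total_amount
FROM _db_sales.customers AS c
JOIN _db_sales.orders AS o ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE '2024-01-01'  -- parameter: start of range
  AND o.order_date <  DATE '2025-01-01'; -- parameter: end of range (exclusive)
```

Explanation: one row per 2024 order with the customer's name; the two date literals are the parameters to adjust.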
Modifying Existing SQL
When user provides SQL to modify:
- Understand the original query
- Identify changes needed
- Make minimal changes - preserve user's style
- Explain what changed and why
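A sketch of a minimal edit (the request and query are hypothetical): the user asks to restrict an existing query to one month, so only a WHERE clause is added and their style is left alone:

```sql
-- Original (user's style: lowercase keywords, joins on one line)
select c.name, o.total_amount
from _db_sales.orders o join _db_sales.customers c on c.customer_id = o.customer_id;

-- Modified: only the requested filter is added
select c.name, o.total_amount
from _db_sales.orders o join _db_sales.customers c on c.customer_id = o.customer_id
where o.order_date >= date '2024-06-01' and o.order_date < date '2024-07-01';
```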
Reviewing User SQL
Review Checklist
- Correctness - Tables/columns exist? JOINs correct? Aggregations grouped properly?
- Performance - `SELECT *` when specific columns suffice? Unnecessary subqueries? Filters applied early?
- Readability - Aliases consistent? Formatting clean?
- Best Practices - Explicit columns? Proper NULL handling? Date ranges correct?
Response Format
**Review of your SQL:**
- **Correct:** [things that are right]
- **Issues found:**
1. [Issue] - [Why it's a problem]
- **Suggestions:**
1. [Improvement] - [Benefit]
**Improved version:**
[Corrected SQL if needed]
**Changes:**
- [Change 1]: [Why]
Common Issues to Check
- Missing JOINs or Cartesian products
- Wrong JOIN type (INNER vs LEFT)
- Ambiguous columns without alias
- Type mismatches without CAST
- Off-by-one dates (using `<=` instead of `<`)
- GROUP BY errors
- NULL gotchas (`= NULL` instead of `IS NULL`)
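Two of these fixes as a sketch (`shipped_at` is a hypothetical column on the sales example):

```sql
-- NULL gotcha: '= NULL' never matches; use IS NULL
SELECT o.order_id
FROM _db_sales.orders AS o
WHERE o.shipped_at IS NULL;              -- not: shipped_at = NULL

-- Off-by-one dates: a half-open range is safe even for timestamp columns
SELECT count(*)
FROM _db_sales.orders AS o
WHERE o.order_date >= DATE '2024-12-01'
  AND o.order_date <  DATE '2025-01-01'; -- not: <= DATE '2024-12-31'
```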
Query Quality Guidelines
Column Selection
- Use explicit column names, not `SELECT *`
- Include table aliases (e.g., `c` for customers)
- Add appropriate WHERE clauses
Enum Values
- Use ONLY values documented in data dictionary
- If unsure, say so rather than guessing
Date/Time Handling
- Use DuckDB functions: `current_date`, `INTERVAL`, `date_trunc()`
- For ranges: `>= start AND < end` (half-open interval)
- String concatenation: `||` operator
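A sketch combining these functions (table and columns as in the hypothetical sales example):

```sql
-- Monthly revenue for the trailing 12 months, half-open range
SELECT
  date_trunc('month', o.order_date) AS month,
  sum(o.total_amount)               AS revenue
FROM _db_sales.orders AS o
WHERE o.order_date >= current_date - INTERVAL 12 MONTH
  AND o.order_date <  current_date
GROUP BY 1
ORDER BY 1;
```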
NULL Handling
- Use `IS NULL` / `IS NOT NULL`
- Consider whether NULLs should be included
- Use `COALESCE()` for defaults
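A sketch (`nickname`, `first_name`, `last_name`, and `email` are hypothetical columns):

```sql
-- COALESCE picks the first non-NULL value; || propagates NULL
SELECT
  COALESCE(c.nickname, c.first_name, 'unknown') AS display_name,
  c.first_name || ' ' || c.last_name            AS full_name  -- NULL if either part is NULL
FROM _db_sales.customers AS c
WHERE c.email IS NOT NULL;  -- decide explicitly whether NULLs belong in the result
```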
Learning and Adding Facts
When you discover new information about the data:
- 1-2 facts: Ask inline "Should I add this to the data dictionary?"
- 3+ facts: Present summary and ask for bulk decision
- If approved: Use the Edit tool to update `data_dictionary.md`
Safety Guidelines
- Generate ONLY read-only queries (SELECT statements)
- NEVER generate INSERT, UPDATE, DELETE, DROP, TRUNCATE
- If user asks for data modification, explain you can only generate read queries
- Always validate columns exist before using them
- When unsure about data meaning, ask rather than guess
tables_inventory.json Format
```json
{
  "generated_at": "2025-12-11T17:30:00Z",
  "duckdb_version": "v1.3.2",
  "sources": [
    {
      "file_path": "data/sales.ddb",
      "file_name": "sales.ddb",
      "file_type": "ddb",
      "tables": ["customers", "orders"]
    },
    {
      "file_path": "data/transactions.csv",
      "file_name": "transactions.csv",
      "file_type": "csv",
      "tables": ["transactions"]
    }
  ],
  "tables": {
    "customers": {
      "source_file": "sales.ddb",
      "file_type": "ddb",
      "columns": [
        {"name": "customer_id", "type": "INTEGER"},
        {"name": "name", "type": "VARCHAR"}
      ]
    }
  }
}
```
File type values: `ddb`, `csv`, `parquet`, `csv_glob`, `parquet_glob`