Data Wrangler
by benchflow-ai
---
name: data-wrangler
description: >
  Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases, apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally write results to files. Use when the user wants to: (1) Clean, filter, or transform data, (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.), (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results. Prefer this over in-context reasoning for datasets with thousands of rows or complex transformations.
---
# Data Wrangler

Transform and export data using DuckDB SQL.
## Contents
- Usage - Command syntax and Windows escaping
- Explore Mode - Quick data profiling
- Query Mode - Return results to Claude
- Write Mode - Export to files
- Request/Response Format - JSON structure
- Source Types - File, database, and cloud sources
- Transformations - SQL patterns reference
- Secrets - Secure credential handling
## Usage

IMPORTANT - Windows Shell Escaping:

- Always `cd` to the skill directory first
- Use double quotes for `echo` with escaped inner quotes (`\"`)
- Use forward slashes in file paths

```
cd "<skill_directory>" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py
```
## Explore Mode

Get schema, statistics, and sample in one call. Use before writing queries to understand data structure.

```json
{"mode": "explore", "path": "D:/data/sales.csv"}
```
Response:

```json
{
  "file": "D:/data/sales.csv",
  "format": "csv",
  "row_count": 15234,
  "columns": [
    {"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
    {"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
  ],
  "sample": "| order_id | customer | ... |\\n|----------|----------|-----|\\n| 1001 | Alice | ... |"
}
```
Options:

- `sample_rows`: Number of sample rows (default: 10, max: 100)
- `sources`: For database tables (same as query mode)
## Query Mode

Return results directly to Claude for analysis.

### Direct File Queries

```json
{"query": "SELECT * FROM 'data.csv' LIMIT 10"}
```
### Multi-Source Joins

```json
{
  "query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
  "sources": [
    {"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
    {"type": "file", "alias": "products", "path": "/data/products.csv"}
  ]
}
```
## Write Mode

Export query results to files. Add an `output` object to write instead of returning data.

### Basic Write

```json
{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet"
  }
}
```
### Write with Options

```json
{
  "query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
  "output": {
    "path": "D:/output/events/",
    "format": "parquet",
    "options": {
      "compression": "zstd",
      "partition_by": ["year", "month"],
      "overwrite": true
    }
  }
}
```
### Output Formats

| Format | Options |
|---|---|
| `parquet` | `compression` (zstd/snappy/gzip/lz4), `partition_by`, `row_group_size` |
| `csv` | `header` (default: true), `delimiter`, `compression`, `partition_by` |
| `json` | `array` (true = JSON array, false = newline-delimited) |
### Write Response

The response includes verification info, so no follow-up queries are needed:

```json
{
  "success": true,
  "output_path": "D:/output/events/",
  "format": "parquet",
  "rows_written": 15234,
  "files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
  "total_size_bytes": 5678901,
  "duration_ms": 1234
}
```
### Overwrite Protection

By default, existing files are not overwritten. Set `options.overwrite: true` to allow.
## Request/Response Format

### Request

```json
{
  "query": "SQL statement",
  "sources": [...],
  "output": {"path": "...", "format": "..."},
  "options": {"max_rows": 200, "format": "markdown"},
  "secrets_file": "path/to/secrets.yaml"
}
```
### Query Mode Options

- `max_rows`: Maximum rows to return (default: 200)
- `max_bytes`: Maximum response size (default: 200000)
- `format`: `markdown` (default), `json`, `records`, or `csv`
### Query Mode Response (markdown)
| column1 | column2 |
|---|---|
| value1 | value2 |
### Query Mode Response (json)

```json
{
  "schema": [{"name": "col1", "type": "INTEGER"}],
  "rows": [[1, "value"]],
  "truncated": false,
  "warnings": [],
  "error": null
}
```
## Source Types

### File (auto-detects CSV, Parquet, JSON, Excel)

```json
{"type": "file", "alias": "data", "path": "/path/to/file.csv"}
```

- Glob patterns: `{"path": "/logs/**/*.parquet"}`
- Custom delimiter: `{"path": "/data/file.csv", "delimiter": "|"}`
### PostgreSQL

```json
{
  "type": "postgres", "alias": "users",
  "host": "host", "port": 5432, "database": "db",
  "user": "user", "password": "pass",
  "schema": "public", "table": "users"
}
```
### MySQL

```json
{
  "type": "mysql", "alias": "orders",
  "host": "host", "port": 3306, "database": "db",
  "user": "user", "password": "pass", "table": "orders"
}
```
### SQLite

```json
{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}
```
### S3

```json
{
  "type": "s3", "alias": "logs",
  "url": "s3://bucket/path/*.parquet",
  "aws_region": "us-east-1",
  "aws_access_key_id": "...", "aws_secret_access_key": "..."
}
```
## Transformations

See TRANSFORMS.md for advanced patterns including:

- PIVOT/UNPIVOT - Reshape data between wide and long formats
- Sampling - Random subsets with `USING SAMPLE n ROWS` or `SAMPLE 10%`
- Dynamic columns - `EXCLUDE`, `REPLACE`, `COLUMNS('pattern')`
- Window functions - Running totals, rankings, moving averages
- Date/time operations - Extraction, arithmetic, formatting
### Quick Examples

```sql
-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region

-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount

-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)

-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users
```
## Workflow

- Inspect schema: `DESCRIBE SELECT * FROM 'file.csv'`
- Preview data: `SELECT * FROM 'file.csv' LIMIT 5`
- Transform: Apply filters, joins, aggregations
- Export (optional): Add `output` to write results
## Error Handling

- If `error` is non-null: Check column names, verify paths
- If `truncated` is true: Use more aggregation or filters
- If a write fails with "exists": Set `options.overwrite: true`
## Secrets

Store credentials securely in YAML. See SECRETS.md for complete documentation.

```json
{
  "query": "SELECT * FROM customers LIMIT 10",
  "secrets_file": "D:/path/to/secrets.yaml",
  "sources": [{
    "type": "postgres", "alias": "customers",
    "secret": "my_postgres", "table": "customers"
  }]
}
```
Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.
## Related Skills

### Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Use when Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas.
### Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
### Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis.

### Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

### Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

### Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

### Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

### Anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
