Data Wrangler

by benchflow-ai


name: data-wrangler
description: >
  Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases,
  apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and
  optionally write results to files. Use when the user wants to: (1) Clean, filter,
  or transform data, (2) Join multiple data sources, (3) Convert between formats
  (CSV→Parquet, etc.), (4) Create partitioned datasets, (5) Sample large datasets,
  (6) Export query results. Prefer this over in-context reasoning for datasets with
  thousands of rows or complex transformations.

Data Wrangler

Transform and export data using DuckDB SQL.


Usage

IMPORTANT - Windows Shell Escaping:

  1. Always cd to the skill directory first
  2. Use double quotes for echo with escaped inner quotes (\")
  3. Use forward slashes in file paths
cd "<skill_directory>" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py
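The escaping rules above can be sidestepped on any platform by generating the stdin payload with `json.dumps` instead of hand-writing it. A minimal sketch (the file path is hypothetical; it assumes `query_duckdb.py` reads one JSON object on stdin, as the command above shows):

```python
import json

# Build the request as a dict and let json.dumps handle quoting,
# instead of hand-escaping inner quotes for the shell.
request = {"query": "SELECT * FROM 'D:/path/to/file.csv'"}
payload = json.dumps(request)

# `payload` is the text that gets piped to: uv run scripts/query_duckdb.py
print(payload)
```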

Explore Mode

Get schema, statistics, and a sample in one call. Use this before writing queries to understand the data structure.

{"mode": "explore", "path": "D:/data/sales.csv"}

Response:

{
  "file": "D:/data/sales.csv",
  "format": "csv",
  "row_count": 15234,
  "columns": [
    {"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
    {"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
  ],
  "sample": "| order_id | customer | ... |\n|----------|----------|-----|\n| 1001     | Alice    | ... |"
}

Options:

  • sample_rows: Number of sample rows (default: 10, max: 100)
  • sources: For database tables (same as query mode)
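As a sketch, an explore request that raises the sample size with the `sample_rows` option (path hypothetical):

```python
import json

# Explore request asking for a larger sample than the default 10 rows.
request = {"mode": "explore", "path": "D:/data/sales.csv", "sample_rows": 25}
print(json.dumps(request))
```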

Query Mode

Return results directly to Claude for analysis.

Direct File Queries

{"query": "SELECT * FROM 'data.csv' LIMIT 10"}

Multi-Source Joins

{
  "query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
  "sources": [
    {"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
    {"type": "file", "alias": "products", "path": "/data/products.csv"}
  ]
}

Write Mode

Export query results to files. Add an output object to write instead of returning data.

Basic Write

{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet"
  }
}

Write with Options

{
  "query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
  "output": {
    "path": "D:/output/events/",
    "format": "parquet",
    "options": {
      "compression": "zstd",
      "partition_by": ["year", "month"],
      "overwrite": true
    }
  }
}

Output Formats

| Format | Options |
|---|---|
| parquet | compression (zstd/snappy/gzip/lz4), partition_by, row_group_size |
| csv | header (default: true), delimiter, compression, partition_by |
| json | array (true = JSON array, false = newline-delimited) |
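For instance, a write request that exercises the json format's `array` option to produce newline-delimited JSON (paths hypothetical):

```python
import json

# Export active rows as newline-delimited JSON rather than a JSON array.
request = {
    "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
    "output": {
        "path": "D:/output/filtered.jsonl",
        "format": "json",
        "options": {"array": False},  # False -> one JSON object per line
    },
}
print(json.dumps(request))
```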

Write Response

The response includes verification info, so no follow-up queries are needed:

{
  "success": true,
  "output_path": "D:/output/events/",
  "format": "parquet",
  "rows_written": 15234,
  "files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
  "total_size_bytes": 5678901,
  "duration_ms": 1234
}

Overwrite Protection

By default, existing files are not overwritten. Set options.overwrite: true to allow.

Request/Response Format

Request

{
  "query": "SQL statement",
  "sources": [...],
  "output": {"path": "...", "format": "..."},
  "options": {"max_rows": 200, "format": "markdown"},
  "secrets_file": "path/to/secrets.yaml"
}

Query Mode Options

  • max_rows: Maximum rows to return (default: 200)
  • max_bytes: Maximum response size (default: 200000)
  • format: markdown (default), json, records, or csv
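For example, a query-mode request that caps the result and switches from the default markdown to JSON output (file name and columns hypothetical):

```python
import json

# Cap returned rows and request JSON instead of a markdown table.
request = {
    "query": "SELECT customer, COUNT(*) AS orders FROM 'sales.csv' GROUP BY customer",
    "options": {"max_rows": 50, "format": "json"},
}
print(json.dumps(request))
```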

Query Mode Response (markdown)

| column1 | column2 |
|---|---|
| value1 | value2 |

Query Mode Response (json)

{
  "schema": [{"name": "col1", "type": "INTEGER"}],
  "rows": [[1, "value"]],
  "truncated": false,
  "warnings": [],
  "error": null
}

Source Types

File (auto-detects CSV, Parquet, JSON, Excel)

{"type": "file", "alias": "data", "path": "/path/to/file.csv"}

Glob patterns: {"path": "/logs/**/*.parquet"}

Custom delimiter: {"path": "/data/file.csv", "delimiter": "|"}

PostgreSQL

{
  "type": "postgres", "alias": "users",
  "host": "host", "port": 5432, "database": "db",
  "user": "user", "password": "pass",
  "schema": "public", "table": "users"
}

MySQL

{
  "type": "mysql", "alias": "orders",
  "host": "host", "port": 3306, "database": "db",
  "user": "user", "password": "pass", "table": "orders"
}

SQLite

{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}

S3

{
  "type": "s3", "alias": "logs",
  "url": "s3://bucket/path/*.parquet",
  "aws_region": "us-east-1",
  "aws_access_key_id": "...", "aws_secret_access_key": "..."
}

Transformations

See TRANSFORMS.md for advanced patterns including:

  • PIVOT/UNPIVOT - Reshape data between wide and long formats
  • Sampling - Random subsets with USING SAMPLE n ROWS or SAMPLE 10%
  • Dynamic columns - EXCLUDE, REPLACE, COLUMNS('pattern')
  • Window functions - Running totals, rankings, moving averages
  • Date/time operations - Extraction, arithmetic, formatting

Quick Examples

-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region

-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount

-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)

-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users

Workflow

  1. Inspect schema: DESCRIBE SELECT * FROM 'file.csv'
  2. Preview data: SELECT * FROM 'file.csv' LIMIT 5
  3. Transform: Apply filters, joins, aggregations
  4. Export (optional): Add output to write results
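The four steps above can be sketched as a sequence of requests (paths and column names hypothetical):

```python
import json

# 1. inspect, 2. preview, 3. transform, 4. export -- one request per step.
steps = [
    {"query": "DESCRIBE SELECT * FROM 'D:/data/file.csv'"},
    {"query": "SELECT * FROM 'D:/data/file.csv' LIMIT 5"},
    {"query": "SELECT status, COUNT(*) AS n FROM 'D:/data/file.csv' GROUP BY status"},
    {"query": "SELECT * FROM 'D:/data/file.csv' WHERE status = 'active'",
     "output": {"path": "D:/output/active.parquet", "format": "parquet"}},
]
for step in steps:
    print(json.dumps(step))
```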

Error Handling

  • If error is non-null: Check column names, verify paths
  • If truncated is true: Use more aggregation or filters
  • If write fails with "exists": Set options.overwrite: true

Secrets

Store credentials securely in YAML. See SECRETS.md for complete documentation.

{
  "query": "SELECT * FROM customers LIMIT 10",
  "secrets_file": "D:/path/to/secrets.yaml",
  "sources": [{
    "type": "postgres", "alias": "customers",
    "secret": "my_postgres", "table": "customers"
  }]
}

Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.


Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis

Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

Anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.

Skill Information

Category: Data
Last Updated: 1/22/2026