Preferences Json Querying
by cameronraysmith
JSON querying tools and patterns using DuckDB and jaq for data extraction and transformation. Load when querying JSON data or writing data pipelines.
Skill Details
Repository Files
1 file in this skill directory
name: preferences-json-querying description: JSON querying tools and patterns using DuckDB and jaq for data extraction and transformation. Load when querying JSON data or writing data pipelines.
JSON querying
Tool selection
Choose the right tool based on your task's characteristics.
| Task | Prefer | Reason |
|---|---|---|
| Aggregations, joins, window functions | DuckDB | Full SQL analytical capabilities |
| Multi-file queries, cross-format joins | DuckDB | Native Parquet/CSV/JSON interop |
| Tabular output for analysis | DuckDB | Columnar result formatting |
| Extract/transform/filter → JSON output | jaq | Preserves JSON structure naturally |
| Unix pipelines, curl piping | jaq | Composable streaming, fast startup |
| Recursive key search in nested structures | jaq | .. operator for recursive descent |
| In-place JSON modification | jaq | Update operators (` |
| Memory-constrained streaming | jaq | Line-by-line NDJSON processing |
Decision heuristic
Use DuckDB when you think in SQL (SELECT, GROUP BY, JOIN). Use jaq when you think in transformations (map, filter, reshape).
DuckDB JSON patterns
Basic querying
DuckDB reads JSON files directly with automatic schema inference:
-- Auto-infer schema
SELECT * FROM read_json_auto('data.json');
-- Shorthand syntax
SELECT * FROM 'data.json';
-- Query remote JSON
SELECT * FROM read_json('https://api.example.com/data.json');
Nested structure handling
Unnest arrays and access nested fields using SQL:
-- Unnest array field
WITH events AS (
SELECT unnest(domainEvents) as event
FROM read_json_auto('workflow.json')
)
SELECT
event.id::VARCHAR as id,
event.description::VARCHAR as description,
event.version::INT as version
FROM events;
-- Access nested paths
SELECT
json_extract(data, '$.user.email') as email,
json_extract(data, '$.items[0].name') as first_item
FROM read_json_auto('nested.json');
Aggregations on JSON
SQL analytical functions work directly on JSON data:
SELECT
type,
COUNT(*) as count,
AVG(json_extract(data, '$.duration')::DOUBLE) as avg_duration
FROM read_json_auto('events/*.json')
GROUP BY type
ORDER BY count DESC;
Cross-format joins
Join JSON with other formats without intermediate steps:
SELECT
j.id,
j.name,
c.category_name,
p.metrics
FROM read_json_auto('entities.json') j
JOIN read_csv_auto('categories.csv') c ON j.category_id = c.id
JOIN read_parquet('metrics.parquet') p ON j.id = p.entity_id;
Output formats
Control output format from CLI:
# JSON array output
duckdb -json -c "SELECT * FROM 'data.json' LIMIT 10"
# Newline-delimited JSON
duckdb -c ".mode jsonlines" -c "SELECT * FROM 'data.json'"
# Export query results
duckdb -c "COPY (SELECT * FROM 'data.json' WHERE active) TO 'filtered.json'"
jaq patterns
jaq is a Rust reimplementation of jq, 5-10x faster with identical syntax.
Basic extraction
# Extract field
jaq '.name' data.json
# Extract from array
jaq '.items[].id' data.json
# Multiple fields
jaq '{id, name, status}' data.json
Filtering and selection
# Filter array elements
jaq '.events[] | select(.status == "active")' data.json
# Filter with multiple conditions
jaq '.items[] | select(.price > 100 and .inStock == true)' data.json
# Null-safe access
jaq '.optional?.nested?.field // "default"' data.json
Transformation
# Reshape structure
jaq '{
id: .id,
fullName: "\(.firstName) \(.lastName)",
tags: [.categories[].name]
}' data.json
# Map over arrays
jaq '.items | map({id, displayName: .name | ascii_upcase})' data.json
# Flatten nested arrays
jaq '[.groups[].members[]] | unique' data.json
Recursive descent
Find keys anywhere in nested structures:
# Find all "id" fields at any depth
jaq '.. | .id? // empty' deeply-nested.json
# Find all objects with specific key
jaq '.. | select(.type? == "error")' logs.json
In-place updates
Modify JSON while preserving structure:
# Update field
jaq '.version |= . + 1' config.json
# Add field
jaq '.metadata.updatedAt = "2025-01-05"' data.json
# Delete field
jaq 'del(.internal)' data.json
# Conditional update
jaq '(.items[] | select(.status == "pending")).status = "processed"' data.json
Pipeline composition
jaq composes naturally with Unix tools:
# API response processing
curl -s api.example.com/events | jaq '.data[] | select(.severity == "high") | .id'
# Multi-stage transformation
jaq '.raw' input.json | jaq 'map(select(.valid))' | jaq 'sort_by(.timestamp)'
# Combine with other tools
jaq -r '.items[].url' data.json | xargs -P4 curl -s
Streaming NDJSON
Process large newline-delimited JSON without loading into memory:
# Filter streaming logs
jaq 'select(.level == "error")' massive.ndjson
# Extract and count
jaq '.event_type' events.ndjson | sort | uniq -c | sort -rn
Common recipes
Inspect JSON structure
# DuckDB: show inferred schema
duckdb -c "DESCRIBE SELECT * FROM 'data.json'"
# jaq: show top-level keys
jaq 'keys' data.json
# jaq: show structure with types
jaq 'paths | map(tostring) | join(".")' data.json | head -20
Count and summarize
# DuckDB: count by field
duckdb -c "SELECT type, COUNT(*) FROM 'events.json' GROUP BY type"
# jaq: count array length
jaq '.items | length' data.json
# jaq: group and count
jaq 'group_by(.type) | map({type: .[0].type, count: length})' data.json
Extract unique values
# DuckDB
duckdb -c "SELECT DISTINCT category FROM 'products.json'"
# jaq
jaq '[.products[].category] | unique' data.json
Flatten nested arrays
# DuckDB
duckdb -c "
SELECT unnest(tags) as tag, COUNT(*) as count
FROM read_json_auto('items.json')
GROUP BY tag
"
# jaq
jaq '[.items[].tags[]] | group_by(.) | map({tag: .[0], count: length})' data.json
Join data from multiple files
# DuckDB: SQL join
duckdb -c "
SELECT a.*, b.details
FROM 'main.json' a
JOIN 'lookup.json' b ON a.ref_id = b.id
"
# jaq: manual join using INDEX
jaq -s '
(.[1] | INDEX(.id)) as \$lookup |
.[0].items | map(. + {details: \$lookup[.ref_id].details})
' main.json lookup.json
Performance considerations
DuckDB strengths
DuckDB excels when:
- Query touches subset of columns (columnar storage)
- Aggregations span large datasets
- Multiple queries against same data (keeps schema cached)
- Complex joins or window functions needed
jaq strengths
jaq excels when:
- Processing many small files (5ms startup vs DuckDB's overhead)
- Streaming large NDJSON (constant memory)
- Output must remain valid JSON
- Simple transformations in shell pipelines
Large file strategies
For files over 100MB:
- DuckDB: use explicit schema for faster parsing
- jaq: prefer NDJSON format for streaming
- Both: consider converting to Parquet for repeated queries
# Convert JSON to Parquet for repeated analysis
duckdb -c "COPY (SELECT * FROM 'large.json') TO 'large.parquet'"
# Then query Parquet (much faster)
duckdb -c "SELECT * FROM 'large.parquet' WHERE condition"
Tool availability
Both tools are available in the nix configuration:
duckdbviadatabase-packages.nixjaqviadevelopment-packages.nixjqviaprograms.jq.enable(jaq is drop-in compatible)
jaq uses identical syntax to jq but runs 5-10x faster. Existing jq scripts work with jaq without modification.
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
Xlsx
Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.
