---
name: duckdb-data-explorer
description: This skill should be used when performing local data exploration, profiling, quality analysis, or transformation tasks using DuckDB. It handles CSV, Parquet, and JSON files, provides automated data quality reports, supports complex JSON transformations, and generates interactive HTML reports for data analysis.
---
DuckDB Data Explorer
Overview
This skill enables comprehensive local data exploration using DuckDB, supporting automated data profiling, quality analysis, and transformation workflows for CSV, Parquet, and JSON files. It provides reusable scripts for common data tasks, reference patterns for complex queries, and HTML report generation for interactive data visualization.
Quick Start
Basic Data Profiling
To quickly analyze a data file and generate a quality report:
1. Use scripts/data_profiler.py to profile the file:
   python scripts/data_profiler.py data.csv --output profile.json
2. Generate an HTML report:
   python scripts/html_report_generator.py profile.json report.html
3. Open the HTML report to view data quality metrics, null analysis, and sample data.
JSON Data Transformation
For complex JSON handling and transformation:
1. Analyze JSON structure:
   python scripts/json_transformer.py structure data.json
2. Transform JSON data:
   python scripts/json_transformer.py transform "*.json" "SELECT json_extract(data, '$.user.name') as name FROM json_data" --output transformed.parquet
Core Capabilities
1. Data Profiling and Quality Analysis
Use scripts/data_profiler.py for automated data quality assessment:
- Null analysis: Percentage of null values per column
- Data type detection: Automatic type inference and validation
- Uniqueness analysis: Distinct value counts and percentages
- Sample data: First 10 rows for quick inspection
- Summary statistics: Row counts, column counts, file type detection
When to use: Initial data exploration, data quality assessment, before data cleaning or transformation.
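The profiler encapsulates these checks, but the same metrics can also be computed directly with DuckDB's Python API when a custom profile is needed. A minimal sketch, assuming an illustrative CSV file name; it is not the script's actual implementation:

import duckdb

con = duckdb.connect()  # in-memory database

# Expose the file as a view so every check runs against the same source
con.execute("CREATE VIEW data AS SELECT * FROM read_csv_auto('data.csv')")
total_rows = con.execute("SELECT count(*) FROM data").fetchone()[0]

# DESCRIBE returns one row per column: (column_name, column_type, ...)
for name, col_type, *_ in con.execute("DESCRIBE data").fetchall():
    nulls, distinct = con.execute(
        f'SELECT count(*) FILTER (WHERE "{name}" IS NULL), '
        f'count(DISTINCT "{name}") FROM data'
    ).fetchone()
    print(f"{name} ({col_type}): {100 * nulls / max(total_rows, 1):.1f}% null, "
          f"{distinct} distinct values")

# Sample rows for quick inspection
print(con.execute("SELECT * FROM data LIMIT 10").fetchall())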
2. Complex JSON Handling
Use scripts/json_transformer.py for advanced JSON operations:
- Structure analysis: Understand nested JSON schemas
- Pattern matching: Extract specific fields from complex JSON
- Array operations: Flatten and transform JSON arrays
- Multi-file processing: Handle glob patterns like *.json
- Export options: Output to Parquet or JSON formats
When to use: Working with nested JSON data, API responses, log files, or document databases.
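Under the hood, this style of flattening maps onto DuckDB's JSON reader and extraction functions. A hedged sketch of the same idea using plain DuckDB from Python; the file path and field names are illustrative, and the query is not taken from json_transformer.py itself:

import duckdb

# read_json_objects returns each document as a raw JSON value in a column named "json"
duckdb.sql("""
    SELECT
        json_extract_string(json, '$.user.name')           AS user_name,
        json_extract_string(json, '$.user.address.city')   AS city,
        json_array_length(json_extract(json, '$.events'))  AS event_count
    FROM read_json_objects('api_responses.json')
""").show()

To persist such a result as Parquet, wrap the query in COPY (...) TO 'transformed.parquet' (FORMAT PARQUET).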
3. Interactive HTML Reports
Use scripts/html_report_generator.py to create visual data exploration reports:
- Summary dashboards: Key metrics at a glance
- Column analysis: Detailed null percentage bars with color coding
- Sample data viewer: Scrollable data tables with copy functionality
- Quality indicators: Visual quality scores for each column
- Responsive design: Works on desktop and mobile devices
When to use: Sharing data insights, creating documentation, or interactive data exploration.
4. DuckDB Query Patterns
Reference references/duckdb_patterns.md for common query patterns:
- File reading: CSV, Parquet, JSON ingestion patterns
- Data quality: Null analysis, type validation, duplicate detection
- Transformation: String operations, date handling, JSON extraction
- Performance: Optimization tips and best practices
When to use: Writing custom DuckDB queries, optimizing performance, learning DuckDB syntax.
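The reference file itself is not reproduced here, but the categories above correspond to well-known DuckDB idioms. A few illustrative examples (file and column names are assumptions):

import duckdb

con = duckdb.connect()

# File reading: ingest every CSV matching a glob; filename=true records the source file
con.sql("SELECT * FROM read_csv_auto('data/*.csv', filename = true) LIMIT 5").show()

# Data quality: duplicate detection on a business key
con.sql("""
    SELECT customer_id, count(*) AS copies
    FROM read_parquet('customer_data.parquet')
    GROUP BY customer_id
    HAVING count(*) > 1
    ORDER BY copies DESC
""").show()

# Type validation: read everything as text, then flag values that do not cast cleanly
con.sql("""
    SELECT amount
    FROM read_csv_auto('data/sales.csv', all_varchar = true)
    WHERE amount IS NOT NULL AND try_cast(amount AS DOUBLE) IS NULL
""").show()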
5. JSON Function Reference
Reference references/json_functions.md for comprehensive JSON function documentation:
- Extraction functions: json_extract, json_extract_string, etc.
- Array operations: json_array_length, json_contains, etc.
- Structure analysis: json_structure, json_type, etc.
- Advanced patterns: Nested object handling, conditional operations
When to use: Complex JSON transformations, API data processing, nested data extraction.
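These functions are easiest to see against a small inline document. A self-contained sketch (the document is made up for illustration):

import duckdb

doc = '{"user": {"id": 7, "tags": ["a", "b"]}, "active": true}'

# json_structure sketches the shape of the whole document; json_type inspects a single path
duckdb.sql(f"""
    SELECT
        json_structure('{doc}')                         AS shape,
        json_type('{doc}', '$.user.tags')               AS tags_type,
        json_extract('{doc}', '$.user.id')              AS user_id,
        json_extract_string('{doc}', '$.user.tags[0]')  AS first_tag,
        json_array_length('{doc}', '$.user.tags')       AS tag_count
""").show()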
6. Data Quality Framework
Reference references/data_quality_checks.md for comprehensive quality assessment:
- Null analysis: Multi-dimensional null value reporting
- Type validation: Consistency checking and conflict detection
- Outlier detection: Statistical methods for anomaly identification
- Duplicate analysis: Record and column-level duplication detection
- Quality scoring: Automated quality assessment metrics
When to use: Data quality audits, data cleaning workflows, data validation pipelines.
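The reference file covers these checks in depth; as an indication of the flavor, here is a hedged sketch of two of them, IQR-based outlier detection and a simple completeness score, with illustrative file and column names:

import duckdb

con = duckdb.connect()

# Outlier detection: flag rows outside 1.5 * IQR on a numeric column
con.sql("""
    WITH bounds AS (
        SELECT quantile_cont(amount, 0.25) AS q1,
               quantile_cont(amount, 0.75) AS q3
        FROM read_parquet('customer_data.parquet')
    )
    SELECT d.*
    FROM read_parquet('customer_data.parquet') AS d, bounds
    WHERE d.amount < bounds.q1 - 1.5 * (bounds.q3 - bounds.q1)
       OR d.amount > bounds.q3 + 1.5 * (bounds.q3 - bounds.q1)
""").show()

# Quality scoring: completeness as the share of non-null values per column
con.sql("""
    SELECT
        100.0 * count(email) / count(*)       AS email_completeness_pct,
        100.0 * count(signup_date) / count(*) AS signup_date_completeness_pct
    FROM read_parquet('customer_data.parquet')
""").show()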
Workflow Examples
Example 1: Initial Data Exploration
# Profile a new dataset
python scripts/data_profiler.py sales_data.csv --output sales_profile.json
# Generate interactive report
python scripts/html_report_generator.py sales_profile.json sales_report.html
# Open report for exploration
open sales_report.html
Example 2: JSON Data Transformation
# Analyze JSON structure
python scripts/json_transformer.py structure api_responses.json
# Transform and flatten JSON data
python scripts/json_transformer.py transform "logs/*.json" \
"SELECT
json_extract(data, '$.timestamp') as timestamp,
json_extract(data, '$.user.id') as user_id,
json_extract(data, '$.event.type') as event_type
FROM json_data" \
--output cleaned_logs.parquet
Example 3: Data Quality Assessment
# Profile data for quality issues
python scripts/data_profiler.py customer_data.parquet --output quality_profile.json
# Generate detailed quality report
python scripts/html_report_generator.py quality_profile.json quality_report.html
# Use reference queries for deeper analysis
duckdb :memory: "SELECT * FROM read_parquet('customer_data.parquet') LIMIT 10"
File Type Support
CSV Files
- Auto-detection of delimiters, headers, and data types
- Custom configuration support for non-standard formats
- Multiple file ingestion with glob patterns
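For non-standard CSVs, the auto-detection can be overridden explicitly. A hedged example (the path, delimiter, and column layout are assumptions):

import duckdb

# Pipe-delimited export without a header row, with the schema declared up front
duckdb.sql("""
    SELECT *
    FROM read_csv('exports/legacy_*.txt',
                  delim = '|',
                  header = false,
                  columns = {'id': 'INTEGER', 'label': 'VARCHAR', 'score': 'DOUBLE'})
    LIMIT 10
""").show()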
Parquet Files
- Columnar format optimization
- Schema preservation and type inference
- Efficient handling of large datasets
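Because Parquet is columnar, DuckDB only reads the columns a query references, which is what makes large multi-file datasets manageable. For example (paths and columns are illustrative):

import duckdb

# Inspect the schema without scanning the data
duckdb.sql("DESCRIBE SELECT * FROM read_parquet('events/*.parquet')").show()

# Aggregate across many files; only the event_type column needs to be read
duckdb.sql("""
    SELECT event_type, count(*) AS n
    FROM read_parquet('events/*.parquet')
    GROUP BY event_type
    ORDER BY n DESC
""").show()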
JSON Files
- Auto-detection of JSON structure
- Support for nested objects and arrays
- Multi-file processing with glob patterns
- Complex JSON function support
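When the structure is regular, read_json_auto maps nested objects to STRUCT columns and arrays to LIST columns, which can then be queried directly or unnested. A hedged sketch (path and field names are assumptions):

import duckdb

# customer is inferred as a STRUCT, events as a LIST; unnest() yields one row per event
duckdb.sql("""
    SELECT
        customer.id     AS customer_id,
        unnest(events)  AS event
    FROM read_json_auto('sessions/*.json')
""").show()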
Integration Patterns
With PostgreSQL Export
When exporting to PostgreSQL for analysis:
- Profile before export: Use data profiler to understand data quality
- Transform to Parquet: Clean and transform data locally first
- Export to PostgreSQL: Use DuckDB's PostgreSQL extension or export Parquet (see the sketch after this list)
- Validate export: Compare profiles before and after export
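A minimal sketch of the transform and export steps above, assuming DuckDB's postgres extension is available; the connection string, file names, and table names are illustrative:

import duckdb

con = duckdb.connect()

# Transform to Parquet: clean locally and keep the intermediate file for validation
con.execute("""
    COPY (SELECT * FROM read_csv_auto('customer_data.csv') WHERE email IS NOT NULL)
    TO 'customers_clean.parquet' (FORMAT PARQUET)
""")

# Export to PostgreSQL: attach the target database and write the cleaned data into it
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute("ATTACH 'dbname=analytics host=localhost user=analyst' AS pg (TYPE postgres)")
con.execute("""
    CREATE TABLE pg.public.customers_clean AS
    SELECT * FROM read_parquet('customers_clean.parquet')
""")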
Batch Processing
For processing multiple files:
# Process all CSV files in directory
for file in data/*.csv; do
python scripts/data_profiler.py "$file" --output "profiles/$(basename "$file" .csv).json"
python scripts/html_report_generator.py "profiles/$(basename "$file" .csv).json" "reports/$(basename "$file" .csv).html"
done
Resources
scripts/
Executable Python scripts for data operations:
- data_profiler.py: Automated data quality analysis and profiling
- json_transformer.py: Complex JSON handling and transformation utilities
- html_report_generator.py: Interactive HTML report generation
references/
Comprehensive documentation for DuckDB operations:
- duckdb_patterns.md: Common query patterns and best practices
- json_functions.md: Complete JSON function reference with examples
- data_quality_checks.md: Data quality assessment frameworks and queries
assets/
Templates and resources for output generation:
- report_template.html: Interactive HTML template for data exploration reports
Best Practices
- Start with profiling: Always profile data before transformation
- Use HTML reports: Generate interactive reports for better insights
- Leverage patterns: Use reference patterns for common operations
- Validate transformations: Profile data before and after transformations
- Handle nulls explicitly: Use DuckDB's null handling functions
- Optimize queries: Use column pruning and early filtering
- Document processes: Keep track of transformations for reproducibility
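As a small illustration of the null-handling and query-optimization practices above, a hedged sketch with illustrative file and column names:

import duckdb

# coalesce() makes the null handling explicit; count(col) ignores nulls, count(*) does not.
# Only the referenced columns are read, and the date filter is applied as early as possible.
duckdb.sql("""
    SELECT
        coalesce(region, 'unknown') AS region,
        count(*)                    AS rows_total,
        count(amount)               AS rows_with_amount
    FROM read_parquet('sales/*.parquet')
    WHERE sale_date >= DATE '2024-01-01'
    GROUP BY 1
""").show()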
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
