Dst Query
by mikkelkrogsholm
Execute SQL queries on Danmarks Statistik data stored in DuckDB. Use when user needs specific data analysis, filtering, aggregation, or joins. Also includes table summary functionality.
name: dst-query
description: Execute SQL queries on Danmarks Statistik data stored in DuckDB. Use when user needs specific data analysis, filtering, aggregation, or joins. Also includes table summary functionality.
DST Query Skill
Purpose
Execute SQL queries to analyze DST data stored in DuckDB. This is the core skill for data analysis - enabling filtering, aggregation, joins, and extracting insights from stored statistical data.
DST Data Patterns
Handling Suppressed Values
DST uses ".." to mark suppressed or confidential values. These must be handled before any numeric operation.
Filter approach:
SELECT * FROM dst_table
WHERE INDHOLD != '..' -- Filter out suppressed
Safe casting approach:
SELECT
TID,
CASE
WHEN INDHOLD != '..' THEN CAST(INDHOLD AS INTEGER)
ELSE NULL
END as value
FROM dst_table
Using helpers:
from scripts.db.helpers import safe_numeric_cast
query = f"SELECT TID, {safe_numeric_cast('INDHOLD')} as value FROM dst_table"
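The real helper lives in scripts.db.helpers; as a rough sketch of what such a helper might generate (the actual implementation may differ), it only needs to emit the same CASE expression shown above:

```python
def safe_numeric_cast(column: str, sql_type: str = "DOUBLE") -> str:
    """Return a SQL expression casting a DST value column to a number,
    mapping the '..' suppression marker to NULL."""
    return (
        f"CASE WHEN {column} != '..' "
        f"THEN CAST({column} AS {sql_type}) ELSE NULL END"
    )
```

Building the expression in Python keeps the suppression handling consistent across every query that touches a value column.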
Aggregate Codes
DST uses special codes for totals/aggregates:
- "TOT" - Total
- "I alt" - Total (Danish)
- "Drivmidler i alt" - All fuel types
- "IALT" - Total (alternative)
Filter them out when analyzing details:
SELECT * FROM dst_table
WHERE fuel_type NOT IN ('Drivmidler i alt', 'I alt')
AND gender NOT IN ('TOT', 'IALT')
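When post-processing query results in Python rather than SQL, the same exclusion can be applied with a small filter. The code set below collects the aggregate codes listed above; it is illustrative and should be extended per table:

```python
# Codes DST commonly uses for totals; extend per table as needed.
AGGREGATE_CODES = {"TOT", "IALT", "I alt", "Drivmidler i alt"}

def detail_rows(rows, column):
    """Keep only rows whose value in `column` is not an aggregate code."""
    return [r for r in rows if r.get(column) not in AGGREGATE_CODES]
```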
Time Format Handling
DST time codes are strings and do not always sort chronologically (especially when formats mix):
- "2024K1" - Quarter 1, 2024
- "2024M01" - January 2024
- "2024" - Year 2024
Extract for proper sorting:
-- For quarters
SELECT
TID,
CAST(SUBSTRING(TID, 1, 4) AS INTEGER) as year,
CAST(SUBSTRING(TID, 6, 1) AS INTEGER) as quarter
FROM dst_table
WHERE TID LIKE '%K%'
ORDER BY year, quarter
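The same year/period extraction can be done client-side with a sort key. This is a sketch assuming the three code formats above; note that mixing "K" and "M" codes in one sort would conflate quarter and month numbers:

```python
import re

_TID_RE = re.compile(r"^(\d{4})(?:([KM])(\d{1,2}))?$")

def dst_time_key(tid: str):
    """Sort key for DST time codes: '2024', '2024K1', '2024M01'."""
    m = _TID_RE.match(tid)
    if not m:
        raise ValueError(f"Unrecognised DST time code: {tid!r}")
    year = int(m.group(1))
    sub = int(m.group(3)) if m.group(3) else 0  # annual codes sort first
    return (year, sub)
```

Usage: `sorted(codes, key=dst_time_key)` orders quarterly or monthly codes chronologically within and across years.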
When to Use
- User asks analytical questions about the data
- Need to filter or aggregate data
- Joining multiple DST tables
- Extracting specific insights or trends
- Computing custom statistics
- Exploring table structure and contents (use table summary)
Table Summary
Purpose
Get a quick overview of table structure and statistics before detailed querying.
Usage
python scripts/db/table_summary.py --table-id <TABLE_ID>
When to Use
- Before writing complex queries
- Understanding table structure
- Checking available columns
- Seeing sample data
- Getting quick statistics
Output Includes
- Record count
- Column names and types
- Sample rows (first 5)
- Statistics for numeric columns (min, max, avg, median)
- Distinct value counts
- NULL counts
- Top values for categorical columns
Running SQL Queries
Basic Usage
Execute a SQL query:
python scripts/db/query_data.py --sql "<QUERY>"
Output Formats
Table format (default - console-friendly):
--format table
JSON format (for programmatic use):
--format json
CSV format (for exports):
--format csv
Save to File
Save query results:
--output <file>
Safety Limit
Add automatic LIMIT:
--limit 100
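A minimal sketch of how such a safety limit could be applied, assuming the flag simply appends a LIMIT clause when none is present (the real query_data.py may implement --limit differently):

```python
import re

def apply_safety_limit(sql: str, limit: int = 100) -> str:
    """Append a LIMIT clause unless the query already contains one."""
    if re.search(r"\bLIMIT\b", sql, re.IGNORECASE):
        return sql
    return f"{sql.rstrip().rstrip(';')} LIMIT {limit}"
```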
Table Naming Convention
All DST tables in DuckDB follow this pattern:
- Format: dst_{table_id} (lowercase)
- Example: FOLK1A → dst_folk1a
- Example: AUP01 → dst_aup01
Important: Always use lowercase in queries.
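The convention is mechanical, so it is easy to encode once and reuse, avoiding case mistakes in generated queries:

```python
def duckdb_table_name(table_id: str) -> str:
    """Map a DST table ID (e.g. 'FOLK1A') to its DuckDB table name."""
    return f"dst_{table_id.lower()}"
```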
Data Format in DuckDB
Tables stored from DST API use these conventions:
- Separator: Data was fetched as semicolon-separated CSV (;)
- Encoding: UTF-8 with BOM (handled automatically)
- Column names: Based on variable IDs from tableinfo
- Value codes: Exact codes from DST (e.g., "000", "101", "2024K1")
- Data types: DuckDB infers types (usually strings for codes, numeric for values)
Common Query Patterns
1. Explore Data
SELECT * FROM dst_folk1a LIMIT 10
2. Count Records
SELECT COUNT(*) FROM dst_folk1a
3. Check Column Structure
-- See what columns exist
DESCRIBE dst_folk1a;
-- Or use table summary (recommended)
-- python scripts/db/table_summary.py --table-id FOLK1A
4. Aggregation
SELECT region, SUM(population) as total_pop
FROM dst_folk1a
GROUP BY region
ORDER BY total_pop DESC
5. Time Series Analysis
-- Note: Time codes from DST (e.g., "2024K1" for Q1 2024)
SELECT tid, value
FROM dst_folk1a
WHERE område = '000' -- Whole country
ORDER BY tid
6. Filtering with DST Codes
-- Use exact codes from tableinfo
SELECT *
FROM dst_folk1a
WHERE tid LIKE '2024%' -- All 2024 periods
AND område IN ('000', '101') -- Denmark and Copenhagen
AND køn IN ('1', '2') -- Men and women (not 'TOT')
7. Multiple Aggregations
SELECT
område,
COUNT(*) as record_count,
AVG(CAST(indhold AS DOUBLE)) as avg_value,
MAX(CAST(indhold AS DOUBLE)) as max_value
FROM dst_folk1a
GROUP BY område
8. Join Tables
SELECT
a.område,
a.indhold as population,
b.indhold as employment
FROM dst_folk1a a
JOIN dst_aup01 b ON a.område = b.område AND a.tid = b.tid
WHERE a.tid = '2024K1'
9. Percentages
SELECT
område,
CAST(indhold AS DOUBLE) as value,
100.0 * CAST(indhold AS DOUBLE) / SUM(CAST(indhold AS DOUBLE)) OVER () as percentage
FROM dst_folk1a
WHERE tid = '2024K1' AND køn = 'TOT'
10. Latest Period Analysis
-- Find most recent quarter
WITH latest AS (
SELECT MAX(tid) as max_tid FROM dst_folk1a
)
SELECT område, SUM(CAST(indhold AS DOUBLE)) as total
FROM dst_folk1a
WHERE tid = (SELECT max_tid FROM latest)
GROUP BY område
ORDER BY total DESC
Best Practices
Query Development
- Start with table summary to understand structure
- Use LIMIT for exploratory queries
- Build incrementally - test small queries first
- Check record counts before expensive operations
Performance
- Use WHERE clauses to filter early
- Add indexes if querying repeatedly (advanced)
- Aggregate before joining when possible
- Be mindful of large result sets
Safety
- Queries are READ-ONLY (SELECT only)
- Cannot modify data (no INSERT/UPDATE/DELETE)
- Cannot alter schema (no DROP/CREATE/ALTER)
- Script validates queries before execution
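As an illustration of the kind of read-only check such a script might perform (the actual validation logic in query_data.py may differ), a simple keyword-based gate looks like this:

```python
import re

# Keywords that indicate a write or schema change.
_FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|ATTACH|COPY|TRUNCATE)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """True if the statement looks like a plain SELECT (or WITH ... SELECT)."""
    stripped = sql.strip().rstrip(";")
    starts_ok = stripped.upper().startswith(("SELECT", "WITH"))
    return starts_ok and not _FORBIDDEN.search(stripped)
```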
Data Quality
- Handle NULL values explicitly
- Use COALESCE for NULL handling
- Verify data types before operations
- Check for duplicates if unexpected
Understanding Data Freshness
Before analyzing, check when data was last updated:
SELECT table_id, last_updated, row_count
FROM dst_metadata
WHERE table_id = 'FOLK1A'
Recommendations:
- Check freshness before major analysis
- Re-fetch if data is stale (use dst-check-freshness skill)
- Note DST update frequency varies by table
- Some tables update quarterly, others monthly or annually
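A staleness check can be scripted around the dst_metadata query above. This sketch assumes last_updated is an ISO-8601 timestamp string; the actual column format in dst_metadata may differ:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated: str, max_age_days: int = 90) -> bool:
    """True if the table's last update is older than max_age_days."""
    updated = datetime.fromisoformat(last_updated)
    if updated.tzinfo is None:
        updated = updated.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - updated > timedelta(days=max_age_days)
```

Tune max_age_days to the table's known update frequency (quarterly tables tolerate a longer threshold than monthly ones).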
Examples
Example 1: Get table summary
python scripts/db/table_summary.py --table-id FOLK1A
Example 2: Simple exploration
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a LIMIT 5"
Example 3: Aggregation by year
python scripts/db/query_data.py --sql "SELECT year, SUM(population) as total FROM dst_folk1a GROUP BY year ORDER BY year"
Example 4: Regional analysis
python scripts/db/query_data.py --sql "SELECT region, AVG(value) as avg_val FROM dst_folk1a WHERE year >= 2020 GROUP BY region"
Example 5: Export to CSV
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a" --format csv --output results.csv
Example 6: JSON output
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a LIMIT 100" --format json --output data.json
Example 7: With safety limit
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a" --limit 1000
Advanced Queries
Window Functions
SELECT
year,
value,
LAG(value) OVER (ORDER BY year) as prev_year_value,
value - LAG(value) OVER (ORDER BY year) as change
FROM dst_folk1a
WHERE region = '000'
ORDER BY year
Pivoting Data
SELECT
region,
MAX(CASE WHEN year = 2023 THEN value END) as val_2023,
MAX(CASE WHEN year = 2024 THEN value END) as val_2024
FROM dst_folk1a
GROUP BY region
Percentiles
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) as p25,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY value) as median,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75
FROM dst_folk1a
Complex Filtering
SELECT *
FROM dst_folk1a
WHERE year BETWEEN 2020 AND 2024
AND region IN (SELECT DISTINCT region FROM dst_aup01 WHERE employment > 1000)
AND value IS NOT NULL
ORDER BY value DESC
LIMIT 100
Tips
Before Querying
- Run table summary first to see structure
- Check column names and types (often Danish: område, tid, køn, etc.)
- Review sample data to understand value formats
- Verify table has data and check last_updated in dst_metadata
- Note: Column names are lowercase variable IDs from DST
Working with DST Data Codes
- Time codes: "2024K1" (quarterly), "2024M01" (monthly), "2024" (annual)
- Geographic codes: "000" (whole country), "101" (Copenhagen), etc.
- Aggregate codes: "TOT" and "IALT" often represent totals
- Use LIKE for pattern matching on time periods: tid LIKE '2024%'
- Cast when needed: value columns may be strings: CAST(indhold AS DOUBLE)
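A small helper can classify a time code's frequency before choosing LIKE patterns or sort logic; this sketch assumes only the three code formats listed above:

```python
def tid_frequency(tid: str) -> str:
    """Classify a DST time code by reporting frequency."""
    if "K" in tid:
        return "quarterly"
    if "M" in tid:
        return "monthly"
    return "annual"
```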
Query Writing
- Use table aliases for clarity (a, b, etc.)
- Format SQL for readability
- Comment complex queries
- Test with LIMIT first
- Handle Danish characters properly (æ, ø, å)
Analysis Workflow
- Understand: Get table summary
- Check freshness: Query dst_metadata for last_updated
- Explore: Simple SELECT with LIMIT
- Filter: Add WHERE clauses with exact DST codes
- Aggregate: Use GROUP BY (cast numeric columns first)
- Refine: Add ORDER BY, calculations
- Export: Save final results
Performance Tips
- Filter first, aggregate second
- Use specific columns, not SELECT *
- Add LIMIT for large tables
- Use indexes on commonly filtered columns (advanced)
- Consider creating views for repeated queries (advanced)
- Cache results locally if running same query multiple times
Troubleshooting
"Table not found"
- Verify the table exists: python scripts/db/list_tables.py (dst-list-tables skill)
- Check that the table name is lowercase
- Ensure the format: dst_{table_id}
- Example: FOLK1A becomes dst_folk1a
"Column not found"
- Run table summary to see actual column names
- Column names are lowercase DST variable IDs
- Common columns: område (region), tid (time), køn (gender), indhold (value)
- Check spelling, including Danish characters (æ, ø, å)
- Verify column exists in that specific table
Data Type Issues
- Value columns are often stored as strings (e.g., indhold)
- Cast to numeric for calculations: CAST(indhold AS DOUBLE)
- Time codes are strings: use LIKE for patterns
- Don't assume numeric types without checking
Unexpected Results
- Empty results: Check if data was actually fetched for that table
- Wrong aggregations: Verify you're filtering out 'TOT' codes if needed
- Time ordering issues: Time codes as strings may not sort chronologically
- Solution: Extract year/quarter or use CASE statements
- Duplicate rows: Table may have multiple dimensions - check GROUP BY
Large Result Sets
- Add LIMIT clause for exploration
- Use aggregation to reduce rows
- Export to file instead of console: --output results.csv
- Check row count first: SELECT COUNT(*) FROM table
Slow Queries
- Check WHERE filters are effective
- Filter by indexed columns (primary keys)
- Simplify joins
- Reduce columns selected
- Check data size with COUNT first
- Avoid SELECT * on large tables
Query Syntax Errors
- Verify SQL syntax (DuckDB follows PostgreSQL conventions)
- Check quotes: use single quotes for string literals
- Danish characters: ensure UTF-8 encoding
- Test simple version first
- Review error message carefully
Character Encoding Issues
- DuckDB handles UTF-8 automatically
- If seeing odd characters, verify terminal encoding
- CSV exports preserve Danish characters (æ, ø, å)
Common Workflows
Workflow 1: Explore New Table
# 1. Get summary
python scripts/db/table_summary.py --table-id FOLK1A
# 2. See sample data
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a LIMIT 5"
# 3. Check record count
python scripts/db/query_data.py --sql "SELECT COUNT(*) FROM dst_folk1a"
# 4. Explore key dimensions
python scripts/db/query_data.py --sql "SELECT DISTINCT region FROM dst_folk1a"
Workflow 2: Trend Analysis
# 1. Get summary statistics
python scripts/db/table_summary.py --table-id FOLK1A
# 2. Query time series
python scripts/db/query_data.py --sql "SELECT year, SUM(value) as total FROM dst_folk1a GROUP BY year ORDER BY year"
# 3. Calculate growth
python scripts/db/query_data.py --sql "SELECT year, value, value - LAG(value) OVER (ORDER BY year) as growth FROM dst_folk1a WHERE region = '000'"
# 4. Export results
python scripts/db/query_data.py --sql "..." --format csv --output analysis.csv
Workflow 3: Compare Regions
# 1. Get regional breakdown
python scripts/db/query_data.py --sql "SELECT region, COUNT(*) as records, AVG(value) as avg_value FROM dst_folk1a GROUP BY region ORDER BY avg_value DESC"
# 2. Top regions
python scripts/db/query_data.py --sql "SELECT region, SUM(value) as total FROM dst_folk1a WHERE year = 2024 GROUP BY region ORDER BY total DESC LIMIT 10"
# 3. Compare specific regions
python scripts/db/query_data.py --sql "SELECT year, region, value FROM dst_folk1a WHERE region IN ('000', '101', '147') ORDER BY year, region"
SQL Reference
Useful DuckDB Functions
Aggregation:
- COUNT, SUM, AVG, MIN, MAX
- MEDIAN, STDDEV, VARIANCE
- STRING_AGG (concatenate strings)
String Functions:
- UPPER, LOWER, TRIM
- SUBSTRING, CONCAT
- LIKE, ILIKE (case-insensitive)
Date Functions:
- CURRENT_DATE, CURRENT_TIMESTAMP
- DATE_DIFF, DATE_ADD
- EXTRACT (year, month, day)
Window Functions:
- ROW_NUMBER, RANK, DENSE_RANK
- LAG, LEAD
- FIRST_VALUE, LAST_VALUE
Conditional:
- CASE WHEN ... THEN ... END
- COALESCE (handle NULLs)
- NULLIF
Best Practices Summary
- Always start with table summary
- Use LIMIT during development
- Build queries incrementally
- Handle NULLs explicitly
- Use clear aliases and formatting
- Test before running on full dataset
- Export results for further analysis
- Document complex queries
