Dst Join Analysis
by mikkelkrogsholm
|
Skill Details
Repository Files
1 file in this skill directory
name: dst-join-analysis description: | Perform SQL joins and multi-table analysis on DST data in DuckDB. Use when research requires combining multiple tables on common dimensions (time, region). Provides patterns for common DST dimension joins and multi-table comparisons.
DST Multi-Table Join Analysis
Combine and analyze multiple Danmarks Statistik tables stored in DuckDB.
Common DST Dimensions
DST tables often share these dimensions:
1. Time Dimensions
tid- Time period (year, quarter, month)- Common formats: '2023', '2023K1', '2023M01'
- All tables with time series can join on this
2. Regional Dimensions
område- Geographic area (hele landet, regions, municipalities)region- Region codeskommune- Municipality codes
3. Demographic Dimensions
køn- Gender (M/K/Total)alder- Age groupsherkomst- Origin/ethnicity
Join Patterns
Pattern 1: Time-Series Join
When both tables have time dimension:
-- Example: Join population (FOLK1A) with births (FOD)
SELECT
f.tid AS year,
f.indhold AS population,
b.indhold AS births,
ROUND(b.indhold::FLOAT / f.indhold * 1000, 2) AS birth_rate_per_1000
FROM dst_folk1a f
INNER JOIN dst_fod b ON f.tid = b.tid
WHERE f.område = 'Hele landet'
AND b.område = 'Hele landet'
ORDER BY f.tid;
Use when: Comparing two time-series indicators
Pattern 2: Regional Comparison
When both tables have regional breakdown:
-- Example: Compare population (FOLK1A) across regions
SELECT
t1.område AS region,
t1.tid AS year,
t1.indhold AS population_2020,
t2.indhold AS population_2023,
ROUND((t2.indhold::FLOAT - t1.indhold) / t1.indhold * 100, 2) AS growth_pct
FROM dst_folk1a t1
INNER JOIN dst_folk1a t2
ON t1.område = t2.område
WHERE t1.tid = '2020'
AND t2.tid = '2023'
AND t1.område != 'Hele landet'
ORDER BY growth_pct DESC;
Use when: Comparing regions across same indicator
Pattern 3: Multi-Indicator Analysis
When joining different indicators by time and region:
-- Example: Correlate unemployment with business bankruptcies
SELECT
u.tid AS year,
u.indhold AS unemployment_rate,
b.indhold AS bankruptcies,
ROUND(b.indhold::FLOAT / u.indhold, 2) AS bankruptcies_per_unemployed
FROM dst_unemployment u
INNER JOIN dst_bankruptcies b
ON u.tid = b.tid
AND u.område = b.område
WHERE u.område = 'Hele landet'
ORDER BY u.tid;
Use when: Exploring relationships between different indicators
Pattern 4: Aggregate Join
When one table is at higher granularity:
-- Example: Join total population with regional breakdown
SELECT
total.tid AS year,
total.indhold AS total_population,
region.område AS region,
region.indhold AS region_population,
ROUND(region.indhold::FLOAT / total.indhold * 100, 2) AS pct_of_total
FROM dst_folk1a total
INNER JOIN dst_folk1a region
ON total.tid = region.tid
WHERE total.område = 'Hele landet'
AND region.område != 'Hele landet'
AND region.område LIKE 'Region%'
ORDER BY total.tid, pct_of_total DESC;
Use when: Comparing parts to whole
Data Type Handling
DST data stored as TEXT, convert for calculations:
-- Convert to numeric for math
SELECT
tid,
CAST(indhold AS INTEGER) AS value_int,
CAST(indhold AS FLOAT) AS value_float,
CAST(indhold AS DECIMAL(15,2)) AS value_decimal
FROM dst_table;
Common Calculations
Growth Rate
ROUND((new_value::FLOAT - old_value) / old_value * 100, 2) AS growth_pct
Year-over-Year Change
SELECT
t1.tid AS year,
t1.indhold AS current_value,
LAG(t1.indhold) OVER (ORDER BY t1.tid) AS previous_value,
t1.indhold::FLOAT - LAG(t1.indhold) OVER (ORDER BY t1.tid) AS yoy_change
FROM dst_table t1;
Moving Average
SELECT
tid,
indhold,
AVG(indhold::FLOAT) OVER (
ORDER BY tid
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3y
FROM dst_table;
Correlation (Pearson)
-- DuckDB has built-in correlation
SELECT
CORR(t1.indhold::FLOAT, t2.indhold::FLOAT) AS correlation_coefficient
FROM dst_table1 t1
INNER JOIN dst_table2 t2 ON t1.tid = t2.tid;
Handling Missing Data
DST tables may have:
NULLvalues".."for unavailable data"-"for zero/not applicable
-- Filter out missing data
WHERE indhold IS NOT NULL
AND indhold NOT IN ('..', '-', '')
AND TRY_CAST(indhold AS INTEGER) IS NOT NULL
Query Optimization Tips
- Filter early: Apply WHERE before JOIN when possible
- Use indexes: DuckDB auto-indexes, but column order matters
- Limit rows: Add LIMIT for exploration, remove for final analysis
- Aggregate wisely: Use GROUP BY only when necessary
Example Multi-Table Research Query
-- Research question: "How does population growth correlate with housing starts?"
WITH pop_growth AS (
SELECT
t1.tid AS year,
(t2.indhold::FLOAT - t1.indhold) / t1.indhold * 100 AS pop_growth_pct
FROM dst_folk1a t1
INNER JOIN dst_folk1a t2 ON t1.tid = CAST(CAST(t2.tid AS INTEGER) - 1 AS VARCHAR)
WHERE t1.område = 'Hele landet' AND t2.område = 'Hele landet'
),
housing AS (
SELECT
tid AS year,
indhold::INTEGER AS housing_starts
FROM dst_housing_table
WHERE område = 'Hele landet'
)
SELECT
p.year,
p.pop_growth_pct,
h.housing_starts,
CORR(p.pop_growth_pct, h.housing_starts) OVER () AS correlation
FROM pop_growth p
INNER JOIN housing h ON p.year = h.year
ORDER BY p.year;
Usage Guidelines
When dst-research-analyst invokes you:
- Identify join keys: What dimensions do tables share?
- Choose pattern: Which pattern fits the analysis?
- Handle data types: Convert TEXT to numeric
- Filter missing data: Remove NULL/".."/"-"
- Add calculations: Growth rates, percentages, correlations
- Order results: By time or magnitude
- Return SQL: Provide query for dst-query skill to execute
Remember: You provide the SQL patterns, dst-query executes them.
Related Skills
Attack Tree Construction
Build comprehensive attack trees to visualize threat paths. Use when mapping attack scenarios, identifying defense gaps, or communicating security risks to stakeholders.
Grafana Dashboards
Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
Matplotlib
Foundational plotting library. Create line plots, scatter, bar, histograms, heatmaps, 3D, subplots, export PNG/PDF/SVG, for scientific visualization and publication figures.
Scientific Visualization
Create publication figures with matplotlib/seaborn/plotly. Multi-panel layouts, error bars, significance markers, colorblind-safe, export PDF/EPS/TIFF, for journal-ready scientific plots.
Seaborn
Statistical visualization. Scatter, box, violin, heatmaps, pair plots, regression, correlation matrices, KDE, faceted plots, for exploratory analysis and publication figures.
Shap
Model interpretability and explainability using SHAP (SHapley Additive exPlanations). Use this skill when explaining machine learning model predictions, computing feature importance, generating SHAP plots (waterfall, beeswarm, bar, scatter, force, heatmap), debugging models, analyzing model bias or fairness, comparing models, or implementing explainable AI. Works with tree-based models (XGBoost, LightGBM, Random Forest), deep learning (TensorFlow, PyTorch), linear models, and any black-box model
Pydeseq2
Differential gene expression analysis (Python DESeq2). Identify DE genes from bulk RNA-seq counts, Wald tests, FDR correction, volcano/MA plots, for RNA-seq analysis.
Query Writing
For writing and executing SQL queries - from simple single-table queries to complex multi-table JOINs and aggregations
Pydeseq2
Differential gene expression analysis (Python DESeq2). Identify DE genes from bulk RNA-seq counts, Wald tests, FDR correction, volcano/MA plots, for RNA-seq analysis.
Scientific Visualization
Meta-skill for publication-ready figures. Use when creating journal submission figures requiring multi-panel layouts, significance annotations, error bars, colorblind-safe palettes, and specific journal formatting (Nature, Science, Cell). Orchestrates matplotlib/seaborn/plotly with publication styles. For quick exploration use seaborn or plotly directly.
