Subquery Patterns And Union
by rustomax
Use OPAL subquery syntax (@labels) and union operations to combine multiple datasets or time periods. Essential for period-over-period comparisons, multi-dataset analysis, and complex data transformations. Covers @label <- @ syntax, timeshift for temporal shifts, union for combining results, and any_not_null() for collapsing grouped data.
Skill Details
Repository Files
1 file in this skill directory
name: subquery-patterns-and-union description: Use OPAL subquery syntax (@labels) and union operations to combine multiple datasets or time periods. Essential for period-over-period comparisons, multi-dataset analysis, and complex data transformations. Covers @label <- @ syntax, timeshift for temporal shifts, union for combining results, and any_not_null() for collapsing grouped data.
Subquery Patterns and Union Operations
Overview
OPAL subqueries using @label syntax enable powerful multi-dataset and multi-period analysis. This skill covers:
- Subquery syntax with
@label <- @dataset - Union operations to combine multiple result sets
- Timeshift for period-over-period comparisons
- Best practices for complex data transformations
When to Use This Skill
Use subqueries and union when you need to:
- Period-over-period comparison: Compare current vs previous hour/day/week (metrics or events)
- Time-series comparison: Chart trends over time with period-over-period data
- Complex transformations: Build intermediate results for multi-stage calculations (e.g., SLO tracking, error budgets)
Core Concepts
Subquery Syntax
# Basic pattern
@label <- @dataset_reference {
# OPAL pipeline
}
# Reference primary input
@current <- @ {
# Process primary dataset
}
# Reference named dataset (requires dataset_aliases parameter)
@other <- @dataset_name {
# Process other dataset
}
# Use subquery results
<- @label {
# Continue processing
}
Key Points:
@alone references the primary input dataset@dataset_namereferences a named dataset via aliases@labelcreates a reusable intermediate result<- @labelcontinues the pipeline from that subquery
Union Operation
Union combines multiple result sets with matching columns:
@set_a <- @ {
# First result set
}
@set_b <- @ {
# Second result set with same column structure
}
<- @set_a {
union @set_b
# Combined results
}
Important:
- Union requires matching column names
- Rows from both sources appear in output
- Use
any_not_null()to collapse sparse union results - Union happens AFTER aggregation, not before
Timeshift Verb
Timeshift moves row timestamps forward (positive) or backward (negative):
timeshift 1h # Move 1 hour forward
timeshift -1d # Move 1 day backward
timeshift 30m # Move 30 minutes forward
Critical Rule: Apply timeshift BEFORE align when working with metrics!
# CORRECT
timeshift 1h
align rate:sum(m("metric"))
# WRONG
align rate:sum(m("metric"))
timeshift 1h # Too late! Align already processed time buckets
Pattern 1: Period-Over-Period Comparison (Metrics)
Use Case: Compare current metrics to previous period (hour, day, week)
Strategy:
- Create
@currentsubquery with current period aggregation - Create
@previoussubquery with timeshift + same aggregation - Union both, then collapse with
any_not_null() - Calculate change and percentage change
Example: Compare Current Hour vs Previous Hour
# Current period (last 1h)
@current <- @ {
align rate:sum(m("span_call_count_5m"))
aggregate current_sum:sum(rate), group_by(service_name)
}
# Previous period (1h before that)
@previous <- @ {
timeshift 1h # Shift BEFORE align!
align rate:sum(m("span_call_count_5m"))
aggregate prev_sum:sum(rate), group_by(service_name)
}
# Combine both periods
@combined <- @current {
union @previous
aggregate current:any_not_null(current_sum),
previous:any_not_null(prev_sum),
group_by(service_name)
}
# Calculate changes
<- @combined {
make_col change:current - previous
make_col pct_change:if(previous > 0, (change / previous) * 100, 0)
make_col abs_pct_change:if(pct_change < 0, -pct_change, pct_change)
filter abs_pct_change > 50 # Show only significant changes
sort desc(abs_pct_change)
}
Why This Works:
timeshift 1hmoves the "previous" data timestamps forward by 1 hour- When combined with current data, both align to same time buckets
any_not_null()picks the non-null value from each period- Result: side-by-side comparison in same row
Sample Output:
service_name current previous change pct_change
frontend-service 45000 15000 30000 200.0
checkout-api 8000 15000 -7000 -46.7
payment-service 500 10000 -9500 -95.0
Day-Over-Day Comparison
@today <- @ {
align rate:sum(m("span_call_count_5m"))
aggregate today_sum:sum(rate), group_by(service_name)
}
@yesterday <- @ {
timeshift 1d # 24 hours
align rate:sum(m("span_call_count_5m"))
aggregate yesterday_sum:sum(rate), group_by(service_name)
}
@combined <- @today {
union @yesterday
aggregate today:any_not_null(today_sum),
yesterday:any_not_null(yesterday_sum),
group_by(service_name)
}
<- @combined {
make_col change:today - yesterday
make_col pct_change:if(yesterday > 0, (change / yesterday) * 100, 0)
sort desc(pct_change)
}
Week-Over-Week Comparison
@this_week <- @ {
align 1h, rate:sum(m("span_call_count_5m"))
aggregate week_sum:sum(rate), group_by(service_name)
}
@last_week <- @ {
timeshift 7d # One week
align 1h, rate:sum(m("span_call_count_5m"))
aggregate last_week_sum:sum(rate), group_by(service_name)
}
@combined <- @this_week {
union @last_week
aggregate this_week:any_not_null(week_sum),
last_week:any_not_null(last_week_sum),
group_by(service_name)
}
<- @combined {
make_col growth:this_week - last_week
make_col growth_pct:if(last_week > 0, (growth / last_week) * 100, 0)
sort desc(growth_pct)
}
Pattern 2: Period-Over-Period Comparison (Events/Intervals)
Use Case: Compare raw event/span counts across time periods
Strategy: Same union pattern, but use statsby instead of align + aggregate
Example: Error Count This Hour vs Last Hour
@current <- @ {
filter error = true
statsby current_errors:count(), group_by(service_name)
}
@previous <- @ {
timeshift 1h
filter error = true
statsby prev_errors:count(), group_by(service_name)
}
@combined <- @current {
union @previous
aggregate current:any_not_null(current_errors),
previous:any_not_null(prev_errors),
group_by(service_name)
}
<- @combined {
make_col error_change:current - previous
make_col pct_change:if(previous > 0, (error_change / previous) * 100, 0)
filter current > 10 # Only services with significant errors
sort desc(current)
}
Key Difference: Use statsby for event datasets, align + aggregate for metrics.
Pattern 3: Time-Series Period Comparison
Use Case: Chart current vs previous period trends over time
Example: Current vs Previous Week (Hourly Buckets)
@current <- @ {
align 1h, rate:sum(m("span_call_count_5m"))
aggregate current_rate:sum(rate), group_by(service_name)
}
@previous <- @ {
timeshift 7d
align 1h, rate:sum(m("span_call_count_5m"))
aggregate prev_rate:sum(rate), group_by(service_name)
}
@combined <- @current {
union @previous
aggregate current:any_not_null(current_rate),
previous:any_not_null(prev_rate),
group_by(service_name, _c_bucket)
}
<- @combined {
make_col change:current - previous
make_col pct_change:if(previous > 0, (change / previous) * 100, 0)
}
Output: Time-series with both periods aligned by bucket, suitable for line charts showing trends over time.
Note: This pattern returns multiple rows per service (one per time bucket). For summary comparisons, use Pattern 1 instead.
Pattern 4: Building Intermediate Results
Use Case: Complex calculations requiring multiple steps
Example: Calculate Error Budget Consumption
# Step 1: Get total requests and errors
@base <- @ {
align options(bins: 1), rate:sum(m("span_call_count_5m")),
errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(rate),
total_errors:sum(errors),
group_by(service_name)
}
# Step 2: Calculate SLO metrics
@slo <- @base {
make_col error_rate:if(total_requests > 0, total_errors / total_requests, 0)
make_col success_rate:1 - error_rate
make_col slo_target:0.999 # 99.9% SLO
make_col error_budget:1 - slo_target
}
# Step 3: Calculate budget consumption
<- @slo {
make_col budget_consumed:if(error_budget > 0, error_rate / error_budget, 0)
make_col budget_remaining:1 - budget_consumed
make_col status:if(budget_consumed > 1, "VIOLATED",
if(budget_consumed > 0.8, "WARNING", "HEALTHY"))
filter total_requests > 1000 # Only services with traffic
sort desc(budget_consumed)
}
Sample Output:
service_name total_requests error_rate budget_consumed status
adservice 870 0.0276 27.6 VIOLATED
cartservice 2303 0.0091 9.1 VIOLATED
frontend 15108 0.0016 1.6 VIOLATED
productcatalog 8838 0.0000 0.0 HEALTHY
Why This Works:
@basesubquery aggregates raw metrics (requests + errors)@slosubquery builds on@base, adding calculated SLO fields- Final stage uses
@sloresults to compute budget status - Each stage can reference all columns from previous stages
Understanding any_not_null()
The any_not_null() function is crucial for union patterns:
# After union, you typically have sparse data:
# Row 1: current_sum=100, prev_sum=null
# Row 2: current_sum=null, prev_sum=80
aggregate current:any_not_null(current_sum),
previous:any_not_null(prev_sum),
group_by(service_name)
# Result:
# Row 1: current=100, previous=80
How it works:
- Groups by service_name (or other dimensions)
- For each group, finds any non-null value across all union rows
- Collapses sparse union into single row per group
Alternative functions:
any()- Picks arbitrary value (may be null)min()/max()- Numeric min/max (only for numbers)any_not_null()- Best for union collapse (picks any non-null)
Common Patterns Summary
| Use Case | Subqueries Needed | Key Verbs |
|---|---|---|
| Period-over-period (metrics) | 2+ | timeshift, align, union, any_not_null() |
| Period-over-period (events) | 2+ | timeshift, statsby, union, any_not_null() |
| Time-series comparison | 2+ | timeshift, align, union, any_not_null(), group_by(_c_bucket) |
| Complex calculations | 1-3 | make_col, pipeline stages |
Note: For A/B comparisons across different filter conditions, use conditional columns with if() statements instead of subqueries. For multi-dataset joins, use lookup or join verbs (see working-with-resources skill).
Troubleshooting
Issue: "Columns don't match in union"
Cause: Union requires exact column name matches
Solution: Ensure both subqueries produce same column names
# WRONG - column names don't match
@a <- @ { aggregate count_a:count() }
@b <- @ { aggregate count_b:count() }
<- @a { union @b } # Error!
# CORRECT - same column names
@a <- @ { aggregate cnt:count() }
@b <- @ { aggregate cnt:count() }
<- @a { union @b } # Works!
Issue: "All nulls after any_not_null()"
Cause: group_by dimensions don't align across union sources
Solution: Verify both subqueries group by same dimensions
# WRONG - different group_by
@a <- @ { aggregate cnt:count(), group_by(service_name) }
@b <- @ { aggregate cnt:count(), group_by(namespace) }
<- @a { union @b; aggregate total:any_not_null(cnt), group_by(service_name) }
# Result: Nulls (no matching groups)
# CORRECT - same group_by
@a <- @ { aggregate cnt:count(), group_by(service_name) }
@b <- @ { aggregate cnt:count(), group_by(service_name) }
<- @a { union @b; aggregate total:any_not_null(cnt), group_by(service_name) }
Issue: "Timeshift has no effect"
Cause: Timeshift applied AFTER align (too late!)
Solution: Always timeshift BEFORE align
# WRONG - timeshift after align
@previous <- @ {
align rate:sum(m("metric"))
timeshift 1h # Too late!
}
# CORRECT - timeshift before align
@previous <- @ {
timeshift 1h # First!
align rate:sum(m("metric"))
}
Issue: "Can't reference @label"
Cause: Trying to use label before it's defined
Solution: Define subquery first, then reference it
# WRONG - @combined used before definition
<- @combined { ... }
@combined <- @ { ... }
# CORRECT - define first
@combined <- @ { ... }
<- @combined { ... }
Performance Considerations
When to Use Subqueries vs Single Query
Use subqueries when:
- Period-over-period comparison (timeshift required)
- Building complex intermediate results
- Readability improves significantly
Avoid subqueries when:
- Simple A/B comparison (use conditional columns)
- Single metric aggregation
- Performance is critical (subqueries add overhead)
Optimization Tips
-
Filter early: Apply filters in subqueries, not after union
# GOOD @current <- @ { filter service_name = "frontend" # Filter early align rate:sum(m("metric")) } # BAD @current <- @ { align rate:sum(m("metric")) } <- @current { filter service_name = "frontend" # Filter late (processes all services) } -
Use options(bins: 1) for summaries: Reduces data volume in union
@current <- @ { align options(bins: 1), rate:sum(m("metric")) # Single row per service aggregate total:sum(rate), group_by(service_name) } -
Limit union sources: Each union source adds processing cost
Comparison: Subquery Union vs Window Functions
Both can solve period-over-period comparison, but have different tradeoffs:
| Aspect | Subquery + Union | Window(lag) |
|---|---|---|
| Syntax | More verbose | More concise |
| Flexibility | Can compare any time periods | Limited to adjacent rows |
| Performance | Processes data twice | Single pass |
| Time buckets | Aligns arbitrary periods | Only sequential buckets |
| Use case | Day-over-day, week-over-week | Row-to-row change detection |
Example: Hour-over-hour rate of change
# Window approach (simpler for sequential buckets)
align 1h, rate:sum(m("span_call_count_5m"))
| make_col prev_rate:window(lag(rate, 1), group_by(service_name))
| make_col change:rate - prev_rate
| make_col pct:if(prev_rate > 0, (change / prev_rate) * 100, 0)
# Union approach (more flexible, can compare any offset)
@current <- @ {
align 1h, rate:sum(m("span_call_count_5m"))
aggregate current:sum(rate), group_by(service_name)
}
@previous <- @ {
timeshift 1h
align 1h, rate:sum(m("span_call_count_5m"))
aggregate prev:sum(rate), group_by(service_name)
}
@combined <- @current {
union @previous
aggregate current:any_not_null(current), prev:any_not_null(prev), group_by(service_name)
}
<- @combined {
make_col change:current - prev
make_col pct:if(prev > 0, (change / prev) * 100, 0)
}
Recommendation:
- Use window(lag) for simple sequential comparisons (previous hour, previous bucket)
- Use union + timeshift for arbitrary period comparisons (same hour yesterday, last week)
Related Skills
- window-functions-deep-dive - Covers window(lag/lead/avg) patterns
- detecting-anomalies - Uses both union and window patterns
- aggregating-gauge-metrics - Foundation for metric aggregation
- time-series-analysis - Time bucketing with timechart
Key Takeaways
- Subquery syntax:
@label <- @dataset { pipeline } - Union combines: Same schema, different filters/time periods
- Timeshift before align: Critical for metric comparisons
- any_not_null(): Collapses sparse union results
- Not always best: Consider conditional columns for simple A/B tests
- Performance matters: Union processes data multiple times
When in doubt about subquery syntax or complex union patterns, use learn_observe_skill("OPAL subquery") for official documentation.
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
Xlsx
Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.
