Aggregating Gauge Metrics
by rustomax
Aggregate pre-computed metrics (gauge, counter, delta types) using OPAL. Use when analyzing request counts, error rates, resource utilization, or any numeric metrics over time. Covers align + m() + aggregate pattern, summary vs time-series output, and common aggregation functions. For percentile metrics (tdigest), see analyzing-tdigest-metrics skill.
Pre-computed metrics in Observe store aggregated measurements at regular intervals (typically every 5 minutes). This skill teaches how to query gauge, counter, and delta metric types using OPAL.
When to Use This Skill
- Analyzing request counts, error rates, or throughput metrics
- Tracking resource utilization (CPU, memory, network)
- Computing totals, averages, or rates across time periods
- Creating dashboards with time-series charts
- Working with any gauge, counter, or delta metric type
- When you need summary statistics or trends over time
Prerequisites
- Access to Observe tenant via MCP
- Understanding that metrics are pre-aggregated (not raw events)
- Metric dataset with type: gauge, counter, or delta
- Use discover_context() to find and inspect metrics
Key Concepts
What Are Gauge Metrics?
Gauge metrics are pre-aggregated numeric measurements collected at regular intervals. "Pre-aggregated" means the values are already summarized at collection time (typically into 5-minute intervals), which brings:
- More efficient queries than scanning raw data
- Faster query performance
- Lower storage costs
Common Metric Types:
- Gauge: Point-in-time value (CPU utilization, memory usage, queue depth)
- Counter: Monotonically increasing value (total requests, bytes sent)
- Delta: Change between intervals (requests per interval, errors per interval)
Examples:
- span_call_count_5m - Number of requests per 5-minute interval
- span_error_count_5m - Number of errors per 5-minute interval
- system_cpu_utilization_ratio - CPU utilization percentage
- k8s_pod_memory_available_bytes - Available memory in bytes
CRITICAL: The align Verb is REQUIRED
Unlike datasets (Events/Intervals), metrics MUST use the align verb:
# WRONG - Will not work ❌
m("span_call_count_5m")
| statsby total:sum(metric)
# CORRECT - Must use align ✅
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate)
Why align is required: Metrics are stored as time-series data that must be aligned to a time grid before aggregation.
Summary vs Time-Series Output
OPAL metrics queries can produce two different output types:
| Output Type | Pattern | Result | Use Case |
|---|---|---|---|
| Summary | options(bins: 1) | One row per group | Totals, overall statistics |
| Time-Series | 5m, 1h, or default | Many rows per group | Trending, dashboards, charts |
Summary pattern - Single statistics across entire time range:
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(service_name)
Output: One row per service
Time-series pattern - Values over time buckets:
align 5m, rate:sum(m("metric"))
| aggregate total:sum(rate), group_by(service_name)
Output: Multiple rows per service (one per 5-minute bucket)
CRITICAL Syntax Difference:
- Summary (bins: 1): NO pipe (|) between align and aggregate
- Time-series (5m): YES pipe (|) between align and aggregate
Discovery Workflow
Step 1: Search for metrics
discover_context("request count", result_type="metric")
discover_context("error", result_type="metric")
discover_context("cpu memory", result_type="metric")
Step 2: Get detailed metric schema
discover_context(metric_name="span_call_count_5m")
Step 3: Verify metric type
Look for: Type: gauge (or counter, delta)
Step 4: Note available dimensions
These are used for group_by():
- service_name, service_namespace
- environment, span_name
- k8s_namespace_name, k8s_pod_name
- etc. (shown in discovery output)
Step 5: Write query
Use align + m() + aggregate pattern with correct dimensions
Basic Patterns
Pattern 1: Total Count Across Time Range
Get overall totals (summary output):
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate)
Output: Single row with total count across entire time range.
No group_by: Aggregates everything together.
Pattern 2: Totals Per Group
Get totals broken down by dimension:
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
Output: One row per service with total requests.
group_by: Use any dimension from metric schema.
Pattern 3: Average Values Per Group
Calculate averages across time range:
align options(bins: 1), cpu:avg(m("system_cpu_utilization_ratio"))
aggregate avg_cpu:avg(cpu), group_by(service_name)
Output: Average CPU utilization per service.
avg() function: Used twice - once in align, once in aggregate.
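Why twice: under align's time-grid behavior described earlier, the avg() inside align summarizes each series' samples within a time bin, while the avg() inside aggregate then combines those aligned values across series in each group. A commented restatement of the pattern above, under that reading:
# align stage: average each series' raw samples onto the time grid
align options(bins: 1), cpu:avg(m("system_cpu_utilization_ratio"))
# aggregate stage: average the aligned values across series per service
aggregate avg_cpu:avg(cpu), group_by(service_name)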
Pattern 4: Multiple Aggregations
Compute several statistics together:
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total:sum(rate),
average:avg(rate),
maximum:max(rate),
group_by(service_name)
Output: Multiple columns per service (total, average, maximum).
Pattern 5: Time-Series for Trending
Track values over time buckets:
align 5m, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_5min:sum(rate), group_by(service_name)
Output: Multiple rows per service (one per 5-minute interval).
Note: Pipe | required after align for time-series pattern.
Output columns:
- _c_bucket - Time bucket identifier
- valid_from, valid_to - Bucket boundaries
- Metric values
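When charting, it can help to order rows by bucket start. A minimal sketch, assuming the valid_from column listed above and OPAL's asc() sort order:
# hourly buckets, ordered by bucket start time
align 1h, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_hour:sum(rate), group_by(service_name)
| sort asc(valid_from)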
Common Use Cases
Counting Total Requests by Service
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
| sort desc(total_requests)
| limit 10
Use case: Identify top services by request volume.
Counting Errors with Fill for Zero Values
align options(bins: 1), errors:sum(m("span_error_count_5m"))
aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0
Use case: Show all services, even those with zero errors.
fill verb: Replaces null values with 0.
Tracking Request Rate Over Time
align 1h, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_hour:sum(rate), group_by(service_name)
Use case: Hourly request trends for dashboards.
Output: Time-series data for charting.
Multiple Metrics in One Query
align options(bins: 1),
requests:sum(m("span_call_count_5m")),
errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
total_errors:sum(errors),
group_by(service_name)
| make_col error_rate:float64(total_errors) / float64(total_requests)
Use case: Calculate error rate from two metrics.
make_col: Add derived column after aggregation.
Resource Utilization Averages
align options(bins: 1), cpu:avg(m("system_cpu_utilization_ratio"))
aggregate avg_cpu:avg(cpu),
max_cpu:max(cpu),
group_by(k8s_pod_name)
| sort desc(avg_cpu)
| limit 20
Use case: Find pods with highest CPU usage.
Complete Example
Scenario: You want to analyze request and error rates for your microservices over the last 24 hours.
Step 1: Discover available metrics
discover_context("request error", result_type="metric")
Found metrics:
- span_call_count_5m (type: gauge)
- span_error_count_5m (type: gauge)
Step 2: Get metric details
discover_context(metric_name="span_call_count_5m")
Available dimensions: service_name, service_namespace, environment, span_name
Step 3: Query for summary statistics
align options(bins: 1),
requests:sum(m("span_call_count_5m")),
errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
total_errors:sum(errors),
group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(total_requests)
Step 4: Interpret results
| service_name | total_requests | total_errors | error_rate |
|---|---|---|---|
| frontend-proxy | 15660 | 0 | 0.0 |
| frontend | 15263 | 35 | 0.23 |
| featureflagservice | 11693 | 0 | 0.0 |
| productcatalogservice | 8813 | 0 | 0.0 |
Insight: The frontend service has a 0.23% error rate - its errors are worth investigating.
Step 5: Get hourly trends
align 1h,
requests:sum(m("span_call_count_5m")),
errors:sum(m("span_error_count_5m"))
| aggregate requests_per_hour:sum(requests),
errors_per_hour:sum(errors),
group_by(service_name)
| filter service_name = "frontend"
Output: Time-series showing frontend requests and errors per hour.
Common Pitfalls
Pitfall 1: Forgetting align Verb
❌ Wrong:
m("span_call_count_5m")
| statsby total:sum(metric)
✅ Correct:
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total:sum(rate)
Why: Metrics MUST use the align verb - it's required, not optional.
Pitfall 2: Wrong Pipe Usage
❌ Wrong (pipe with bins:1):
align options(bins: 1), rate:sum(m("metric"))
| aggregate total:sum(rate)
❌ Wrong (no pipe with time duration):
align 5m, rate:sum(m("metric"))
aggregate total:sum(rate)
✅ Correct:
# Summary - NO pipe
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate)
# Time-series - YES pipe
align 5m, rate:sum(m("metric"))
| aggregate total:sum(rate)
Why: Syntax differs between summary and time-series patterns.
Pitfall 3: Grouping by Non-Existent Dimension
❌ Wrong:
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(service_name)
Error: "field 'service_name' does not exist"
✅ Correct:
# First: discover_context(metric_name="metric") to see available dimensions
# Then: use only dimensions that exist
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(correct_dimension_name)
Why: Not all metrics have the same dimensions - always check first.
Pitfall 4: Using statsby Instead of aggregate
❌ Wrong:
align options(bins: 1), rate:sum(m("metric"))
statsby total:sum(rate)
✅ Correct:
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate)
Why: After align, use aggregate (not statsby which is for datasets).
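For contrast, a minimal sketch of the dataset-side idiom (hypothetical event dataset with a service_name field; see the aggregating-event-datasets skill):
# raw events piped in from an event dataset - no align, statsby instead
statsby event_count:count(), group_by(service_name)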
Aggregation Functions Reference
Common functions used with gauge metrics:
# Summing values
align options(bins: 1), metric:sum(m("metric_name"))
aggregate total:sum(metric)
# Averaging values
align options(bins: 1), metric:avg(m("metric_name"))
aggregate average:avg(metric)
# Maximum value
align options(bins: 1), metric:max(m("metric_name"))
aggregate maximum:max(metric)
# Minimum value
align options(bins: 1), metric:min(m("metric_name"))
aggregate minimum:min(metric)
# Count of samples
align options(bins: 1), metric:count(m("metric_name"))
aggregate sample_count:count(metric)
Pattern: Function used in both align and aggregate.
Time Bucket Options
Common time durations for time-series queries:
align 1m, ... # 1-minute buckets
align 5m, ... # 5-minute buckets (common)
align 15m, ... # 15-minute buckets
align 1h, ... # 1-hour buckets
align 1d, ... # 1-day buckets
Default: align without duration uses automatic binning (300 bins).
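A minimal sketch of that default form - the duration is omitted, align picks bin sizes automatically, and the output is still time-series, so the pipe is still required:
# automatic binning (approximately 300 bins across the query window)
align rate:sum(m("span_call_count_5m"))
| aggregate requests:sum(rate), group_by(service_name)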
Best Practices
- Always use discover_context() first to find metrics and check dimensions
- Verify metric type - this skill is for gauge/counter/delta (NOT tdigest)
- Use summary pattern (bins: 1) for single statistics, reports, totals
- Use time-series pattern (5m, 1h) for dashboards, trending, charts
- Remember the pipe rule: bins: 1 = no pipe, time duration = yes pipe
- Use fill to replace nulls with zeros for complete results
- Add sort + limit for top-N queries to avoid overwhelming output
- Check available dimensions before using group_by
Related Skills
- analyzing-tdigest-metrics - For percentile metrics (latency, duration p95/p99)
- time-series-analysis - For event/interval trending with timechart (different from metrics)
- aggregating-event-datasets - For aggregating raw events with statsby (different from metrics)
- working-with-intervals - For calculating durations from raw interval data
Summary
Gauge metrics are pre-aggregated measurements that require the align verb:
- Core pattern: align + m() + aggregate
- Metric types: gauge, counter, delta (NOT tdigest)
- Two output modes:
  - Summary: options(bins: 1) → one row per group, NO pipe
  - Time-series: 5m, 1h → many rows per group, YES pipe
- Common functions: sum, avg, max, min, count
- Discovery: Use discover_context() to find metrics and dimensions
Key distinction: Metrics are pre-aggregated (use align), while Events/Intervals are raw data (use statsby/timechart).
Last Updated: November 14, 2025
Version: 1.0
Tested With: Observe OPAL (ServiceExplorer/Service Metrics)