Analyzing TDigest Metrics
by rustomax
Analyze percentile metrics (tdigest type) using OPAL for latency analysis and SLO tracking. Use when calculating p50, p95, p99 from pre-aggregated duration or latency metrics. Covers the critical double-combine pattern with align + m_tdigest() + tdigest_combine + aggregate. For simple metrics (counts, averages), see aggregating-gauge-metrics skill.
Analyzing TDigest Metrics
TDigest metrics in Observe store pre-aggregated percentile data for efficient latency and duration analysis. This skill teaches the specialized pattern for querying tdigest metrics using OPAL.
When to Use This Skill
- Calculating latency percentiles (p50, p95, p99) for services or endpoints
- Analyzing request duration distributions
- Setting or tracking SLOs (Service Level Objectives) based on percentiles
- Understanding performance characteristics beyond simple averages
- Working with any metric of type `tdigest`
- When you need accurate percentile calculations from pre-aggregated data
Prerequisites
- Access to Observe tenant via MCP
- Understanding that tdigest metrics are pre-aggregated percentile structures
- Metric dataset with type `tdigest`
- Familiarity with percentiles (p50 = median, p95 = 95th percentile, etc.)
- Use `discover_context()` to find and inspect tdigest metrics
Key Concepts
What Are TDigest Metrics?
TDigest (t-digest) is a probabilistic data structure for estimating percentiles efficiently:
- Pre-aggregated percentile data: not raw values, but compressed statistical summaries
- Stores distribution information in compact form
- Enables accurate percentile calculations
- Much more efficient than storing all raw values
Why percentiles matter:
- Averages hide outliers: A service with avg 100ms might have p99 at 10 seconds
- SLOs use percentiles: "p95 latency < 500ms" is a common SLO target
- User experience: p95/p99 show what real users experience, not just average case
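One way to surface this gap directly is to query p50 and p99 side by side and compare them; a large ratio means the median hides a long tail. A minimal sketch using the double-combine pattern explained below (the metric name is illustrative):

```
align options(bins: 1), combined:tdigest_combine(m_tdigest("request_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99)
| make_col tail_ratio:p99 / p50
```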
Common Examples:
- `span_sn_service_node_duration_tdigest_5m` - Service-to-service latency percentiles
- `span_sn_service_edge_duration_tdigest_5m` - Edge latency percentiles
- `request_duration_tdigest_5m` - Request duration percentiles
- `database_query_duration_tdigest_5m` - Database query latency percentiles
CRITICAL: The Double-Combine Pattern
TDigest metrics require a special pattern that's different from gauge metrics:
```
# WRONG - Missing second combine ❌
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(combined, 0.95)
```

```
# CORRECT - Double-combine pattern ✅
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
```
Why the double combine?
- First `tdigest_combine` (in `align`): combines tdigest data points within time buckets
- Second `tdigest_combine` (in `aggregate`): re-combines tdigests across groups/dimensions
- Then `tdigest_quantile`: calculates the actual percentile value
Pattern breakdown:
```
align options(bins: 1),
  combined:tdigest_combine(m_tdigest("metric_name"))  ← First combine
aggregate p95:tdigest_quantile(
  tdigest_combine(combined),                          ← Second combine (NESTED!)
  0.95),                                              ← Quantile value (0.0-1.0)
  group_by(service_name)
```
Percentile Values
Percentiles are specified as decimal values from 0.0 to 1.0:
| Percentile | Value | Meaning |
|---|---|---|
| p50 (median) | 0.50 | 50% of values are below this |
| p75 | 0.75 | 75% of values are below this |
| p90 | 0.90 | 90% of values are below this |
| p95 | 0.95 | 95% of values are below this |
| p99 | 0.99 | 99% of values are below this |
| p99.9 | 0.999 | 99.9% of values are below this |
Common SLO targets: p95 < 500ms, p99 < 1000ms
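The same pattern extends to any decimal in this range; for example, a hedged sketch of a p99.9 query (metric name illustrative):

```
align options(bins: 1), combined:tdigest_combine(m_tdigest("request_duration_tdigest_5m"))
aggregate p999:tdigest_quantile(tdigest_combine(combined), 0.999)
```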
Summary vs Time-Series (Same as Gauge Metrics)
| Output Type | Pattern | Result | Pipe? |
|---|---|---|---|
| Summary | `options(bins: 1)` | One row per group | NO |
| Time-Series | `5m`, `1h` | Many rows per group | YES |
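As a sketch, here is the same query in both forms (metric name illustrative; see Pattern 5 and Pitfall 3 for details):

```
# Summary: one row per group - NO pipe before aggregate
align options(bins: 1), combined:tdigest_combine(m_tdigest("request_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

# Time-series: one row per bucket - pipe required
align 5m, combined:tdigest_combine(m_tdigest("request_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
```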
Discovery Workflow
Step 1: Search for tdigest metrics
discover_context("duration tdigest", result_type="metric")
discover_context("latency percentile", result_type="metric")
Step 2: Get detailed metric schema
```
discover_context(metric_name="span_sn_service_node_duration_tdigest_5m")
```
Step 3: Verify metric type
Look for `Type: tdigest` (critical!)
Step 4: Note available dimensions
Used for `group_by()`:
- `service_name`, `for_service_name`
- `environment`, `for_environment`
- etc. (shown in discovery output)
Step 5: Write the query using the double-combine pattern and the correct dimensions
Basic Patterns
Pattern 1: Overall Percentiles (No Grouping)
Calculate percentiles across all data:
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
  p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99)
```
Output: Single row with overall p50, p95, p99 across entire time range.
Note: Both combines present, no group_by.
Pattern 2: Percentiles Per Service
Calculate percentiles broken down by dimension:
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
  p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name)
```
Output: One row per service with percentiles.
Pattern 3: Single Percentile (Common for SLOs)
Get just p95 for SLO tracking:
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  group_by(service_name)
| sort desc(p95)
| limit 10
```
Output: Top 10 services by p95 latency.
Use case: Identify slowest services for optimization.
Pattern 4: Converting Units
TDigest values are often in nanoseconds - convert for readability:
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50_ns:tdigest_quantile(tdigest_combine(combined), 0.50),
  p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name)
| make_col p50_ms:p50_ns / 1000000,
  p95_ms:p95_ns / 1000000,
  p99_ms:p99_ns / 1000000
```
Output: Percentiles in both nanoseconds and milliseconds.
Note: Check sample values in discover_context() to identify units.
Pattern 5: Time-Series Percentiles
Track percentiles over time buckets:
```
align 5m, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  group_by(service_name)
```
Output: Multiple rows per service (one per 5-minute interval).
Note: Pipe | required for time-series pattern.
Use case: Dashboard charts showing latency trends over time.
Common Use Cases
SLO Tracking: p95 Latency Under Threshold
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
  group_by(service_name)
| make_col p95_ms:p95_ns / 1000000
| make_col slo_target:500,
  meets_slo:if(p95_ms < 500, "yes", "no")
| sort desc(p95_ms)
```
Use case: Check which services meet p95 < 500ms SLO target.
Output: Services with SLO compliance status.
Latency Distribution Analysis
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
  p75:tdigest_quantile(tdigest_combine(combined), 0.75),
  p90:tdigest_quantile(tdigest_combine(combined), 0.90),
  p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name)
| make_col p50_ms:p50 / 1000000,
  p95_ms:p95 / 1000000,
  p99_ms:p99 / 1000000
```
Use case: Understand full latency distribution to identify outliers.
Insight: Large gap between p95 and p99 indicates inconsistent performance.
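To turn that insight into a query, one option is a derived tail-ratio column (p99 divided by p95) so the most inconsistent services sort to the top. A sketch reusing only constructs shown above:

```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name)
| make_col tail_ratio:p99 / p95
| sort desc(tail_ratio)
```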
Comparing Services by Latency
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  group_by(service_name)
| make_col p95_ms:p95 / 1000000
| sort desc(p95_ms)
| limit 10
```
Use case: Find slowest services to prioritize optimization efforts.
Time-Series for Incident Investigation
```
align 5m, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  group_by(service_name)
| filter service_name = "frontend"
| make_col p95_ms:p95 / 1000000
```
Use case: See when latency spiked during an incident.
Output: Timeline of p95 latency for specific service.
Multi-Dimension Grouping
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  group_by(service_name, environment)
| make_col p95_ms:p95 / 1000000
| sort desc(p95_ms)
```
Use case: Compare latency across services AND environments.
Complete Example
Scenario: You're tracking SLOs for your microservices. The target is p95 latency < 500ms and p99 latency < 1000ms for all production services.
Step 1: Discover tdigest metrics
discover_context("duration tdigest", result_type="metric")
Found: span_sn_service_node_duration_tdigest_5m (type: tdigest)
Step 2: Get metric details
```
discover_context(metric_name="span_sn_service_node_duration_tdigest_5m")
```
Available dimensions: service_name, environment, for_service_name
Step 3: Query for SLO compliance
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name, environment)
| make_col p95_ms:p95_ns / 1000000,
  p99_ms:p99_ns / 1000000
| make_col p95_slo:if(p95_ms < 500, "✓", "✗"),
  p99_slo:if(p99_ms < 1000, "✓", "✗")
| filter environment = "production"
| sort desc(p95_ms)
```
Step 4: Interpret results
| service_name | environment | p95_ms | p99_ms | p95_slo | p99_slo |
|---|---|---|---|---|---|
| frontend | production | 19373.5 | 5641328.2 | ✗ | ✗ |
| featureflagservice | production | 5838.8 | 7473.9 | ✗ | ✗ |
| cartservice | production | 4136.6 | 5898.3 | ✗ | ✗ |
| productcatalogservice | production | 257.0 | 313.1 | ✓ | ✓ |
| currencyservice | production | 54.1 | 125.1 | ✓ | ✓ |
Insight: Frontend, featureflagservice, and cartservice are violating SLOs - need optimization.
Step 5: Investigate frontend latency over time
```
align 1h, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name)
| filter service_name = "frontend"
| make_col p95_ms:p95 / 1000000, p99_ms:p99 / 1000000
```
Output: Hourly p95/p99 trends to identify when latency degraded.
Common Pitfalls
Pitfall 1: Forgetting Second Combine
❌ Wrong (most common mistake):
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(combined, 0.95)
```
✅ Correct:
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
```
Why: TDigest requires combining twice - once in align, once in aggregate.
Error message: "the field has to be aggregated or grouped"
Pitfall 2: Using m() Instead of m_tdigest()
❌ Wrong:
```
align options(bins: 1), combined:tdigest_combine(m("duration_tdigest_5m"))
```
✅ Correct:
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("duration_tdigest_5m"))
```
Why: TDigest metrics require the `m_tdigest()` function, not `m()`.
Check: Look for `Type: tdigest` in `discover_context()` output.
Pitfall 3: Wrong Pipe Usage (Same as Gauge)
❌ Wrong (pipe with bins:1):
```
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
```
✅ Correct:
```
# Summary - NO pipe
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

# Time-series - YES pipe
align 5m, combined:tdigest_combine(m_tdigest("metric"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
```
Pitfall 4: Percentile Value Out of Range
❌ Wrong:
```
aggregate p95:tdigest_quantile(tdigest_combine(combined), 95)
```
✅ Correct:
```
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
```
Why: Quantile values must be 0.0 to 1.0 (not 1 to 100).
Pitfall 5: Not Converting Units
❌ Wrong (values in nanoseconds, hard to read):
```
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
```
Result: p95 = 14675991.25 (what unit is this?)
✅ Correct (convert to milliseconds):
```
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95)
| make_col p95_ms:p95_ns / 1000000
```
Result: p95_ms = 14.68 (clearly milliseconds)
Tip: Check sample values in discovery to identify units (19-digit numbers = nanoseconds).
Percentile Reference
Common percentiles and their meanings:
| Percentile | Decimal | Meaning | Common Use |
|---|---|---|---|
| p50 | 0.50 | Median (middle value) | Typical user experience |
| p75 | 0.75 | 75th percentile | Better than average case |
| p90 | 0.90 | 90th percentile | Catching most outliers |
| p95 | 0.95 | 95th percentile | Standard SLO target |
| p99 | 0.99 | 99th percentile | Tail latency / worst 1% |
| p99.9 | 0.999 | 99.9th percentile | Extreme outliers |
SLO best practice: Track p95 and p99, not just averages.
Unit Conversion Reference
Common time unit conversions (assuming nanoseconds):
```
# Nanoseconds to milliseconds (most common)
make_col value_ms:value_ns / 1000000

# Nanoseconds to seconds
make_col value_sec:value_ns / 1000000000

# Nanoseconds to microseconds
make_col value_us:value_ns / 1000
```
How to identify units: Check sample values in discover_context():
- 19 digits (1760201545280843522) = nanoseconds
- 13 digits (1758543367916) = milliseconds
- 10 digits (1758543367) = seconds
Best Practices
- Always use the double-combine pattern - the most critical rule for tdigest
- Verify the metric type - must be `tdigest` (not `gauge`)
- Check units - convert nanoseconds to milliseconds for readability
- Use multiple percentiles - p50, p95, p99 show the full distribution
- Calculate SLO compliance - add derived columns comparing to targets
- Sort and limit - focus on the worst offenders with `sort desc() | limit 10`
- Use time-series for investigation - see when latency changed
- Group by relevant dimensions - service, environment, endpoint, etc.
Related Skills
- aggregating-gauge-metrics - For count/sum/avg metrics (NOT percentiles)
- working-with-intervals - For calculating percentiles from raw interval data (slower)
- time-series-analysis - For event/interval trending with timechart
Summary
TDigest metrics enable efficient percentile calculations:
- Core pattern: `align` + `m_tdigest()` + double `tdigest_combine` + `tdigest_quantile`
- Critical rule: use `tdigest_combine()` TWICE (in align AND in aggregate)
- Metric function: `m_tdigest()` (NOT `m()`)
- Percentile values: 0.0 to 1.0 (0.95 = p95)
- Common percentiles: p50 (median), p95 (SLO), p99 (tail latency)
- Units: often nanoseconds - convert to milliseconds for readability
Key distinction: TDigest metrics use the special double-combine pattern, while gauge metrics use a simple `m()` + `aggregate`.
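For contrast, a minimal gauge-style query (covered in the aggregating-gauge-metrics skill) needs no tdigest functions at all. This is a rough sketch with a hypothetical gauge metric name; see that skill for the exact pattern:

```
# Gauge: single aggregation, no double combine (metric name hypothetical)
align options(bins: 1), avg_value:avg(m("some_gauge_metric"))
aggregate overall_avg:avg(avg_value), group_by(service_name)
```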
Last Updated: November 14, 2025
Version: 1.0
Tested With: Observe OPAL (ServiceExplorer/Service Inspector Metrics)
