Tron Dashboard Creating
by meriley
Create production-ready Grafana dashboards for TRON team services including consumer dashboards, task metadata, RQ rules manager, and Kafka metrics. Use when building dashboards for task event consumers or RMS TRON services.
name: tron-dashboard-creating
description: Create production-ready Grafana dashboards for TRON team services including consumer dashboards, task metadata, RQ rules manager, and Kafka metrics. Use when building dashboards for task event consumers or RMS TRON services.
version: 1.0.0
TRON Dashboard Creation
Purpose
Create on-call friendly Grafana dashboards for TRON team services using grafanalib and axon_helpers. Implements Grafana best practices including health overview panels, tiered information design, and service-specific alert annotations.
When NOT to Use
- Generic API dashboards without Kafka consumers (use standard rms_helpers patterns)
- Non-RMS services (use cd/ or common/ helpers)
- Simple metric additions to existing dashboards
Quick Start: Minimal TRON Consumer Dashboard
from grafanalib.core import Template, Threshold
from axon_helpers.graph_helpers import AxonGraph, AxonSingleStat, GenDashboard, UNITS
from axon_helpers.rms_helpers import generate_dashboard_template_values
# Template variable for filtering by deployment type
deployment_type_template = Template(
    name="deployment_type",
    label="Deployment Type",
    type="custom",
    default="All",
    includeAll=False,
    query="All : .*,Internal : .*-internal.*,Customer : .*-customer.*",
    options=[
        {"selected": True, "text": "All", "value": ".*"},
        {"selected": False, "text": "Internal", "value": ".*-internal.*"},
        {"selected": False, "text": "Customer", "value": ".*-customer.*"},
    ],
)
rows = {
    "Health Overview": [
        AxonSingleStat(
            title="Consumer Lag",
            expressions=[{"expr": 'sum(kafka_consumergroup_group_topic_sum_lag{topic="task-events-public",group=~".*taskmetadatasvc-task-event-consumer.*"})'}],
            thresholds=[
                Threshold("green", 0, 0.0),
                Threshold("orange", 1, 2500),
                Threshold("red", 2, 5000),
            ],
            reduceCalc="lastNotNull",
            graphMode="area",
            format="short",
            span=4,
        ),
    ],
}
dashboard = GenDashboard(
    title="My TRON Consumer Dashboard",
    uid="my-tron-consumer-dashboard",
    templating=generate_dashboard_template_values(
        additional_templates_list=[deployment_type_template]
    ),
    rows=rows,
)
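The custom template above states each choice twice, once in `query` and once in `options`, so the two can drift apart when a choice is edited. A minimal sketch of deriving both from a single mapping, assuming the comma-separated `text : value` format shown above; `custom_template_fields` is a hypothetical helper and not part of grafanalib or axon_helpers:

```python
def custom_template_fields(choices, default):
    """Build the `query` string and `options` list for a custom template.

    `choices` maps display text to value; `default` is the display text
    that should start out selected. Hypothetical helper for illustration.
    """
    if default not in choices:
        raise ValueError(f"default {default!r} is not one of {list(choices)}")
    query = ",".join(f"{text} : {value}" for text, value in choices.items())
    options = [
        {"selected": text == default, "text": text, "value": value}
        for text, value in choices.items()
    ]
    return {"query": query, "options": options}


fields = custom_template_fields(
    {"All": ".*", "Internal": ".*-internal.*", "Customer": ".*-customer.*"},
    default="All",
)
```

The resulting dict carries the same `query` and `options` values as the hand-written template, so it could be unpacked into the `Template(...)` call with `**fields`.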
Core Workflows
Workflow 1: Health Overview Dashboard (On-Call Triage)
Create a health overview row for instant on-call triage. This follows the KubeCon "Foolproof K8s Dashboards for Sleep-Deprived On-Calls" pattern.
Step 1: Define SLO Thresholds
# Define SLO constants at top of file (not hardcoded in panels)
CUSTOMER_LATENCY_WARNING_MS = 120000 # 2 minutes
CUSTOMER_LATENCY_CRITICAL_MS = 300000 # 5 minutes
CUSTOMER_LAG_WARNING = 2500
CUSTOMER_LAG_CRITICAL = 5000
DLQ_WARNING = 5
DLQ_CRITICAL = 10
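The latency constants are in milliseconds, while a panel may display minutes, so the conversion is worth doing in one place. A minimal sketch, assuming thresholds follow the green/orange/red ordering used throughout this skill; `slo_steps` is a hypothetical helper, and plain tuples stand in for grafanalib `Threshold` objects so the snippet runs without the library:

```python
CUSTOMER_LATENCY_WARNING_MS = 120000   # 2 minutes
CUSTOMER_LATENCY_CRITICAL_MS = 300000  # 5 minutes
MS_PER_MINUTE = 60000


def slo_steps(warning, critical, scale=1.0):
    """Return (color, index, value) steps for a higher-is-worse metric.

    Each tuple lines up with the positional arguments used in
    Threshold(color, index, value). `scale` divides the raw constants,
    e.g. 60000 to turn milliseconds into minutes.
    """
    if not 0 < warning < critical:
        raise ValueError("expected 0 < warning < critical")
    return [
        ("green", 0, 0.0),
        ("orange", 1, warning / scale),
        ("red", 2, critical / scale),
    ]


latency_steps = slo_steps(
    CUSTOMER_LATENCY_WARNING_MS, CUSTOMER_LATENCY_CRITICAL_MS, scale=MS_PER_MINUTE
)
```

In a dashboard file, `[Threshold(*step) for step in latency_steps]` would turn these tuples into threshold objects, keeping the panel values tied to the SLO constants.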
Step 2: Create Health Stat Panels
from grafanalib.core import Threshold
def health_stat(title, description, expr, unit, thresholds, span=2):
    """Create a stat panel showing current metric value with thresholds."""
    return AxonSingleStat(
        title=title,
        description=description,
        expressions=[{"expr": expr, "legendFormat": title}],
        thresholds=thresholds,
        reduceCalc="lastNotNull",
        graphMode="area",
        format=unit,
        decimals=1,
        span=span,
    )
# Health Overview panels (6 stat panels + timeline)
health_panels = [
    health_stat(
        title="Overall Health",
        description="Health score 0-100% based on Kafka lag",
        expr='(1 - clamp_max(sum(kafka_consumergroup_group_topic_sum_lag{...}) / 5000, 1)) * 100',
        unit="percent",
        thresholds=[
            Threshold("red", 0, 0.0),
            Threshold("orange", 1, 50.0),
            Threshold("green", 2, 90.0),
        ],
    ),
    health_stat(
        title="Max Latency",
        description="Max consumer latency in minutes",
        expr='max(rms_taskmetadatasvc_task_event_consumer_read_latency{...}) / 60000',
        unit="m",
        thresholds=[
            Threshold("green", 0, 0.0),
            Threshold("orange", 1, 2.0),
            Threshold("red", 2, 5.0),
        ],
    ),
    # ... Consumer Lag, DLQ Rate, Throughput panels
]
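The Overall Health expression maps total lag linearly onto 0-100% and saturates once lag reaches 5000. A small sketch of the same arithmetic in plain Python, to show where given lag values land relative to the 50 and 90 thresholds; `health_score` is a hypothetical helper that only mirrors the PromQL above:

```python
def health_score(total_lag, critical_lag=5000):
    """Mirror of (1 - clamp_max(lag / critical, 1)) * 100 from the panel."""
    ratio = min(total_lag / critical_lag, 1.0)  # same role as clamp_max(..., 1)
    return (1.0 - ratio) * 100.0


for lag in (0, 500, 2500, 5000, 20000):
    print(f"lag={lag:>6} -> health={health_score(lag):.0f}%")
```

With the thresholds above, the score stays at or above 90 (green) up to a lag of 500, at or above 50 (orange) up to 2500, and drops into red beyond that. Any lag of 5000 or more reads as 0%.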
Step 3: Add Timeline Graph
timeline = AxonGraph(
    title="Kafka Lag Timeline",
    expressions=[{
        "expr": 'sum(kafka_consumergroup_group_topic_sum_lag{topic="task-events-public",...})',
        "legendFormat": "Total Consumer Lag",
    }],
    thresholds=[
        Threshold("green", 0, 0.0),
        Threshold("orange", 1, float(CUSTOMER_LAG_WARNING)),
        Threshold("red", 2, float(CUSTOMER_LAG_CRITICAL)),
    ],
    thresholdsStyleMode="line+area",
    span=12,
    unit=UNITS.SHORT,
)
Validation:
- Health overview is first row (not collapsed)
- 5-6 stat panels cover: Health, Latency, Lag, DLQ, Throughput
- Thresholds use SLO constants (not hardcoded values)
- Timeline shows historical context
Workflow 2: Consumer Dashboard (Using ConsumerMetrics)
Use the ConsumerMetrics class for standard consumer dashboard sections.
Step 1: Import and Configure
from axon_helpers.rms_helpers import ConsumerMetrics, Query
from axon_helpers.utils import flatten
# Filter tags for internal vs customer
isInternalServiceTag = ', service=~".*-internal.*"'
isCustomerServiceTag = ', service!~".*-internal.*"'
isNotDLQTag = ', is_dlq!="true"'
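Each tag starts with a comma, which suggests it is appended after an existing label matcher inside the selector braces. How `Query` actually assembles its expression is not shown here, so the following is only a sketch of that assumed concatenation; `build_selector` and the `axon_cluster` base matcher are hypothetical:

```python
isInternalServiceTag = ', service=~".*-internal.*"'
isCustomerServiceTag = ', service!~".*-internal.*"'
isNotDLQTag = ', is_dlq!="true"'


def build_selector(metric, base_matcher, additional_expressions=""):
    """Join a metric, one base matcher, and comma-prefixed tags (assumed layout)."""
    return f"{metric}{{{base_matcher}{additional_expressions}}}"


internal = build_selector(
    "rms_taskmetadatasvc_task_event_consumer_read_latency",
    'axon_cluster=~"$axon_cluster"',
    isNotDLQTag + isInternalServiceTag,
)
print(internal)
```

Printing the assembled selector like this is a quick way to spot a missing or doubled comma before the expression reaches Grafana.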
Step 2: Create Consumer Section
# For Task Metadata Consumer (Internal)
internal_consumer_graphs = flatten(
    ConsumerMetrics(
        service_name="Task Metadata Consumer (Internal)",
        container_name="taskmetadatasvc-task-event-consumer-internal",
        consumer_group=".*taskmetadatasvc-task-event-consumer-internal",
        fetch_latency_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_read_latency",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
        process_latency_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_sync_latency",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
        dlq_submitted_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_dlq_submitted_count",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
        message_volume_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_processed_count",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
    ).generate_consumer_graphs()
)
Step 3: Add to Dashboard Rows
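# customer_consumer_graphs is built the same way as internal_consumer_graphs,
# using the customer container, consumer group, and isCustomerServiceTag
rows = {
    "Health Overview": health_panels + [timeline],
    "TASK Events: Task Metadata Consumer (Internal)": internal_consumer_graphs,
    "TASK Events: Task Metadata Consumer (Customer)": customer_consumer_graphs,
}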
Validation:
- Internal and Customer are separate rows (not combined)
- Proper filter tags applied (isInternalServiceTag, isCustomerServiceTag)
- DLQ metrics excluded from main consumer (is_dlq!="true")
Workflow 3: Alert-Linked Dashboard
Link dashboards to alerts for directed browsing (Grafana best practice).
Step 1: Define Service-Specific Alert Pattern
import grafanalib.core as G
# CRITICAL: Use narrow patterns, NOT generic like ".*[Ll]ag.*"
# Generic patterns match 90+ alerts and create solid annotation blocks
SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)
Step 2: Create Alert Annotations
alert_annotations = G.Annotations(
    list=[
        {
            "builtIn": 0,
            "datasource": {"type": "prometheus", "uid": "${DataSource}"},
            "enable": True,
            "expr": f'ALERTS{{alertname=~"{SERVICE_ALERT_PATTERN}", alertstate="firing", axon_cluster=~"$axon_cluster"}}',
            "hide": False,
            "iconColor": "rgba(255, 120, 50, 0.25)",  # Orange with 25% opacity
            "name": "Service Alerts",
            "titleFormat": "{{alertname}}",
            "useValueForTime": False,
        },
    ]
)
Step 3: Apply to Dashboard
dashboard = GenDashboard(
    title="RMS TRON Consumer Dashboard",
    # ... other config
    annotations=alert_annotations,
)
Validation:
- Alert pattern is service-specific (not generic)
- Includes axon_cluster=~"$axon_cluster" for environment filtering
- Icon color has transparency (25% opacity for subtle background)
- Alert annotations appear on relevant panels only
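The `expr` in Step 2 relies on f-string brace escaping: `{{` and `}}` render as literal braces, while `{SERVICE_ALERT_PATTERN}` is substituted. A minimal sketch that isolates the string building so the rendered query can be inspected on its own; `alerts_expr` is a hypothetical helper, not part of axon_helpers:

```python
SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)


def alerts_expr(pattern, cluster_var="$axon_cluster"):
    """Build the annotation query for firing alerts matching `pattern`."""
    return (
        f'ALERTS{{alertname=~"{pattern}", '
        f'alertstate="firing", axon_cluster=~"{cluster_var}"}}'
    )


expr = alerts_expr(SERVICE_ALERT_PATTERN)
print(expr)
```

The printed query should contain single braces only. A stray `{{` in the output means the string was not an f-string, which is also why the plain `"{{alertname}}"` title format is left as a regular string.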
Panel Selection Guide
| Metric Type | Panel | Best Practice Source |
|---|---|---|
| Current health/status | AxonSingleStat | KubeCon: instant triage |
| Latency (p50/p90/p99) | AxonGraph (lines) | RED Method: Duration |
| Message volume/rate | AxonGraph (bars) | RED Method: Rate |
| Error/fault counts | AxonGraph (bars, stacked) | USE Method: Errors |
| Kafka consumer lag | AxonGraph (lines) | USE Method: Saturation |
| Top-K operations | AxonBarGauge | Grafana best practices |
| HPA replica status | AxonGraph (lines) | KubeCon: normalization |
See PATTERNS.md for complete code examples for each pattern.
Template Variables
Always include these variables:
from axon_helpers.rms_helpers import generate_dashboard_template_values
# Standard RMS template variables
templating = generate_dashboard_template_values(
    additional_templates_list=[
        deployment_type_template,  # Internal vs Customer
        consumer_group_template,  # Kafka consumer group filter
    ]
)
Standard variables provided by generate_dashboard_template_values():
- $DataSource - Prometheus/Cortex datasource
- $axon_cluster - Cluster/environment selector
See REFERENCE.md for custom template variable patterns.
Alert Setup
CRITICAL: Use service-specific patterns, NOT generic patterns
# WRONG - matches 90+ alerts, creates solid annotation blocks
ALERT_PATTERN = ".*[Ll]ag.*"
# CORRECT - matches only TRON service alerts
SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)
See ALERTS.md for complete alert annotation patterns and threshold integration.
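One way to sanity-check a pattern before deploying it is to run it against a list of alert names. Prometheus anchors regex matchers at both ends, so `re.fullmatch` is the closer analogue in Python. Prometheus uses RE2, whose syntax agrees with Python's for patterns this simple. The alert names below are hypothetical and only illustrate the difference between the two patterns:

```python
import re

SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)
GENERIC_PATTERN = ".*[Ll]ag.*"

# Hypothetical alert names, for illustration only.
alert_names = [
    "RMSTaskMetadataSvcConsumerLagHigh",
    "RMSRQRuleManagerDLQRateHigh",
    "KafkaBrokerReplicaLagHigh",
    "PostgresReplicationLagWarning",
]


def matching(pattern, names):
    """Return the names fully matched by `pattern`."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


service_hits = matching(SERVICE_ALERT_PATTERN, alert_names)
generic_hits = matching(GENERIC_PATTERN, alert_names)
print("service-specific:", service_hits)
print("generic:", generic_hits)
```

In this sample the service-specific pattern selects only the two RMS alerts, while the generic pattern also picks up lag alerts from unrelated systems and misses the DLQ alert entirely.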
Row Organization (Best Practice)
rows = {
    # TIER 1: Health Overview - NOT collapsed (on-call triage)
    "Health Overview": health_panels + [timeline],
    # TIER 2: Infrastructure - Collapsed by default
    "Overview: Kafka Infrastructure": kafka_panels,
    "Overview: Consumer Pod Replicas": replica_panels,
    # TIER 3: Service-specific - Collapsed by default
    "TASK Events: Rules Manager Consumers": task_event_panels,
    "TASK Events: Task Metadata Consumer (Internal)": internal_panels,
    "TASK Events: Task Metadata Consumer (Customer)": customer_panels,
    # TIER 4: Advanced/DLQ - Collapsed by default
    "DLQ Processors: All Event Types": dlq_panels,
}
dashboard = GenDashboard(
    rows=rows,
    rows_to_collapse_by_title={
        "Overview: Kafka Infrastructure",
        "Overview: Consumer Pod Replicas",
        "TASK Events: Rules Manager Consumers",
        "TASK Events: Task Metadata Consumer (Internal)",
        "TASK Events: Task Metadata Consumer (Customer)",
        "DLQ Processors: All Event Types",
    },
)
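Rows are keyed by title and the collapse set refers to them by title as well, so a misspelled title in the set would presumably leave that row expanded without any error. A minimal sketch of a pre-flight check, assuming that behavior; `check_row_collapse` is a hypothetical helper and the row contents are placeholders:

```python
def check_row_collapse(rows, collapsed, always_open=("Health Overview",)):
    """Return a list of problems with a row/collapse configuration."""
    problems = []
    for title in sorted(collapsed):
        if title not in rows:
            problems.append(f"collapsed title not in rows: {title}")
    for title in always_open:
        if title in collapsed:
            problems.append(f"triage row must stay expanded: {title}")
    # dicts preserve insertion order, so the first key is the first row
    if rows and next(iter(rows)) not in always_open:
        problems.append("first row is not a triage row")
    return problems


rows = {
    "Health Overview": [],
    "Overview: Kafka Infrastructure": [],
    "DLQ Processors: All Event Types": [],
}
# The second title is deliberately misspelled to show what the check reports.
collapsed = {"Overview: Kafka Infrastructure", "DLQ Processors: All Events"}
problems = check_row_collapse(rows, collapsed)
print(problems)
```

Running a check like this before `GenDashboard(...)` keeps the tier layout honest: the health row comes first and stays open, and every collapsed title names a real row.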
Common Issues
Issue: Kafka metrics don't filter by $axon_cluster
Cause: Kafka exporter metrics don't have axon_cluster label
Solution: Use explicit consumer group patterns instead:
# Instead of axon_cluster, use explicit group pattern
kafka_lag_expr = 'kafka_consumergroup_group_topic_sum_lag{group=~".*taskmetadatasvc-task-event-consumer.*"}'
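The same idea extends to a per-group breakdown. A minimal sketch of building the lag query from a consumer group pattern and an optional topic, without any `axon_cluster` matcher; `kafka_lag_by_group` is a hypothetical helper:

```python
def kafka_lag_by_group(group_pattern, topic=None):
    """Build a per-group lag query filtered by consumer group, not cluster."""
    matchers = [f'group=~"{group_pattern}"']
    if topic is not None:
        matchers.insert(0, f'topic="{topic}"')
    selector = ",".join(matchers)
    return f"sum by (group) (kafka_consumergroup_group_topic_sum_lag{{{selector}}})"


expr = kafka_lag_by_group(
    ".*taskmetadatasvc-task-event-consumer.*", topic="task-events-public"
)
print(expr)
```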
Issue: Alert annotations create solid blocks
Cause: Generic alert pattern matches too many alerts
Solution: Use service-specific pattern (see ALERTS.md)
Issue: Internal and customer metrics overlap
Cause: Missing deployment type filter
Solution: Add service=~"$deployment_type" filter and split into separate rows
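To see what the filter selects, the three `deployment_type` values from the Quick Start template can be tried against sample label values. The `service` values below are hypothetical, and `re.fullmatch` is used because Prometheus anchors regex matchers at both ends:

```python
import re

DEPLOYMENT_TYPES = {
    "All": ".*",
    "Internal": ".*-internal.*",
    "Customer": ".*-customer.*",
}

# Hypothetical values of the `service` label.
services = [
    "taskmetadatasvc-task-event-consumer-internal",
    "taskmetadatasvc-task-event-consumer-customer",
    "rq-rules-manager",
]


def services_for(option):
    """Services selected by service=~"$deployment_type" for one option."""
    pattern = re.compile(DEPLOYMENT_TYPES[option])
    return [s for s in services if pattern.fullmatch(s)]


for option in DEPLOYMENT_TYPES:
    print(option, services_for(option))
```

A service without either suffix shows up only under All here, whereas the negative matcher `service!~".*-internal.*"` from Workflow 2 would count it as customer traffic. It is worth deciding which of the two behaviors a given row should have.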
Resources
- REFERENCE.md - API reference for axon_helpers classes
- PATTERNS.md - Dashboard patterns with code examples
- ALERTS.md - Alert annotation patterns (PRIMARY)
- METRICS.md - RMS metric naming conventions
- EXAMPLES.md - Complete dashboard examples
Quick Reference
# Dashboard file location
/Users/mriley/projects/ops/grafana-telemetry/dashboards/default/services/rms/
# Generate dashboard
cd /Users/mriley/projects/ops/grafana-telemetry
make rms.rms_tron_consumers_v2.dashboard
# Example dashboard to reference
rms.rms_tron_consumers_v2.dashboard.py
Test Scenarios
Use these scenarios to validate skill invocation and output quality.
Scenario 1: Create Health Overview
Input: "Create a health overview for task metadata consumer"
Expected Output:
- Uses health_stat() helper function pattern
- 5-6 stat panels (Health, Latency, Lag, DLQ, Throughput, Pods)
- Timeline graph with threshold lines
- Thresholds use SLO constants (e.g., CUSTOMER_LAG_CRITICAL = 5000)
- Row is NOT collapsed (health always visible)
Validation:
# Should see patterns like:
health_stat(title="Consumer Health", ...)
health_stat(title="P99 Latency", ...)
AxonGraph(title="Health Timeline", thresholds=[...])
Scenario 2: Add Alert Annotations
Input: "Add alert annotations to my TRON dashboard"
Expected Output:
- Uses SERVICE_ALERT_PATTERN constant (service-specific)
- Pattern matches: RMS.*TaskMetadataSvc.*|RMS.*RQ.*Rule.*Manager.*
- Includes axon_cluster=~"$axon_cluster" filter
- Icon color has transparency (rgba(255, 120, 50, 0.25))
Anti-patterns (should NOT see):
- ❌ Generic patterns like .*[Ll]ag.* or .*[Cc]onsumer.*
- ❌ Missing axon_cluster filter
- ❌ Solid colors without transparency
Scenario 3: Create Consumer Section
Input: "Add task metadata consumer metrics to dashboard"
Expected Output:
- Uses ConsumerMetrics class from axon_helpers.rms_helpers
- Separates internal and customer into different rows
- Applies filter tags (isNotDLQTag, isInternalServiceTag, isCustomerServiceTag)
- Uses flatten() wrapper for panel lists
- Includes latency (p50/p90/p99), volume, and DLQ panels
Validation:
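# Should see patterns like (arguments as in Workflow 2):
from axon_helpers.rms_helpers import ConsumerMetrics, Query
internal_metrics = ConsumerMetrics(
    service_name="Task Metadata Consumer (Internal)",
    fetch_latency_query=Query(
        "rms_taskmetadatasvc_task_event_consumer_read_latency",
        additional_expressions=isNotDLQTag + isInternalServiceTag,
    ),
    # ... remaining arguments as in Workflow 2
)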
Scenario 4: Kafka Lag Graph
Input: "Add Kafka lag graph for TRON consumers"
Expected Output:
- Uses kafka_consumergroup_group_topic_sum_lag metric
- Does NOT filter by axon_cluster (Kafka metrics don't have this label)
- Filters by explicit consumer group pattern instead
- Groups by group label for breakdown
Validation:
# Should see pattern like:
sum by (group) (
    kafka_consumergroup_group_topic_sum_lag{
        group=~".*taskmetadatasvc-task-event-consumer.*"
    }
)
Anti-pattern (should NOT see):
- ❌ axon_cluster=~"$axon_cluster" on Kafka metrics (will return no data)
