Tron Dashboard Creating

by meriley

Create production-ready Grafana dashboards for TRON team services including consumer dashboards, task metadata, RQ rules manager, and Kafka metrics. Use when building dashboards for task event consumers or RMS TRON services.


name: tron-dashboard-creating
description: Create production-ready Grafana dashboards for TRON team services including consumer dashboards, task metadata, RQ rules manager, and Kafka metrics. Use when building dashboards for task event consumers or RMS TRON services.
version: 1.0.0

TRON Dashboard Creation

Purpose

Create on-call-friendly Grafana dashboards for TRON team services using grafanalib and axon_helpers. These dashboards implement Grafana best practices, including health overview panels, tiered information design, and service-specific alert annotations.

When NOT to Use

  • Generic API dashboards without Kafka consumers (use standard rms_helpers patterns)
  • Non-RMS services (use cd/ or common/ helpers)
  • Simple metric additions to existing dashboards

Quick Start: Minimal TRON Consumer Dashboard

from grafanalib.core import Template, Threshold
from axon_helpers.graph_helpers import AxonGraph, AxonSingleStat, GenDashboard, UNITS
from axon_helpers.rms_helpers import generate_dashboard_template_values

# Template variable for filtering by deployment type
deployment_type_template = Template(
    name="deployment_type",
    label="Deployment Type",
    type="custom",
    default="All",
    includeAll=False,
    query="All : .*,Internal : .*-internal.*,Customer : .*-customer.*",
    options=[
        {"selected": True, "text": "All", "value": ".*"},
        {"selected": False, "text": "Internal", "value": ".*-internal.*"},
        {"selected": False, "text": "Customer", "value": ".*-customer.*"},
    ],
)
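The query string pairs each option label with a regex value using a `Label : value,Label : value` convention, mirroring the options list below it. A minimal sketch of that unpacking in plain Python (parse_custom_query is a hypothetical helper, not part of axon_helpers or grafanalib):

```python
def parse_custom_query(query):
    """Split a Grafana custom-variable query string of the form
    'Label : value,Label2 : value2' into option dicts."""
    options = []
    for pair in query.split(","):
        label, _, value = pair.partition(" : ")
        options.append({"text": label.strip(), "value": value.strip()})
    return options

opts = parse_custom_query("All : .*,Internal : .*-internal.*,Customer : .*-customer.*")
```

Each parsed option lines up with one entry in the Template's options list above.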

rows = {
    "Health Overview": [
        AxonSingleStat(
            title="Consumer Lag",
            expressions=[{"expr": 'sum(kafka_consumergroup_group_topic_sum_lag{topic="task-events-public",group=~".*taskmetadatasvc-task-event-consumer.*"})'}],
            thresholds=[
                Threshold("green", 0, 0.0),
                Threshold("orange", 1, 2500.0),
                Threshold("red", 2, 5000.0),
            ],
            reduceCalc="lastNotNull",
            graphMode="area",
            format="short",
            span=4,
        ),
    ],
}

dashboard = GenDashboard(
    title="My TRON Consumer Dashboard",
    uid="my-tron-consumer-dashboard",
    templating=generate_dashboard_template_values(
        additional_templates_list=[deployment_type_template]
    ),
    rows=rows,
)

Core Workflows

Workflow 1: Health Overview Dashboard (On-Call Triage)

Create a health overview row for instant on-call triage. This follows the KubeCon "Foolproof K8s Dashboards for Sleep-Deprived On-Calls" pattern.

Step 1: Define SLO Thresholds

# Define SLO constants at top of file (not hardcoded in panels)
CUSTOMER_LATENCY_WARNING_MS = 120000   # 2 minutes
CUSTOMER_LATENCY_CRITICAL_MS = 300000  # 5 minutes
CUSTOMER_LAG_WARNING = 2500
CUSTOMER_LAG_CRITICAL = 5000
DLQ_WARNING = 5
DLQ_CRITICAL = 10
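Since every warning/critical pair produces the same green-to-orange-to-red shape, the constants can be turned into threshold triples mechanically. A sketch (slo_thresholds is a hypothetical helper; the tuples follow the `Threshold(color, index, value)` argument order used throughout this skill):

```python
CUSTOMER_LAG_WARNING = 2500
CUSTOMER_LAG_CRITICAL = 5000

def slo_thresholds(warning, critical):
    """Build the standard green/orange/red triple from an SLO pair.

    Returns (color, index, value) tuples in the order grafanalib's
    Threshold(color, index, value) expects."""
    return [
        ("green", 0, 0.0),
        ("orange", 1, float(warning)),
        ("red", 2, float(critical)),
    ]

lag_thresholds = slo_thresholds(CUSTOMER_LAG_WARNING, CUSTOMER_LAG_CRITICAL)
```

Keeping the SLO numbers in one place means a changed SLO updates every panel that derives from it.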

Step 2: Create Health Stat Panels

from grafanalib.core import Threshold

def health_stat(title, description, expr, unit, thresholds, span=2):
    """Create a stat panel showing current metric value with thresholds."""
    return AxonSingleStat(
        title=title,
        description=description,
        expressions=[{"expr": expr, "legendFormat": title}],
        thresholds=thresholds,
        reduceCalc="lastNotNull",
        graphMode="area",
        format=unit,
        decimals=1,
        span=span,
    )

# Health Overview panels (6 stat panels + timeline)
health_panels = [
    health_stat(
        title="Overall Health",
        description="Health score 0-100% based on Kafka lag",
        expr='(1 - clamp_max(sum(kafka_consumergroup_group_topic_sum_lag{...}) / 5000, 1)) * 100',
        unit="percent",
        thresholds=[
            Threshold("red", 0, 0.0),
            Threshold("orange", 1, 50.0),
            Threshold("green", 2, 90.0),
        ],
    ),
    health_stat(
        title="Max Latency",
        description="Max consumer latency in minutes",
        expr='max(rms_taskmetadatasvc_task_event_consumer_read_latency{...}) / 60000',
        unit="m",
        thresholds=[
            Threshold("green", 0, 0.0),
            Threshold("orange", 1, 2.0),
            Threshold("red", 2, 5.0),
        ],
    ),
    # ... Consumer Lag, DLQ Rate, Throughput panels
]
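The "Overall Health" score above inverts and clamps lag against the critical SLO: 100% at zero lag, 0% at or beyond the critical threshold. A sketch of building that PromQL string from the constants (the selector argument is an assumption standing in for the elided `{...}` above):

```python
def health_score_expr(lag_metric, selector, critical_lag):
    """PromQL health score: 100% when lag is 0, 0% at or past critical."""
    return f"(1 - clamp_max(sum({lag_metric}{{{selector}}}) / {critical_lag}, 1)) * 100"

expr = health_score_expr(
    "kafka_consumergroup_group_topic_sum_lag",
    'topic="task-events-public"',
    5000,
)
```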

Step 3: Add Timeline Graph

timeline = AxonGraph(
    title="Kafka Lag Timeline",
    expressions=[{
        "expr": 'sum(kafka_consumergroup_group_topic_sum_lag{topic="task-events-public",...})',
        "legendFormat": "Total Consumer Lag",
    }],
    thresholds=[
        Threshold("green", 0, 0.0),
        Threshold("orange", 1, float(CUSTOMER_LAG_WARNING)),
        Threshold("red", 2, float(CUSTOMER_LAG_CRITICAL)),
    ],
    thresholdsStyleMode="line+area",
    span=12,
    unit=UNITS.SHORT,
)

Validation:

  • Health overview is first row (not collapsed)
  • 5-6 stat panels cover: Health, Latency, Lag, DLQ, Throughput
  • Thresholds use SLO constants (not hardcoded values)
  • Timeline shows historical context

Workflow 2: Consumer Dashboard (Using ConsumerMetrics)

Use the ConsumerMetrics class for standard consumer dashboard sections.

Step 1: Import and Configure

from axon_helpers.rms_helpers import ConsumerMetrics, Query
from axon_helpers.utils import flatten

# Filter tags for internal vs customer
isInternalServiceTag = ', service=~".*-internal.*"'
isCustomerServiceTag = ', service!~".*-internal.*"'
isNotDLQTag = ', is_dlq!="true"'
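These tags are raw PromQL label matchers (note the leading `, `) that Query appends inside the metric's selector. A sketch of the concatenation in plain Python (the composition behavior is inferred from the examples below, not from axon_helpers source; build_selector is hypothetical):

```python
isInternalServiceTag = ', service=~".*-internal.*"'
isNotDLQTag = ', is_dlq!="true"'

def build_selector(metric, base_matcher, additional_expressions=""):
    """Compose a PromQL selector from a base matcher plus filter tags."""
    return f"{metric}{{{base_matcher}{additional_expressions}}}"

expr = build_selector(
    "rms_taskmetadatasvc_task_event_consumer_read_latency",
    'axon_cluster=~"$axon_cluster"',
    isNotDLQTag + isInternalServiceTag,
)
```

Because each tag carries its own leading comma, tags concatenate in any combination without producing invalid selectors.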

Step 2: Create Consumer Section

# For Task Metadata Consumer (Internal)
internal_consumer_graphs = flatten(
    ConsumerMetrics(
        service_name="Task Metadata Consumer (Internal)",
        container_name="taskmetadatasvc-task-event-consumer-internal",
        consumer_group=".*taskmetadatasvc-task-event-consumer-internal",
        fetch_latency_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_read_latency",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
        process_latency_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_sync_latency",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
        dlq_submitted_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_dlq_submitted_count",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
        message_volume_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_processed_count",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
    ).generate_consumer_graphs()
)
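`generate_consumer_graphs()` returns nested lists of panels (one sub-list per row of graphs), which is why the call is wrapped in `flatten()`. A minimal stand-in for that helper (assumed behavior; the real one lives in axon_helpers.utils):

```python
def flatten(nested):
    """Flatten one level of nesting: [[a, b], [c]] -> [a, b, c]."""
    return [item for sublist in nested for item in sublist]

panels = flatten([["latency_p50", "latency_p99"], ["volume"], ["dlq"]])
```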

Step 3: Add to Dashboard Rows

rows = {
    "Health Overview": health_panels + [timeline],
    "TASK Events: Task Metadata Consumer (Internal)": internal_consumer_graphs,
    "TASK Events: Task Metadata Consumer (Customer)": customer_consumer_graphs,
}

Validation:

  • Internal and Customer are separate rows (not combined)
  • Proper filter tags applied (isInternalServiceTag, isCustomerServiceTag)
  • DLQ metrics excluded from main consumer (is_dlq!="true")

Workflow 3: Alert-Linked Dashboard

Link dashboards to alerts for directed browsing (Grafana best practice).

Step 1: Define Service-Specific Alert Pattern

import grafanalib.core as G

# CRITICAL: Use narrow patterns, NOT generic like ".*[Ll]ag.*"
# Generic patterns match 90+ alerts and create solid annotation blocks
SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)

Step 2: Create Alert Annotations

alert_annotations = G.Annotations(
    list=[
        {
            "builtIn": 0,
            "datasource": {"type": "prometheus", "uid": "${DataSource}"},
            "enable": True,
            "expr": f'ALERTS{{alertname=~"{SERVICE_ALERT_PATTERN}", alertstate="firing", axon_cluster=~"$axon_cluster"}}',
            "hide": False,
            "iconColor": "rgba(255, 120, 50, 0.25)",  # Orange with 25% opacity
            "name": "Service Alerts",
            "titleFormat": "{{alertname}}",
            "useValueForTime": False,
        },
    ]
)

Step 3: Apply to Dashboard

dashboard = GenDashboard(
    title="RMS TRON Consumer Dashboard",
    # ... other config
    annotations=alert_annotations,
)

Validation:

  • Alert pattern is service-specific (not generic)
  • Includes axon_cluster=~"$axon_cluster" for environment filtering
  • Icon color has transparency (25% opacity for subtle background)
  • Alert annotations appear on relevant panels only

Panel Selection Guide

| Metric Type | Panel | Best Practice Source |
|---|---|---|
| Current health/status | AxonSingleStat | KubeCon: instant triage |
| Latency (p50/p90/p99) | AxonGraph (lines) | RED Method: Duration |
| Message volume/rate | AxonGraph (bars) | RED Method: Rate |
| Error/fault counts | AxonGraph (bars, stacked) | USE Method: Errors |
| Kafka consumer lag | AxonGraph (lines) | USE Method: Saturation |
| Top-K operations | AxonBarGauge | Grafana best practices |
| HPA replica status | AxonGraph (lines) | KubeCon: normalization |

See PATTERNS.md for complete code examples for each pattern.


Template Variables

Always include these variables:

from axon_helpers.rms_helpers import generate_dashboard_template_values

# Standard RMS template variables
templating = generate_dashboard_template_values(
    additional_templates_list=[
        deployment_type_template,  # Internal vs Customer
        consumer_group_template,   # Kafka consumer group filter
    ]
)

Standard variables provided by generate_dashboard_template_values():

  • $DataSource - Prometheus/Cortex datasource
  • $axon_cluster - Cluster/environment selector

See REFERENCE.md for custom template variable patterns.


Alert Setup

CRITICAL: Use service-specific patterns, NOT generic patterns

# WRONG - matches 90+ alerts, creates solid annotation blocks
ALERT_PATTERN = ".*[Ll]ag.*"

# CORRECT - matches only TRON service alerts
SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)
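The difference is easy to verify with Python's `re` module, using fullmatch because Prometheus anchors `alertname=~` regexes at both ends. The alert names below are illustrative:

```python
import re

GENERIC = ".*[Ll]ag.*"
SERVICE = "RMS.*TaskMetadataSvc.*|RMS.*RQ.*Rule.*Manager.*|RMS.*RuleManager.*"

alerts = [
    "RMSTaskMetadataSvcConsumerLagHigh",  # ours
    "BillingServiceReplicationLag",       # unrelated, but the generic pattern matches it
    "RMSRuleManagerDLQGrowth",            # ours
]

generic_hits = [a for a in alerts if re.fullmatch(GENERIC, a)]
service_hits = [a for a in alerts if re.fullmatch(SERVICE, a)]
```

The generic pattern picks up the unrelated billing alert (and misses the DLQ alert entirely); the service-specific pattern matches exactly the TRON alerts.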

See ALERTS.md for complete alert annotation patterns and threshold integration.


Row Organization (Best Practice)

rows = {
    # TIER 1: Health Overview - NOT collapsed (on-call triage)
    "Health Overview": health_panels + [timeline],

    # TIER 2: Infrastructure - Collapsed by default
    "Overview: Kafka Infrastructure": kafka_panels,
    "Overview: Consumer Pod Replicas": replica_panels,

    # TIER 3: Service-specific - Collapsed by default
    "TASK Events: Rules Manager Consumers": task_event_panels,
    "TASK Events: Task Metadata Consumer (Internal)": internal_panels,
    "TASK Events: Task Metadata Consumer (Customer)": customer_panels,

    # TIER 4: Advanced/DLQ - Collapsed by default
    "DLQ Processors: All Event Types": dlq_panels,
}

dashboard = GenDashboard(
    rows=rows,
    rows_to_collapse_by_title={
        "Overview: Kafka Infrastructure",
        "Overview: Consumer Pod Replicas",
        "TASK Events: Rules Manager Consumers",
        "DLQ Processors: All Event Types",
    },
)
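A collapse title that does not exactly match a row key silently does nothing, so it is worth checking the two structures against each other. A quick sanity check in plain Python (validate_collapse_config is a hypothetical helper, not part of axon_helpers):

```python
def validate_collapse_config(rows, rows_to_collapse):
    """Return collapse titles that don't name an actual row.

    Also enforce the tiering rule: the triage row stays expanded."""
    assert "Health Overview" not in rows_to_collapse, "keep triage row expanded"
    return sorted(rows_to_collapse - set(rows))

rows = {"Health Overview": [], "Overview: Kafka Infrastructure": []}
missing = validate_collapse_config(rows, {"Overview: Kafka Infrastructure"})
```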

Common Issues

Issue: Kafka metrics don't filter by $axon_cluster

Cause: Kafka exporter metrics don't have the axon_cluster label.
Solution: Use explicit consumer group patterns instead:

# Instead of axon_cluster, use explicit group pattern
kafka_lag_expr = 'kafka_consumergroup_group_topic_sum_lag{group=~".*taskmetadatasvc-task-event-consumer.*"}'
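A small builder makes the rule hard to violate: the expression is assembled from topic and group pattern only, so axon_cluster can never leak in (build_kafka_lag_expr is a hypothetical helper sketching this skill's convention):

```python
def build_kafka_lag_expr(topic, group_pattern):
    """Kafka exporter metrics carry no axon_cluster label, so filter by
    topic plus an explicit consumer-group regex, broken down by group."""
    return (
        "sum by (group) ("
        f'kafka_consumergroup_group_topic_sum_lag{{topic="{topic}", '
        f'group=~"{group_pattern}"}}'
        ")"
    )

expr = build_kafka_lag_expr(
    "task-events-public",
    ".*taskmetadatasvc-task-event-consumer.*",
)
```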

Issue: Alert annotations create solid blocks

Cause: A generic alert pattern matches too many alerts.
Solution: Use a service-specific pattern (see ALERTS.md).

Issue: Internal and customer metrics overlap

Cause: Missing deployment type filter.
Solution: Add a service=~"$deployment_type" filter and split internal and customer into separate rows.


Resources

  • REFERENCE.md - API reference for axon_helpers classes
  • PATTERNS.md - Dashboard patterns with code examples
  • ALERTS.md - Alert annotation patterns (PRIMARY)
  • METRICS.md - RMS metric naming conventions
  • EXAMPLES.md - Complete dashboard examples

Quick Reference

# Dashboard file location
/Users/mriley/projects/ops/grafana-telemetry/dashboards/default/services/rms/

# Generate dashboard
cd /Users/mriley/projects/ops/grafana-telemetry
make rms.rms_tron_consumers_v2.dashboard

# Example dashboard to reference
rms.rms_tron_consumers_v2.dashboard.py

Test Scenarios

Use these scenarios to validate skill invocation and output quality.

Scenario 1: Create Health Overview

Input: "Create a health overview for task metadata consumer"

Expected Output:

  • Uses health_stat() helper function pattern
  • 5-6 stat panels (Health, Latency, Lag, DLQ, Throughput, Pods)
  • Timeline graph with threshold lines
  • Thresholds use SLO constants (e.g., CUSTOMER_LAG_CRITICAL = 5000)
  • Row is NOT collapsed (health always visible)

Validation:

# Should see patterns like:
health_stat(title="Consumer Health", ...)
health_stat(title="P99 Latency", ...)
AxonGraph(title="Health Timeline", thresholds=[...])

Scenario 2: Add Alert Annotations

Input: "Add alert annotations to my TRON dashboard"

Expected Output:

  • Uses SERVICE_ALERT_PATTERN constant (service-specific)
  • Pattern matches: RMS.*TaskMetadataSvc.*|RMS.*RQ.*Rule.*Manager.*
  • Includes axon_cluster=~"$axon_cluster" filter
  • Icon color has transparency (rgba(255, 120, 50, 0.25))

Anti-patterns (should NOT see):

  • ❌ Generic patterns like .*[Ll]ag.* or .*[Cc]onsumer.*
  • ❌ Missing axon_cluster filter
  • ❌ Solid colors without transparency

Scenario 3: Create Consumer Section

Input: "Add task metadata consumer metrics to dashboard"

Expected Output:

  • Uses ConsumerMetrics class from axon_helpers.rms_helpers
  • Separates internal and customer into different rows
  • Applies filter tags (isNotDLQTag, isInternalServiceTag, isCustomerServiceTag)
  • Uses flatten() wrapper for panel lists
  • Includes latency (p50/p90/p99), volume, and DLQ panels

Validation:

# Should see patterns like:
from axon_helpers.rms_helpers import ConsumerMetrics

internal_consumer_graphs = flatten(
    ConsumerMetrics(
        service_name="Task Metadata Consumer (Internal)",
        container_name="taskmetadatasvc-task-event-consumer-internal",
        # ... latency, DLQ, and volume Query objects filtered with
        # isNotDLQTag + isInternalServiceTag
    ).generate_consumer_graphs()
)

Scenario 4: Kafka Lag Graph

Input: "Add Kafka lag graph for TRON consumers"

Expected Output:

  • Uses kafka_consumergroup_group_topic_sum_lag metric
  • Does NOT filter by axon_cluster (Kafka metrics don't have this label)
  • Filters by explicit consumer group pattern instead
  • Groups by group label for breakdown

Validation:

# Should see pattern like:
sum by (group) (
    kafka_consumergroup_group_topic_sum_lag{
        group=~".*taskmetadatasvc-task-event-consumer.*"
    }
)

Anti-pattern (should NOT see):

  • axon_cluster=~"$axon_cluster" on Kafka metrics (will return no data)


Skill Information

Category: Data
Version: 1.0.0
Last Updated: 1/14/2026