# Prometheus Grafana

by a5c-ai

Expert skill for Prometheus metrics and Grafana dashboards. Write and validate PromQL queries, generate Grafana dashboard JSON, create alerting and recording rules, analyze metric cardinality, and debug scrape configurations.
```yaml
---
name: prometheus-grafana
description: Expert skill for Prometheus metrics and Grafana dashboards. Write and validate PromQL queries, generate Grafana dashboard JSON, create alerting and recording rules, analyze metric cardinality, and debug scrape configurations.
allowed-tools: Bash(*) Read Write Edit Glob Grep WebFetch
metadata:
  author: babysitter-sdk
  version: "1.0.0"
  category: observability
  backlog-id: SK-003
---
```

# prometheus-grafana
You are prometheus-grafana - a specialized skill for Prometheus metrics and Grafana dashboards. This skill provides expert capabilities for building and maintaining observability infrastructure.
## Overview
This skill enables AI-powered observability operations including:
- Writing and validating PromQL queries
- Generating Grafana dashboard JSON configurations
- Creating alerting rules and recording rules
- Analyzing metric cardinality and performance
- Debugging scrape configurations
- Interpreting metric patterns and anomalies
## Prerequisites
- Prometheus server access
- Grafana instance with API access
- Optional: Alertmanager for alerting
- Optional: Thanos/Cortex for long-term storage
## Capabilities

### 1. PromQL Query Writing

Write and optimize PromQL queries:

```promql
# Request rate
rate(http_requests_total{job="api"}[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# Availability (SLI)
sum(rate(http_requests_total{status!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d])) * 100

# Resource saturation
avg(rate(container_cpu_usage_seconds_total[5m]))
  / avg(kube_pod_container_resource_limits{resource="cpu"}) * 100
```
### 2. Recording Rules

Create recording rules to pre-compute expensive queries:

```yaml
groups:
  - name: api_metrics
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      - record: job:http_error_ratio:rate5m
        expr: |
          job:http_errors:rate5m / job:http_requests:rate5m
  - name: slo_metrics
    interval: 1m
    rules:
      - record: slo:availability:ratio_30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
            / sum(rate(http_requests_total[30d]))
```
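The record names above follow Prometheus's `level:metric:operations` naming convention (aggregation level, metric name, operations applied). A minimal sketch to sanity-check rule names against that convention; the `is_valid_record_name` helper is illustrative, not part of any Prometheus tooling:

```python
import re

# level:metric:operations, e.g. job:http_requests:rate5m
RECORD_NAME_RE = re.compile(
    r"^[a-zA-Z_][a-zA-Z0-9_]*"   # level: labels preserved by the aggregation
    r":[a-zA-Z_][a-zA-Z0-9_]*"   # metric name
    r":[a-zA-Z0-9_]+$"           # operations, e.g. rate5m, ratio_30d
)

def is_valid_record_name(name: str) -> bool:
    """Return True if a recording rule name matches level:metric:operations."""
    return bool(RECORD_NAME_RE.match(name))

for name in ["job:http_requests:rate5m", "slo:availability:ratio_30d", "http_requests_total"]:
    print(name, is_valid_record_name(name))
```

Running this flags `http_requests_total` as invalid for a *recorded* series name, which is the point: a recorded series should be distinguishable at a glance from a raw scraped metric.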
### 3. Alerting Rules

Create comprehensive alerting rules:

```yaml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          job:http_error_ratio:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "{{ $labels.job }} has error rate of {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is unreachable"
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency"
          description: "P99 latency for {{ $labels.service }} is {{ $value }}s"
```
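Alert expressions can be unit-tested with `promtool test rules` before they reach production. A sketch of a test file for the `ServiceDown` alert above (the file names are assumptions):

```yaml
# service_alerts_test.yml -- run with: promtool test rules service_alerts_test.yml
rule_files:
  - service_alerts.yml          # assumed file holding the alerting rules above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api", instance="api-0:9090"}'
        values: '0 0 0'         # target down for three evaluations
    alert_rule_test:
      - eval_time: 2m           # past the 1m "for" duration, so the alert fires
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: api-0:9090
            exp_annotations:
              summary: "Service is down"
              description: "api-0:9090 is unreachable"
```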
### 4. Grafana Dashboard Generation

Generate Grafana dashboard JSON:

```json
{
  "dashboard": {
    "title": "Service Overview",
    "uid": "service-overview",
    "tags": ["production", "api"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api\"}[5m])) by (status)",
            "legendFormat": "{{ status }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps"
          }
        }
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            }
          }
        }
      }
    ]
  }
}
```
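Dashboard JSON like this can also be generated programmatically, which keeps panel layout consistent across dashboards. A minimal sketch using plain dicts (no Grafana client library; the helper names are illustrative):

```python
import json

def timeseries_panel(title: str, expr: str, unit: str, index: int) -> dict:
    """Build a timeseries panel, laid out two per row on Grafana's 24-column grid."""
    return {
        "title": title,
        "type": "timeseries",
        # 12 columns wide, 8 grid rows tall; x alternates 0/12, y advances per row
        "gridPos": {"h": 8, "w": 12, "x": (index % 2) * 12, "y": (index // 2) * 8},
        "targets": [{"expr": expr}],
        "fieldConfig": {"defaults": {"unit": unit}},
    }

def build_dashboard(title: str, uid: str, panels: list) -> dict:
    return {
        "dashboard": {
            "title": title,
            "uid": uid,
            "refresh": "30s",
            "time": {"from": "now-6h", "to": "now"},
            "panels": panels,
        }
    }

panels = [
    timeseries_panel("Request Rate",
                     'sum(rate(http_requests_total{job="api"}[5m])) by (status)',
                     "reqps", 0),
    timeseries_panel("P99 Latency",
                     'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
                     "s", 1),
]
dashboard = build_dashboard("Service Overview", "service-overview", panels)
print(json.dumps(dashboard, indent=2))
```

The resulting object can be POSTed to Grafana's `/api/dashboards/db` endpoint or written to `dashboard.json` for provisioning.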
### 5. Scrape Configuration

Debug and generate scrape configurations:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```
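These relabel rules key off per-pod annotations, so a pod must carry them to be scraped. For example (path and port values are illustrative):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"          # matched by the keep rule
    prometheus.io/path: "/custom/metrics" # overrides __metrics_path__ (default /metrics)
    prometheus.io/port: "8080"            # rewritten into __address__
```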
### 6. Metric Cardinality Analysis

Analyze and optimize metric cardinality:

```promql
# Top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))

# Distinct values of a label on a metric (substitute real names)
count(count by (label_name) (metric_name))

# Ratio of active head series to head chunks (TSDB memory pressure signal)
prometheus_tsdb_head_series / prometheus_tsdb_head_chunks
```
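Cardinality can also be inspected via Prometheus's TSDB stats endpoint, `GET /api/v1/status/tsdb`, which reports the top metrics by series count. A sketch of parsing its response, using a hard-coded sample in place of a live API call (the sample below is trimmed; a real response carries more fields, such as `seriesCountByLabelValuePair`):

```python
import json

# Trimmed sample of a GET /api/v1/status/tsdb response
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "headStats": {"numSeries": 120000},
    "seriesCountByMetricName": [
      {"name": "http_request_duration_seconds_bucket", "value": 45000},
      {"name": "http_requests_total", "value": 12000}
    ]
  }
}
""")

def top_metrics(resp: dict, limit: int = 10) -> list:
    """Return (metric, series_count) pairs, highest cardinality first."""
    entries = resp["data"]["seriesCountByMetricName"]
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:limit]]

for name, count in top_metrics(sample_response):
    print(f"{name}: {count} series")
```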
## MCP Server Integration

This skill can leverage the following MCP servers:
| Server | Description | Installation |
|---|---|---|
| mcp-grafana (Grafana Labs) | Official Grafana MCP server | GitHub |
| loki-mcp (Grafana) | Loki log integration | GitHub |
## Best Practices

### PromQL

- **Use recording rules** - Pre-compute expensive queries
- **Limit cardinality** - Avoid unbounded labels
- **Use appropriate ranges** - Range selectors should span at least a few scrape intervals
- **Prefer rate() over increase()** - More accurate for graphs
### Alerting

- **Multi-window alerting** - Combine short and long windows
- **Clear runbook links** - Include them in annotations
- **Appropriate severity** - Match severity to business impact
- **Avoid alert fatigue** - Alert on symptoms, not causes
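The multi-window pattern can be sketched as a burn-rate alert on an availability SLO: a short window catches the spike quickly, and a long window confirms it is sustained before paging. The 14.4x factor and 0.1% error budget follow the common SRE-workbook convention for a 99.9% target; the exact numbers are illustrative:

```yaml
groups:
  - name: slo_burn_rate
    rules:
      - alert: ErrorBudgetBurnFast
        # Both the 5m and 1h windows must exceed the burn-rate threshold,
        # so brief blips do not page anyone.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
        labels:
          severity: critical
        annotations:
          summary: "Fast error-budget burn"
```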
### Dashboards

- **USE method** - Utilization, Saturation, Errors (for resources)
- **RED method** - Rate, Errors, Duration (for services)
- **Consistent layout** - Follow established dashboard patterns
- **Variable templates** - Enable filtering
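A template variable can be sketched as a `templating` entry in the dashboard JSON; `label_values()` is a Grafana helper for the Prometheus data source, and the variable is then referenced as `$service` in panel queries (the field values shown are illustrative):

```json
"templating": {
  "list": [
    {
      "name": "service",
      "label": "Service",
      "type": "query",
      "query": "label_values(http_requests_total, service)",
      "refresh": 2,
      "includeAll": true
    }
  ]
}
```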
## Process Integration

This skill integrates with the following processes:

- `monitoring-setup.js` - Initial Prometheus/Grafana setup
- `slo-sli-tracking.js` - SLO/SLI dashboard creation
- `error-budget-management.js` - Error budget dashboards
## Output Format

When executing operations, provide structured output:

```json
{
  "operation": "create-dashboard",
  "status": "success",
  "dashboard": {
    "uid": "service-overview",
    "url": "https://grafana.example.com/d/service-overview"
  },
  "validation": {
    "queries": "valid",
    "panels": 8,
    "warnings": []
  },
  "artifacts": ["dashboard.json"]
}
```
## Error Handling

### Common Issues

| Error | Cause | Resolution |
|---|---|---|
| No data | Metric not scraped | Check scrape config and targets |
| Many-to-many matching | Ambiguous join | Use `on()` or `ignoring()` |
| Query timeout | Complex query | Use recording rules |
| Cardinality explosion | Unbounded labels | Add label constraints |
## Constraints
- Validate PromQL syntax before applying
- Test alerts in non-production first
- Consider cardinality impact of new metrics
- Use appropriate retention settings
