Coralogix Analysis
by incidentfox
Coralogix log analysis with DataPrime query language. Use when querying Coralogix logs, metrics, or traces. Provides syntax reference and intelligent investigation scripts.
Authentication
IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for CORALOGIX_API_KEY or other API keys in environment variables - they won't be visible to you. Just run the scripts directly; authentication is handled transparently.
Configuration environment variables you CAN check (non-secret):
- CORALOGIX_DOMAIN - Team hostname (e.g., myteam.app.cx498.coralogix.com)
- CORALOGIX_REGION - Region code (e.g., us2, eu1); fallback if the domain is not set
Region mapping (the scripts auto-detect based on domain):
- US1: *.app.coralogix.us → api.us1.coralogix.com
- US2: *.app.cx498.coralogix.com → api.us2.coralogix.com
- EU1: *.coralogix.com → api.eu1.coralogix.com
- EU2: *.app.eu2.coralogix.com → api.eu2.coralogix.com
- AP1: *.app.coralogix.in → api.ap1.coralogix.com
- AP2: *.app.coralogixsg.com → api.ap2.coralogix.com
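The mapping above can be sketched in Python as follows. This is an illustration only, not the scripts' actual detection code: the function name resolve_api_host, the pattern ordering, and the region fallback format are assumptions. Because the EU1 pattern (*.coralogix.com) also matches US2 and EU2 hostnames, the sketch tries the more specific patterns first.

```python
from fnmatch import fnmatch

# Patterns copied from the region mapping, ordered most specific first so the
# broad EU1 pattern (*.coralogix.com) is only tried last.
REGION_PATTERNS = [
    ("*.app.cx498.coralogix.com", "us2"),
    ("*.app.eu2.coralogix.com", "eu2"),
    ("*.app.coralogix.us", "us1"),
    ("*.app.coralogix.in", "ap1"),
    ("*.app.coralogixsg.com", "ap2"),
    ("*.coralogix.com", "eu1"),
]


def resolve_api_host(domain=None, region=None):
    """Return the API host for a team domain, falling back to a region code."""
    if domain:
        for pattern, code in REGION_PATTERNS:
            if fnmatch(domain.lower(), pattern):
                return f"api.{code}.coralogix.com"
    if region:
        # Assumption: every region code maps to api.<code>.coralogix.com,
        # which holds for the six regions listed above.
        return f"api.{region.lower()}.coralogix.com"
    raise ValueError("set CORALOGIX_DOMAIN or CORALOGIX_REGION")
```

In practice the two arguments would come from the CORALOGIX_DOMAIN and CORALOGIX_REGION environment variables, for example via os.environ.get.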
MANDATORY: Statistics-First Investigation
NEVER dump raw logs. Always follow this pattern:
STATISTICS → SAMPLE → SIGNATURES → CORRELATE
- Statistics First - Know volume, error rate, and top patterns before sampling
- Strategic Sampling - Choose the right strategy based on statistics
- Pattern Extraction - Cluster similar errors to find root causes
- Context Correlation - Investigate around anomaly timestamps
Available Scripts
All scripts are in .claude/skills/observability/coralogix/scripts/
PRIMARY INVESTIGATION SCRIPTS
get_statistics.py - ALWAYS START HERE
Comprehensive statistics with pattern extraction and anomaly detection.
python .claude/skills/observability/coralogix/scripts/get_statistics.py [--service SERVICE] [--app APP] [--time-range MINUTES]
# Examples:
python .claude/skills/observability/coralogix/scripts/get_statistics.py --time-range 60
python .claude/skills/observability/coralogix/scripts/get_statistics.py --service payment --app otel-demo
Output includes:
- Total count, error count, error rate percentage
- Severity distribution
- Top error patterns (crucial for quick triage)
- Time bucket anomalies (spike/drop detection via z-score)
- Top services by log volume
- Actionable recommendation
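The z-score check mentioned in the output list can be illustrated with a short sketch. The threshold of 2.0, the use of the population standard deviation, and the function name find_anomalies are assumptions; get_statistics.py may compute this differently.

```python
from statistics import mean, pstdev


def find_anomalies(bucket_counts, threshold=2.0):
    """Return (index, count, z_score) for buckets whose |z| exceeds threshold."""
    if len(bucket_counts) < 2:
        return []
    mu = mean(bucket_counts)
    sigma = pstdev(bucket_counts)
    if sigma == 0:
        # All buckets are identical, so nothing stands out.
        return []
    anomalies = []
    for i, count in enumerate(bucket_counts):
        z = (count - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((i, count, round(z, 2)))
    return anomalies


# Ten 5-minute buckets with one obvious spike at index 6.
counts = [100, 98, 102, 101, 99, 100, 500, 100, 97, 103]
spikes = find_anomalies(counts)
```

A positive z-score marks a spike and a negative one marks a drop; the bucket's timestamp is what you would pass to the around_anomaly strategy.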
sample_logs.py - Strategic Sampling
Choose the right sampling strategy based on statistics.
python .claude/skills/observability/coralogix/scripts/sample_logs.py --strategy STRATEGY [--service SERVICE] [--app APP]
# Strategies:
# errors_only - Only ERROR/CRITICAL logs (default for incidents)
# around_anomaly - Logs within time window of specific timestamp
# first_last - First N/2 + last N/2 logs (timeline view)
# random - Random sample across time range
# all - All severity levels (use sparingly)
# Examples:
python .claude/skills/observability/coralogix/scripts/sample_logs.py --strategy errors_only --service payment
python .claude/skills/observability/coralogix/scripts/sample_logs.py --strategy around_anomaly --timestamp "2026-01-27T05:00:00Z" --window 60
python .claude/skills/observability/coralogix/scripts/sample_logs.py --strategy first_last --service checkout --limit 50
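As an illustration of the first_last strategy (first N/2 plus last N/2), here is a minimal sketch over an already-fetched, time-ordered list. The function name and the handling of odd limits are assumptions, not the documented behavior of sample_logs.py.

```python
def first_last_sample(logs, limit):
    """Keep the first limit//2 and the last (limit - limit//2) entries."""
    if limit <= 0:
        return []
    if len(logs) <= limit:
        # Nothing to trim: return everything in order.
        return list(logs)
    head = limit // 2
    tail = limit - head  # the tail gets the extra entry when limit is odd
    return list(logs[:head]) + list(logs[-tail:])
```

The point of this shape is to show how an incident started and how it currently looks, while skipping the repetitive middle.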
extract_signatures.py - Pattern Clustering
Normalize and cluster log messages to see unique issue patterns.
python .claude/skills/observability/coralogix/scripts/extract_signatures.py --service SERVICE [--severity SEVERITY] [--max-signatures N]
# Examples:
python .claude/skills/observability/coralogix/scripts/extract_signatures.py --service payment --severity ERROR
python .claude/skills/observability/coralogix/scripts/extract_signatures.py --app otel-demo --max-signatures 30
Normalizes variable parts (UUIDs, IPs, timestamps, numbers) to find:
- Dominant error patterns (> 50% = single root cause likely)
- Diverse errors (many patterns = multiple issues)
- Affected services per pattern
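The normalization step can be sketched as below. The regular expressions, placeholder tokens, and function names are illustrative assumptions; extract_signatures.py may use different rules.

```python
import re
from collections import Counter

# Order matters: replace the most specific shapes first, plain numbers last.
_RULES = [
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"), "<TS>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def signature(message):
    """Replace variable parts of a log message with placeholder tokens."""
    for pattern, token in _RULES:
        message = pattern.sub(token, message)
    return message


def cluster(messages):
    """Return (signature, count, share) tuples, most frequent first."""
    counts = Counter(signature(m) for m in messages)
    total = sum(counts.values())
    return [(sig, n, n / total) for sig, n in counts.most_common()]
```

A top signature with a share above 0.5 corresponds to the "dominant error pattern" case in the list above.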
UTILITY SCRIPTS
list_services.py - Service Discovery
python .claude/skills/observability/coralogix/scripts/list_services.py [--time-range MINUTES]
get_health.py - Quick Health Check
python .claude/skills/observability/coralogix/scripts/get_health.py <service> [--time-range MINUTES]
get_errors.py - Quick Error Fetch
python .claude/skills/observability/coralogix/scripts/get_errors.py <service> [--app APPLICATION] [--time-range MINUTES]
query_logs.py - Raw DataPrime Queries
For custom queries not covered by other scripts.
python .claude/skills/observability/coralogix/scripts/query_logs.py "<dataprime_query>" [--time-range MINUTES] [--limit N]
DataPrime Syntax Quick Reference
Filters
# Equality (use == not =)
$l.subsystemname == 'api-server'
# Severity - use ENUM values (no quotes!)
# Valid: VERBOSE, DEBUG, INFO, WARNING, ERROR, CRITICAL
$m.severity == ERROR
$m.severity == WARNING || $m.severity == ERROR
# Text search (case-insensitive) - use ~~ not 'contains'
$d ~~ 'timeout'
$d ~~ 'connection refused'
# Combine filters with &&
$l.subsystemname == 'payment' && $m.severity == ERROR
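When filters are assembled programmatically, the quoting rules above are easy to get wrong. The helper below is a hypothetical sketch (build_filter is not one of the skill's scripts) that quotes string values, leaves severity enums unquoted, and joins clauses with &&. Wrapping multiple severities in parentheses is an assumption about DataPrime expression grouping.

```python
SEVERITIES = ("VERBOSE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def _quote(value):
    # Escaping rules for quotes inside DataPrime strings are not covered here,
    # so this sketch refuses such values instead of guessing.
    if "'" in value:
        raise ValueError("single quotes are not supported in this sketch")
    return f"'{value}'"


def build_filter(service=None, app=None, severities=(), text=None):
    """Build a DataPrime filter expression from optional criteria."""
    clauses = []
    if app:
        clauses.append(f"$l.applicationname == {_quote(app)}")
    if service:
        clauses.append(f"$l.subsystemname == {_quote(service)}")
    if severities:
        for sev in severities:
            if sev not in SEVERITIES:
                raise ValueError(f"unknown severity: {sev!r}")
        # Severity is an enum, so it is emitted without quotes.
        joined = " || ".join(f"$m.severity == {sev}" for sev in severities)
        clauses.append(f"({joined})" if len(severities) > 1 else joined)
    if text:
        clauses.append(f"$d ~~ {_quote(text)}")
    return " && ".join(clauses)
```

For example, build_filter(service="payment", severities=["ERROR"]) reproduces the combined filter shown above.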
Aggregations
# Count
| aggregate count() as total
# Group by field
| groupby $l.subsystemname aggregate count() as cnt
# Time bucketing
| timebucket 5m aggregate count() as cnt
# Multiple aggregations
| groupby $l.subsystemname aggregate count() as cnt, avg($d.duration) as avg_duration
# Order and limit
| orderby cnt desc | limit 20
Common Fields
- $l.applicationname - Application/environment name (e.g., "otel-demo")
- $l.subsystemname - Service name (e.g., "payment", "checkout")
- $m.severity - Log level enum: VERBOSE, DEBUG, INFO, WARNING, ERROR, CRITICAL
- $m.timestamp - Event timestamp
- $d - Log message/data (use ~~ for text search)
Common Query Patterns
1. List all services with log counts
source logs | groupby $l.subsystemname aggregate count() as cnt | orderby cnt desc | limit 30
2. Error count by service
source logs | filter $m.severity == ERROR | groupby $l.subsystemname aggregate count() as errors | orderby errors desc
3. Error rate over time
source logs | filter $m.severity == ERROR | groupby $m.timestamp / 5m as bucket aggregate count() as errors | orderby bucket asc
4. Errors for specific service
source logs | filter $l.subsystemname == 'payment' | filter $m.severity == ERROR | limit 50
5. Search for specific error message
source logs | filter $d ~~ 'connection refused' | limit 20
Advanced DataPrime Patterns
Bracket Notation for Special Fields
K8s fields often have dots in names. Use bracket notation:
# Wrong - treats as nested path
$d.kubernetes.namespace
# Correct - literal field name with dot
$d['kubernetes.namespace']
$d['resource.attributes.k8s_pod_name']
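If field references are generated in code, the choice between dot and bracket notation can be automated. The helper below is a hypothetical sketch (field_ref is not part of the skill): it uses dot notation for simple identifiers and bracket notation for any key containing other characters, such as a literal dot. Mixing both notations in one path is assumed to be valid.

```python
import re

# A key made only of letters, digits, and underscores is safe for dot notation.
_SIMPLE_KEY = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def field_ref(root, *keys):
    """Build a field reference, bracket-quoting keys that are not plain identifiers."""
    ref = root
    for key in keys:
        if "'" in key:
            raise ValueError("single quotes in keys are not supported in this sketch")
        if _SIMPLE_KEY.match(key):
            ref += f".{key}"
        else:
            ref += f"['{key}']"
    return ref
```

Each argument after the root is treated as one literal key, so field_ref("$d", "kubernetes.namespace") yields the bracket form shown above.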
Time-Based Comparisons
Compare logs before/after a time threshold:
# Count logs in last hour vs older
source logs | countby if($m.timestamp > now() - 1h, 'last_hour', 'older')
# Find logs older than 5 minutes
source logs | filter $m.timestamp < now() - 5m
K8s Container Restarts
Find unstable containers:
source logs
| choose resource.attributes.k8s_container_restart_count:number as restarts,
resource.attributes.k8s_container_name as container,
resource.attributes.k8s_deployment_name as deployment
| filter restarts > 0
| groupby deployment aggregate max(restarts) as max_restarts
| orderby max_restarts desc
Peak Error Window
Find the 10-minute window with most errors:
source logs
| filter $m.severity == ERROR
| groupby $m.timestamp / 10m as bucket aggregate count() as cnt
| orderby cnt desc
| limit 5
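The same bucketing can be mirrored client-side on timestamps that have already been fetched, which is handy for sanity-checking a query result. This is a sketch under assumptions: timestamps are ISO 8601 strings in UTC, and the function name peak_windows is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def peak_windows(timestamps, bucket_minutes=10, top=5):
    """Count events per fixed-size window and return the busiest windows."""
    size = bucket_minutes * 60
    counts = Counter()
    for ts in timestamps:
        # fromisoformat does not accept a trailing "Z" on older Python versions.
        parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        epoch = int(parsed.timestamp())
        counts[epoch - epoch % size] += 1  # floor to the window start
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc).isoformat(), n)
        for start, n in counts.most_common(top)
    ]
```

Each result pairs a window start time with its event count, ordered by count descending like the query above.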
Fuzzy Search All Fields
When you don't know which field contains the value:
# Search all fields for text
source logs | filter $d ~~ 'connection refused'
# Or use wildfind
source logs | wildfind 'timeout'
Anti-Patterns to Avoid
- ❌ NEVER skip statistics - get_statistics.py is the MANDATORY first step
- ❌ Unbounded queries - Always specify time ranges and limits
- ❌ Quoting severity values - Use the enum: ERROR, not 'ERROR'
- ❌ Using 'contains' - Use the ~~ operator for text search
- ❌ Missing application filter - For multi-tenant, filter by $l.applicationname
- ❌ Fetching all logs - Use sampling strategies, not limit 10000
- ❌ Ignoring anomaly timestamps - Use around_anomaly to investigate spikes
- ❌ Reading logs without patterns - Always extract signatures for RCA
- ❌ Dot notation for K8s fields - Use bracket notation: $d['k8s.pod.name']
Investigation Workflow
Standard Incident Investigation
┌─────────────────────────────────────────────────────────────┐
│  1. STATISTICS FIRST (mandatory)                            │
│     python get_statistics.py --service <service>            │
│     → Know volume, error rate, top patterns, anomalies      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
                       Dominant Issue?
                ┌─────────────┴─────────────┐
                │                           │
     YES (>80% one pattern)         NO (mixed errors)
                │                           │
                ▼                           ▼
┌─────────────────────────────┐  ┌───────────────────────────────────────────┐
│  2. FAST PATH               │  │  2. DEEP DIVE                             │
│     Sample errors directly  │  │     python extract_signatures.py          │
│     python sample_logs.py   │  │     python sample_logs.py --strategy ...  │
│     → Verify hypothesis     │  │     → Cluster and analyze patterns        │
└─────────────────────────────┘  └───────────────────────────────────────────┘
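The branch in the diagram can be expressed as a small decision function. This is a sketch: the function name choose_path and the input shape (a mapping of error pattern to count, as might be read from the statistics output) are assumptions; only the 80% threshold comes from the diagram.

```python
def choose_path(pattern_counts, dominant_share=0.8):
    """Pick the investigation path from error-pattern counts."""
    total = sum(pattern_counts.values())
    if total == 0:
        return ("no_errors", None)
    top_pattern, top_count = max(pattern_counts.items(), key=lambda kv: kv[1])
    if top_count / total > dominant_share:
        # One pattern dominates: sample errors directly to verify it.
        return ("fast_path", top_pattern)
    # Mixed errors: cluster signatures before sampling.
    return ("deep_dive", None)
```

A "fast_path" result leads to sample_logs.py with the errors_only strategy, while "deep_dive" leads to extract_signatures.py first.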
Example: Payment Service Investigation
# Step 1: Statistics first - ALWAYS
python .claude/skills/observability/coralogix/scripts/get_statistics.py --service payment --time-range 60
# Output: 15,432 logs, 847 errors (5.5%), top pattern: "Connection timeout to downstream"
# IF dominant pattern found:
# Step 2: Verify with samples
python .claude/skills/observability/coralogix/scripts/sample_logs.py --strategy errors_only --service payment --limit 10
Quick Commands Reference
| Goal | Command |
|---|---|
| Start investigation | get_statistics.py --service X |
| See error variety | extract_signatures.py --service X |
| Sample errors only | sample_logs.py --strategy errors_only --service X |
| Investigate spike | sample_logs.py --strategy around_anomaly --timestamp T |
| Timeline view | sample_logs.py --strategy first_last --service X |
| List all services | list_services.py |
| Custom query | query_logs.py "source logs \| …" |
Trace Investigation
Use traces to understand request flow and latency across services.
When to Use Traces vs Logs
| Use Case | Tool |
|---|---|
| "What errors happened?" | Logs (get_statistics.py) |
| "Why is this request slow?" | Traces (get_slow_spans.py) |
| "Where did the request fail?" | Traces (get_traces.py) |
| "What's the service dependency?" | Traces (operation analysis) |
Trace Scripts
get_traces.py - Find Spans
# Get spans for a service
python .claude/skills/observability/coralogix/scripts/get_traces.py --service checkout --time-range 30
# Get all spans for a trace ID
python .claude/skills/observability/coralogix/scripts/get_traces.py --trace-id abc123def456
# Filter by operation
python .claude/skills/observability/coralogix/scripts/get_traces.py --operation "/api/checkout" --service checkout
get_slow_spans.py - Latency Analysis
# Find spans slower than 500ms
python .claude/skills/observability/coralogix/scripts/get_slow_spans.py --min-duration 500
# Find slow spans in specific service
python .claude/skills/observability/coralogix/scripts/get_slow_spans.py --min-duration 200 --service checkout
# Get latency statistics by service (recommended first step)
python .claude/skills/observability/coralogix/scripts/get_slow_spans.py --stats
DataPrime Spans Syntax
Span queries use source spans, and the field names differ from those used for logs:
# List spans for a service (use serviceName, not $l.subsystemname)
source spans | filter serviceName == 'checkout' | limit 50
# Find slow spans (duration in MICROSECONDS)
source spans | filter duration > 500000 | orderby duration desc | limit 20
# Get all spans for a trace (use top-level traceID)
source spans | filter traceID == 'abc123def456...' | limit 100
# Latency statistics by service
source spans | groupby serviceName aggregate avg(duration) as avg_dur, max(duration) as max_dur | orderby avg_dur desc
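Because the span duration field is in microseconds while thresholds are usually stated in milliseconds, the unit conversion is a common source of mistakes. The helper below is a hypothetical sketch (slow_span_query is not one of the skill's scripts) that builds the slow-span query from a millisecond threshold.

```python
def slow_span_query(min_duration_ms, service=None, limit=20):
    """Build a DataPrime query for spans slower than a threshold given in ms."""
    threshold_us = int(min_duration_ms * 1000)  # span duration is in microseconds
    stages = ["source spans"]
    if service:
        if "'" in service:
            raise ValueError("single quotes are not supported in this sketch")
        stages.append(f"filter serviceName == '{service}'")
    stages.append(f"filter duration > {threshold_us}")
    stages.append("orderby duration desc")
    stages.append(f"limit {limit}")
    return " | ".join(stages)
```

The resulting string could be passed to query_logs.py as a custom query, assuming that script accepts span queries as well as log queries.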
Span Fields Reference (different from logs!)
- operationName - Operation name (e.g., HTTP GET /checkout)
- serviceName - Service name (equivalent to logs' $l.subsystemname)
- applicationName - Application name
- duration - Span duration in microseconds
- traceID - Trace identifier (32-char hex)
- spanID - Span identifier
- parentId - Parent span ID (for trace tree)
- tags - Span metadata (e.g., http.status_code, rpc.method)
- process.tags - Resource attributes (e.g., k8s.pod.name)
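The spanID and parentId fields are enough to rebuild the call tree of a single trace. The sketch below assumes spans arrive as dictionaries keyed by the field names above and that root spans have an empty or missing parentId; the actual output shape of get_traces.py may differ, and the function names are hypothetical.

```python
from collections import defaultdict


def build_trace_tree(spans):
    """Group spans into roots and a parent-to-children mapping."""
    by_id = {s["spanID"]: s for s in spans}
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent = span.get("parentId")
        if parent and parent in by_id:
            children[parent].append(span)
        else:
            # No parent, or the parent is not in this result set.
            roots.append(span)
    return roots, children


def render_trace(spans):
    """Return indented lines, slowest child first, with durations in ms."""
    roots, children = build_trace_tree(spans)
    lines = []

    def walk(span, depth):
        ms = span["duration"] / 1000  # microseconds to milliseconds
        lines.append(f"{'  ' * depth}{span['serviceName']} {span['operationName']} ({ms:.1f} ms)")
        for child in sorted(children[span["spanID"]], key=lambda c: c["duration"], reverse=True):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines
```

Reading the rendered tree top-down shows which downstream call accounts for most of a slow request's duration.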
Trace Investigation Workflow
┌─────────────────────────────────────────────────────────────┐
│  1. CHECK LATENCY STATS                                     │
│     python get_slow_spans.py --stats                        │
│     → See which services have high latency                  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  2. FIND SLOW SPANS                                         │
│     python get_slow_spans.py --min-duration 500 --service X │
│     → Get specific slow spans with trace IDs                │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  3. TRACE FULL REQUEST                                      │
│     python get_traces.py --trace-id <id>                    │
│     → See all spans in the slow request                     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  4. CORRELATE WITH LOGS                                     │
│     python sample_logs.py --strategy around_anomaly         │
│     → Get logs around the same timestamp                    │
└─────────────────────────────────────────────────────────────┘
