Detecting Anomalies

by rustomax


Detect anomalies in metrics and time-series data using OPAL statistical methods. This skill covers multiple detection patterns for different types of anomalies: statistical outliers, sudden spikes/drops, threshold violations, and deviations from moving baselines.

Use when you need to:

  • Identify unusual spikes or drops in request volume, error counts, or latency
  • Detect values exceeding normal statistical bounds
  • Alert on sudden percentage changes (traffic doubling, sudden drops)
  • Compare current values to moving averages
  • Find outliers in skewed distributions

Key Concepts

Anomaly Detection Approaches

Statistical Methods (good for gradual changes):

  • Z-Score (standard deviation) - Assumes normal distribution
  • IQR (Interquartile Range) - Robust to skewed data
  • Percentile thresholds - Compare to historical baseline

Temporal Methods (good for sudden changes):

  • Rate of change - Detect sudden spikes/drops
  • Moving average deviation - Compare to recent baseline

Threshold Methods (simple and interpretable):

  • Static thresholds - Known limits (CPU > 90%)
  • Dynamic thresholds - Calculated from baseline (current > avg * 1.5)

When to Use Each Pattern

What type of anomaly?
├─ Known threshold (e.g., "CPU > 90%")
│  └─> Threshold-Based Detection (Pattern 3)
│
├─ Statistical outliers (unusual values)
│  ├─ Normal distribution?
│  │  └─> Z-Score Method (Pattern 1)
│  │
│  └─ Skewed distribution?
│     └─> IQR Method (Pattern 2)
│
├─ Sudden spikes/drops
│  └─> Rate of Change (Pattern 4)
│
└─ Deviation from recent baseline
   └─> Moving Average (Pattern 5)

Pattern 1: Statistical Outlier Detection (Z-Score)

Concept: Detect values beyond N standard deviations from the mean

When to use:

  • Metrics with relatively stable baseline
  • Data roughly follows normal distribution
  • Need statistically grounded detection

Query:

align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value),
          stddev_val:stddev(metric_value),
          current_val:sum(metric_value),
          group_by(service_name)
| make_col z_score:(current_val - avg_val) / stddev_val
| make_col upper_bound:avg_val + (2 * stddev_val)
| make_col lower_bound:avg_val - (2 * stddev_val)
| make_col is_anomaly:if(z_score > 2 or z_score < -2, true, false)
| filter is_anomaly = true
| sort desc(z_score)
| limit 20

Threshold tuning:

  • z > 2 or z < -2: ~95% confidence (moderate sensitivity)
  • z > 3 or z < -3: ~99.7% confidence (low false positives)
  • z > 1.5 or z < -1.5: ~87% confidence (high sensitivity)

Example result:

service_name: featureflagservice
avg_val: 11.5
stddev_val: 13.9
current_val: 46
z_score: 2.48
is_anomaly: true
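The arithmetic behind this pattern is easy to sanity-check outside OPAL. Below is a minimal Python sketch of the same z-score columns, fed with the values from the example result above (illustrative only; the production query runs as OPAL):

```python
# Z-score outlier check mirroring the OPAL make_col pipeline above.
def z_score_anomaly(current, avg, stddev, threshold=2.0):
    """Return (z, is_anomaly) for one group's aggregates."""
    z = (current - avg) / stddev
    return z, abs(z) > threshold

# Values from the featureflagservice example result.
z, flagged = z_score_anomaly(current=46, avg=11.5, stddev=13.9)
print(round(z, 2), flagged)  # 2.48 True
```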

Pros:

  • Statistically grounded
  • Well-understood confidence intervals
  • Good for normally distributed data

Cons:

  • Assumes normal distribution
  • Sensitive to extreme outliers in baseline
  • Requires sufficient historical data

Pattern 2: IQR (Interquartile Range) Method

Concept: Detect values beyond the interquartile range using Tukey's fences

When to use:

  • Skewed distributions (latency, error counts)
  • Presence of natural outliers in baseline
  • More robust alternative to Z-score

Query:

align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p25:percentile(metric_value, 0.25),
          p75:percentile(metric_value, 0.75),
          current_val:sum(metric_value),
          group_by(service_name)
| make_col iqr:p75 - p25
| make_col upper_fence:p75 + (1.5 * iqr)
| make_col lower_fence:p25 - (1.5 * iqr)
| make_col is_outlier:if(current_val > upper_fence or current_val < lower_fence, true, false)
| filter is_outlier = true
| sort desc(current_val)
| limit 20

Threshold tuning:

  • 1.5 * IQR: Standard outliers (moderate sensitivity)
  • 3 * IQR: Extreme outliers (low false positives)
  • 1 * IQR: More sensitive detection

Example result:

service_name: featureflagservice
p25: 1.75
p75: 16.75
iqr: 15
upper_fence: 39.25
current_val: 46
is_outlier: true
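Tukey's fences reduce to a few arithmetic steps; this Python sketch reproduces the example result above (illustrative only, not OPAL):

```python
# IQR / Tukey's-fences outlier check mirroring the OPAL pattern above.
def iqr_outlier(current, p25, p75, k=1.5):
    """Return ((lower_fence, upper_fence), is_outlier)."""
    iqr = p75 - p25
    lower, upper = p25 - k * iqr, p75 + k * iqr
    return (lower, upper), current > upper or current < lower

# Values from the featureflagservice example result.
fences, is_outlier = iqr_outlier(current=46, p25=1.75, p75=16.75)
print(fences, is_outlier)  # (-20.75, 39.25) True
```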

Pros:

  • Robust to skewed distributions
  • Not affected by extreme values
  • Based on quartiles (median-based)

Cons:

  • Less interpretable than Z-score
  • May miss anomalies in heavy-tailed distributions
  • Requires sufficient data for percentile calculation

Pattern 3: Threshold-Based Detection

Concept: Simple comparison against fixed or dynamic thresholds

When to use:

  • Known capacity limits (CPU > 90%, memory > 80%)
  • SLO violations (error rate > 1%, latency > 500ms)
  • Business rules (orders < 100 per hour)

Static Threshold:

align options(bins: 1), total_calls:sum(m("span_call_count_5m"))
aggregate current_rate:sum(total_calls), group_by(service_name)
| make_col threshold:100000
| make_col is_high:if(current_rate > threshold, true, false)
| filter is_high = true
| sort desc(current_rate)

Dynamic Threshold (baseline comparison):

align options(bins: 1), metric_value:sum(m("span_call_count_5m"))
aggregate baseline:avg(metric_value),
          current:sum(metric_value),
          group_by(service_name)
| make_col threshold:baseline * 1.5
| make_col is_anomaly:if(current > threshold, true, false)
| filter is_anomaly = true

Threshold multiplier guidance:

  • 1.5x: High sensitivity (more alerts)
  • 2x: Moderate sensitivity (balanced)
  • 3x: Low sensitivity (only major spikes)
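In code, the dynamic-threshold check is just a per-group comparison against baseline * multiplier. A small Python sketch with hypothetical (baseline, current) pairs per service:

```python
# Dynamic threshold: flag groups whose current value exceeds a
# multiple of their own baseline. Sample data is hypothetical.
def over_threshold(series, multiplier=1.5):
    """series: {name: (baseline, current)} -> {name: (current, threshold)}."""
    return {
        name: (current, baseline * multiplier)
        for name, (baseline, current) in series.items()
        if current > baseline * multiplier
    }

rates = {"frontend": (100.0, 180.0), "checkout": (50.0, 60.0)}
print(over_threshold(rates))  # {'frontend': (180.0, 150.0)}
```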

Pros:

  • Simple and interpretable
  • No assumptions about distribution
  • Clear business meaning

Cons:

  • Requires domain knowledge to set thresholds
  • Static thresholds may not adapt to changing baselines
  • May miss subtle anomalies

Pattern 4: Rate of Change Detection

Concept: Detect sudden spikes or drops by comparing to previous time period

When to use:

  • Detect sudden traffic spikes or drops
  • Identify rapid changes in behavior
  • Alert on percentage change thresholds

Query:

align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col previous_value:window(lag(metric_value, 1), group_by(service_name))
| make_col value_change:metric_value - previous_value
| make_col pct_change:if(previous_value > 0, (value_change / previous_value) * 100, 0)
| make_col is_spike:if(pct_change > 100 or pct_change < -50, true, false)
| filter is_spike = true
| sort desc(pct_change)
| limit 20

Critical syntax: Use window(lag(...), group_by(...)) NOT lag(...) over (partition by...)

Threshold examples:

  • pct_change > 100: 2x increase (doubling)
  • pct_change > 200: 3x increase
  • pct_change < -50: 50% drop
  • pct_change < -75: 75% drop

Example result:

service_name: frontend
metric_value: 50
previous_value: 2
value_change: 48
pct_change: 2400
is_spike: true
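The window(lag(...)) step computes, for each bucket, the change relative to the bucket before it. For one service's time-ordered buckets, the same arithmetic looks like this in Python (bucket values are hypothetical):

```python
# Lag-1 percentage change for one service's time-ordered buckets,
# mirroring the pct_change column in the OPAL query above.
def pct_changes(values):
    out = []
    for prev, cur in zip(values, values[1:]):
        # Guard against division by zero, as the OPAL if() does.
        out.append((cur - prev) / prev * 100 if prev > 0 else 0)
    return out

# Hypothetical 5-minute buckets: quiet, spike, slight drop.
print(pct_changes([2, 50, 40]))  # [2400.0, -20.0]
```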

Pros:

  • Detects sudden changes regardless of absolute value
  • Adapts to current baseline automatically
  • Effective for early spike detection

Cons:

  • Sensitive to very low baseline values (small numbers can cause large percentage changes)
  • May produce false positives during normal ramp-up/down
  • Requires at least 2 time periods of data

Best practices:

  • Add minimum value filter to avoid division by small numbers
  • Use different thresholds for increases vs decreases
  • Consider absolute change threshold in addition to percentage

Pattern 5: Moving Average Baseline

Concept: Compare current value to recent moving average using sliding window

When to use:

  • Smooth noisy metrics for baseline
  • Detect deviations from recent average
  • Adaptive baseline that follows trends

Query:

align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col moving_avg:window(avg(metric_value), group_by(service_name), frame(back:30m))
| make_col deviation:metric_value - moving_avg
| make_col pct_deviation:if(moving_avg > 0, (deviation / moving_avg) * 100, 0)
| make_col is_anomaly:if(pct_deviation > 50 or pct_deviation < -50, true, false)
| filter is_anomaly = true
| sort desc(pct_deviation)
| limit 20

Frame options:

  • frame(back:10m): Short-term baseline (10-minute average)
  • frame(back:30m): Medium-term baseline (30-minute average)
  • frame(back:1h): Longer-term baseline (1-hour average)

Deviation thresholds:

  • > 50%: Moderate deviation from recent average
  • > 100%: Doubling compared to recent average
  • > 25%: More sensitive detection
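The moving-average baseline is a sliding window over recent buckets. A Python sketch, assuming 5m buckets so that frame(back:30m) spans six buckets and includes the current one (check OPAL's exact frame semantics before relying on this):

```python
# Sliding-window percentage deviation from the moving average.
# frame=6 approximates frame(back:30m) with 5-minute buckets.
def moving_avg_deviation(values, frame=6):
    out = []
    for i, v in enumerate(values):
        window = values[max(0, i - frame + 1): i + 1]
        avg = sum(window) / len(window)
        out.append((v - avg) / avg * 100 if avg > 0 else 0)
    return out

# Flat baseline of 10, then a spike to 40 in the latest bucket.
devs = moving_avg_deviation([10, 10, 10, 10, 10, 40])
print(round(devs[-1], 1))  # 166.7
```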

Pros:

  • Adapts to changing baselines and trends
  • Smooths noisy data
  • Good for metrics with daily/hourly patterns

Cons:

  • Slower to detect anomalies (due to averaging)
  • May miss anomalies during rapid baseline shifts
  • Requires sufficient lookback data

Best practices:

  • Choose frame duration based on metric volatility
  • Shorter frames for fast-changing metrics
  • Longer frames for more stable baselines

Pattern 6: Percentile-Based Threshold

Concept: Compare current value to historical percentile (p95, p99)

When to use:

  • SLO violations (latency > p95)
  • Detect values above "normal high"
  • Comparing current to historical baseline

Query:

align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p95:percentile(metric_value, 0.95),
          p99:percentile(metric_value, 0.99),
          current:sum(metric_value),
          group_by(service_name)
| make_col is_anomaly:if(current > p95, true, false)
| filter is_anomaly = true
| sort desc(current)

Percentile choices:

  • p95: Detect top 5% unusual values (moderate sensitivity)
  • p99: Detect top 1% extreme values (low false positives)
  • p90: Detect top 10% (high sensitivity)
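A percentile threshold just compares the current value to a rank statistic of the history. Here is a Python sketch using the simple nearest-rank convention (OPAL's percentile() may interpolate differently; the history values are hypothetical):

```python
import math

# Nearest-rank percentile -- one of several conventions; OPAL's
# percentile() may use interpolation instead.
def percentile(values, q):
    s = sorted(values)
    return s[max(0, math.ceil(q * len(s)) - 1)]

history = [5, 7, 8, 9, 10, 11, 12, 14, 20, 95]
p95 = percentile(history, 0.95)
print(p95, 96 > p95)  # 95 True
```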

Pros:

  • Percentile-based SLOs are industry standard
  • Automatically adapts to data distribution
  • Clear meaning (top X% of values)

Cons:

  • Unidirectional (only detects high values, not drops)
  • Requires sufficient historical data
  • May not detect subtle shifts in distribution

Common Patterns

Pattern: Combine Multiple Detection Methods

Increase confidence by requiring multiple methods to agree:

align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value),
          stddev_val:stddev(metric_value),
          p95:percentile(metric_value, 0.95),
          current:sum(metric_value),
          group_by(service_name)
| make_col z_score:(current - avg_val) / stddev_val
| make_col is_zscore_anomaly:if(z_score > 2 or z_score < -2, true, false)
| make_col is_percentile_anomaly:if(current > p95, true, false)
| make_col is_anomaly:if(is_zscore_anomaly = true and is_percentile_anomaly = true, true, false)
| filter is_anomaly = true

Use case: Reduce false positives by requiring consensus
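The consensus logic is a conjunction of the two per-method flags. A Python sketch reusing the z-score example values, with hypothetical p95 values:

```python
# Consensus anomaly: require both the z-score and the percentile
# method to agree before flagging. p95 values are hypothetical.
def consensus(current, avg, stddev, p95, z_threshold=2.0):
    z = (current - avg) / stddev
    return abs(z) > z_threshold and current > p95

print(consensus(current=46, avg=11.5, stddev=13.9, p95=40))  # True
print(consensus(current=46, avg=11.5, stddev=13.9, p95=50))  # False
```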

Pattern: Multi-Metric Correlation

Detect anomalies across correlated metrics:

align options(bins: 1),
  requests:sum(m("span_call_count_5m")),
  errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
          total_errors:sum(errors),
          group_by(service_name)
| make_col error_rate:if(total_requests > 0, (float64(total_errors) / float64(total_requests)) * 100, 0)
| make_col threshold:1.0
| make_col is_high_error:if(error_rate > threshold and total_requests > 100, true, false)
| filter is_high_error = true

Use case: Alert when error rate AND request volume both indicate issues

Pattern: Time-Series Trending

Track anomalies over time using timechart:

align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value),
          stddev_val:stddev(metric_value),
          current:sum(metric_value),
          group_by(service_name)
| make_col z_score:(current - avg_val) / stddev_val
| make_col is_anomaly:if(z_score > 2 or z_score < -2, true, false)
| filter is_anomaly = true

Result: Multiple rows per service showing anomalies across time buckets

Use case: Visualize when and how often anomalies occur

OPAL Syntax Key Points

Window Functions (LAG/LEAD)

CRITICAL: OPAL uses window() function, NOT SQL OVER clause!

✅ CORRECT Syntax:

make_col prev:window(lag(column, 1), group_by(dimension))
make_col next:window(lead(column, 1), group_by(dimension))
make_col moving_avg:window(avg(column), group_by(dimension), frame(back:30m))

❌ WRONG Syntax (SQL-style):

lag(column, 1) over (partition by dimension order by time)

Window function components:

  • lag(column, offset): Access previous row value
  • lead(column, offset): Access next row value
  • group_by(dimension): Partition by dimension
  • frame(back:duration): Sliding window lookback period

Derived Columns Must Use Separate make_col

❌ WRONG - Cannot reference derived column in same make_col:

make_col upper_bound:avg + (2 * stddev),
         is_anomaly:if(value > upper_bound, true, false)

✅ CORRECT - Use separate make_col statements:

make_col upper_bound:avg + (2 * stddev)
| make_col is_anomaly:if(value > upper_bound, true, false)

Metrics Query Patterns

Summary (one row per group):

align options(bins: 1), metric:sum(m("metric_name"))
aggregate result:sum(metric), group_by(dimension)

Note: No pipe | between align options(bins: 1) and aggregate!

Time-series (multiple rows per group):

align 5m, metric:sum(m("metric_name"))
| aggregate result:sum(metric), group_by(dimension)

Note: Pipe | required between align 5m and aggregate!

Period-Over-Period Comparison with Timeshift + Union

For comparing entire periods (e.g., "this hour" vs "exactly 1 hour ago"), use the timeshift + union pattern with subquery definitions.

Key Difference:

  • window(lag()): Compares adjacent buckets (5-min to 5-min, approximate)
  • timeshift + union: Compares entire periods (exact time offset: 1h, 1d, 1w)

Working Example (✅ works in all query contexts):

@current <- @ {
    align rate:sum(m("span_call_count_5m"))
    aggregate current_sum:sum(rate), group_by(service_name)
}
@previous <- @ {
    timeshift 1h                        # Shift BEFORE align!
    align rate:sum(m("span_call_count_5m"))
    aggregate prev_sum:sum(rate), group_by(service_name)
}
@combined <- @current {
    union @previous
    aggregate current:any_not_null(current_sum),
              previous:any_not_null(prev_sum),
              group_by(service_name)
    make_col change:current - previous
    make_col pct_change:if(previous > 0, (change / previous) * 100, 0)
    make_col abs_pct_change:if(pct_change < 0, -pct_change, pct_change)
}
<- @combined {
    filter abs_pct_change > 50
    sort desc(abs_pct_change)
    limit 10
}

Critical Points:

  1. @subquery <- @: Use @ alone to reference the primary input dataset
  2. timeshift BEFORE align: Operates on raw data, shifts timestamps before aggregation
  3. Separate aggregation: Both series must be aggregated independently
  4. any_not_null() collapses union: Combines current/previous into single row per dimension
  5. Works everywhere: MCP queries, worksheets, and monitors all support this syntax

Use Cases:

  • Day-over-day comparison: "Today vs yesterday" (use timeshift 1d)
  • Week-over-week trending: "This week vs last week" (use timeshift 7d)
  • Hour-over-hour spikes: "This hour vs 1 hour ago" (use timeshift 1h)
  • SLA violations: "Current vs same period last month" (use timeshift 30d)

Tested Results:

  • ✅ Detected 200% increase in service request rate (18 vs 6)
  • ✅ Detected 92% drop in request volume (1 vs 13)
  • ✅ Works with any timeshift duration (1h, 6h, 1d, 7d, etc.)
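What union + any_not_null() accomplishes is a join-like collapse: one row per service carrying both the current and the timeshifted aggregate. A Python sketch using the tested numbers above (the merge mechanics are illustrative only, not OPAL semantics):

```python
# Collapse current and timeshifted aggregates to one row per service,
# then compute percentage change -- mirroring the @combined stage.
current = {"frontend": 18, "checkout": 1}
previous = {"frontend": 6, "checkout": 13}  # shifted 1h earlier

combined = {}
for svc in sorted(set(current) | set(previous)):
    cur, prev = current.get(svc, 0), previous.get(svc, 0)
    pct = (cur - prev) / prev * 100 if prev > 0 else 0
    combined[svc] = round(pct, 1)

print(combined)  # {'checkout': -92.3, 'frontend': 200.0}
```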

Comparison with window(lag()):

Feature        | window(lag(rate, N))                         | timeshift + union
Time precision | Approximate (N buckets back)                 | Exact (fixed time offset)
Example        | lag(rate, 12) ≈ 1 hour (if buckets are 5min) | timeshift 1h = exactly 60 minutes
Complexity     | Simple, one query                            | More complex, subqueries + union
Use case       | Real-time spike detection                    | Period-over-period reporting
Best for       | "Current vs previous bucket"                 | "Current vs same time yesterday"

When to use each:

  • Use window(lag()) for: Real-time alerts, simple spike detection, fast queries
  • Use timeshift + union for: Exact period comparison, day-over-day reports, SLA tracking

Troubleshooting

Issue: "Unknown function 'over()'"

Cause: Using SQL window function syntax instead of OPAL syntax

Solution: Use window(lag(...), group_by(...)) instead of lag(...) over (...)

Example:

# WRONG
make_col prev:lag(value, 1) over (partition by service order by time)

# CORRECT
make_col prev:window(lag(value, 1), group_by(service))

Issue: High false positive rate

Cause: Threshold too sensitive or baseline includes anomalies

Solutions:

  1. Increase threshold: Use 3-sigma instead of 2-sigma for Z-score
  2. Combine methods: Require multiple detection methods to agree
  3. Filter baseline: Exclude known anomaly periods from baseline calculation
  4. Add minimum value filter: Avoid alerting on very low absolute values

Example with minimum value filter:

| make_col is_spike:if(pct_change > 100 and metric_value > 10, true, false)

Issue: Missing anomalies (false negatives)

Cause: Threshold too strict or wrong detection method for data type

Solutions:

  1. Decrease threshold: Use 1.5-sigma or lower percentile (p90 instead of p95)
  2. Try different method: IQR if data is skewed, rate-of-change for sudden spikes
  3. Check data distribution: Visualize baseline to understand normal range
  4. Use multiple methods: Catch different types of anomalies

Issue: Division by zero or very small numbers

Cause: Calculating percentage change when previous value is zero or very small

Solution: Add conditional check for minimum denominator:

make_col pct_change:if(previous_value > 5, (value_change / previous_value) * 100, 0)

Issue: Window function returns null values

Cause: First row in group has no previous value for lag()

Solution: This is expected behavior: the first row in each group has no previous value, so lag() returns null. Filter nulls or provide a default:

make_col previous_value:window(lag(metric_value, 1), group_by(service_name))
| filter not is_null(previous_value)

Or use a default value (though a default argument is not directly supported in the current lag() syntax):

make_col pct_change:if(is_null(previous_value), 0, (value_change / previous_value) * 100)

Key Takeaways

  1. Choose detection method based on anomaly type and data distribution

    • Z-Score for normal distributions
    • IQR for skewed data
    • Rate-of-change for sudden spikes
    • Moving average for trend deviations
  2. OPAL window functions use different syntax from SQL

    • Use window(lag(...), group_by(...)) NOT lag(...) over (...)
    • Works with both metrics (align) and raw datasets
  3. Combine multiple methods to reduce false positives

    • Require Z-score AND percentile agreement
    • Add minimum value filters for rate-of-change
    • Correlate multiple metrics (requests + errors)
  4. Tune thresholds based on metric characteristics

    • Volatile metrics: Higher thresholds (3-sigma, 100% change)
    • Stable metrics: Lower thresholds (2-sigma, 50% change)
    • Test and iterate based on false positive rate
  5. Derived columns require separate make_col statements

    • Cannot reference newly created column in same make_col
    • Use pipeline of make_col statements for sequential calculations
  6. Frame specification enables sliding window calculations

    • frame(back:30m) for 30-minute moving average
    • Shorter frames for fast-changing metrics
    • Longer frames for stable baselines
  7. Metrics queries have two distinct patterns

    • options(bins: 1) for summary (no pipe before aggregate)
    • align 5m for time-series (pipe required before aggregate)
  8. Statistical methods work best with sufficient historical data

    • Need enough data points for meaningful stddev/percentiles
    • Consider minimum sample size (e.g., 24 hours of 5m buckets = 288 samples)
  9. Rate-of-change detection is powerful but requires careful tuning

    • Very effective for early spike detection
    • Prone to false positives with low baseline values
    • Add minimum value and absolute change filters
  10. Test detection patterns against historical data

    • Validate false positive rate on known-good periods
    • Verify detection on known anomaly events
    • Adjust thresholds based on operational feedback

When to Use This Skill

Use detecting-anomalies skill when:

  • User asks to check for anomalies
  • Creating alert rules for unusual behavior
  • Investigating performance degradation or incidents
  • Identifying outliers in service metrics
  • Detecting sudden traffic spikes or drops
  • Comparing current values to historical baselines
  • Setting up SLO violation alerts
  • Analyzing metrics for unusual patterns

Cross-references:

  • aggregating-gauge-metrics (for metric query patterns)
  • analyzing-tdigest-metrics (for percentile-based detection)
  • time-series-analysis (for temporal trending)
  • working-with-intervals (for span-based anomaly detection)
  • window-functions-deep-dive (to better understand window functions)


Skill Information

Category: Data
Last Updated: 12/23/2025