Clickhouse Query Optimization

by dawiddutoit

cli

|

Skill Details

Repository Files

3 files in this skill directory


name: clickhouse-query-optimization description: | Optimizes ClickHouse queries for speed and efficiency. Helps with primary key design, sparse indexes, data skipping indexes (minmax, set, bloom filter, ngrambf_v1), partitioning strategies, projections, PREWHERE optimization, approximate functions, and query profiling with EXPLAIN. Use when writing ClickHouse queries, designing table schemas, analyzing slow queries, or implementing analytical aggregations. Works with columnar OLAP workloads. allowed-tools: Read, Grep, Bash

ClickHouse Query Optimization

Quick Start

Check your query plan:

EXPLAIN
SELECT user_id, COUNT()
FROM events
WHERE timestamp >= '2024-01-01'
GROUP BY user_id;

This shows which parts of the index are used, how many partitions are read, and the aggregation strategy.

When to Use

  • Write fast ClickHouse queries
  • Design table schemas
  • Analyze slow queries
  • Add data skipping indexes
  • Implement partitioning strategies
  • Use projections for multiple access patterns

Core Principles

1. Primary Key Design

The primary key defines sort order (not uniqueness). Order columns by low → high cardinality.

-- Good: country (low) → user_id → timestamp (high)
CREATE TABLE events (
    user_id UInt32,
    timestamp DateTime,
    country String
)
ENGINE = MergeTree()
ORDER BY (country, user_id, timestamp);

Key principle: Queries must filter on primary key prefix to use index.

-- ✅ Fast: Uses index (country first)
SELECT * FROM events WHERE country = 'US';

-- ❌ Slow: Skips index (missing country)
SELECT * FROM events WHERE user_id = 12345;

2. Data Skipping Indexes

For non-primary-key columns:

-- Numeric ranges
ALTER TABLE events ADD INDEX idx_duration session_duration TYPE minmax GRANULARITY 4;

-- Categorical (low cardinality)
ALTER TABLE events ADD INDEX idx_event_type event_type TYPE set(100) GRANULARITY 4;

-- String equality
ALTER TABLE events ADD INDEX idx_url url TYPE bloom_filter(0.01) GRANULARITY 4;

-- Substring search
ALTER TABLE logs ADD INDEX idx_message message TYPE ngrambf_v1(4, 512, 3, 0) GRANULARITY 1;

3. Partitioning for Lifecycle Management

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp);

-- Drop old data instantly
ALTER TABLE events DROP PARTITION '202401';

-- Or use TTL
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;

4. Projections for Multiple Access Patterns

-- Main table sorted by user_id
CREATE TABLE events (
    user_id UInt32,
    product_id UInt32,
    timestamp DateTime
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

-- Add projection for product queries
ALTER TABLE events ADD PROJECTION proj_by_product (
    SELECT *
    ORDER BY (product_id, timestamp)
);

ALTER TABLE events MATERIALIZE PROJECTION proj_by_product;

-- Both queries now fast:
SELECT * FROM events WHERE user_id = 12345;    -- Uses main table
SELECT * FROM events WHERE product_id = 789;   -- Uses projection

5. Query Optimization

PREWHERE for Early Filtering:

SELECT user_id, event_type, properties
FROM events
PREWHERE timestamp >= '2024-01-01' AND country = 'US'  -- Small columns first
WHERE event_type IN ('purchase', 'signup');             -- Complex logic

Approximate Functions:

-- 10-100x faster, ~2% error
SELECT uniq(user_id) FROM events;                   -- vs COUNT(DISTINCT)
SELECT topK(10)(product_id) FROM events;            -- Approximate top-K
SELECT quantile(0.95)(response_time) FROM events;   -- Approximate percentile

Select Only Needed Columns:

-- Bad: Reads all columns
SELECT * FROM events WHERE user_id = 12345;

-- Good: Columnar advantage
SELECT user_id, timestamp, event_type FROM events WHERE user_id = 12345;

6. Profile and Debug

-- View execution plan
EXPLAIN SELECT COUNT() FROM events WHERE country = 'US';

-- Check performance
SELECT
    query,
    query_duration_ms,
    read_rows,
    read_bytes
FROM system.query_log
WHERE query LIKE '%events%'
ORDER BY event_time DESC
LIMIT 1;

Common Patterns

Technique Problem Solved Impact When to Use
Primary Key Design Index doesn't cover queries Foundation Always (design first)
Data Skipping Indexes Non-primary filtering slow 10-100x After primary key
Partitioning Need to delete old data Instant deletion Time-series with retention
Projections Multiple query patterns 100-1000x Different sort orders
Query Syntax Large columns read unnecessarily 2-10x Per-query optimization
Profiling Don't know why slow Insight When optimization unclear

Supporting Files

File Purpose
examples/examples.md Real-world optimization scenarios with metrics
references/reference.md Technical guides and decision trees

Requirements

  • ClickHouse 21.4+
  • Understanding of SQL and aggregation
  • Knowledge of query patterns

Integration Tips

  1. Design tables first (use EXPLAIN before/after)
  2. Monitor query_log (alert on > 100M rows read)
  3. Profile inserts (more indexes = slower writes)
  4. Test projections (use EXPLAIN to confirm optimizer choice)

See Also

Related Skills

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

datacli

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

datacli

Clinical Decision Support

Generate professional clinical decision support (CDS) documents for pharmaceutical and clinical research settings, including patient cohort analyses (biomarker-stratified with outcomes) and treatment recommendation reports (evidence-based guidelines with decision algorithms). Supports GRADE evidence grading, statistical analysis (hazard ratios, survival curves, waterfall plots), biomarker integration, and regulatory compliance. Outputs publication-ready LaTeX/PDF format optimized for drug develo

developmentdocumentcli

Geopandas

Python library for working with geospatial vector data including shapefiles, GeoJSON, and GeoPackage files. Use when working with geographic data for spatial analysis, geometric operations, coordinate transformations, spatial joins, overlay operations, choropleth mapping, or any task involving reading/writing/analyzing vector geographic data. Supports PostGIS databases, interactive maps, and integration with matplotlib/folium/cartopy. Use for tasks like buffer analysis, spatial joins between dat

artdatacli

Datacommons Client

Work with Data Commons, a platform providing programmatic access to public statistical data from global sources. Use this skill when working with demographic data, economic indicators, health statistics, environmental data, or any public datasets available through Data Commons. Applicable for querying population statistics, GDP figures, unemployment rates, disease prevalence, geographic entity resolution, and exploring relationships between statistical entities.

datacli

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

datacli

Geopandas

Python library for working with geospatial vector data including shapefiles, GeoJSON, and GeoPackage files. Use when working with geographic data for spatial analysis, geometric operations, coordinate transformations, spatial joins, overlay operations, choropleth mapping, or any task involving reading/writing/analyzing vector geographic data. Supports PostGIS databases, interactive maps, and integration with matplotlib/folium/cartopy. Use for tasks like buffer analysis, spatial joins between dat

artdatacli

Datacommons Client

Work with Data Commons, a platform providing programmatic access to public statistical data from global sources. Use this skill when working with demographic data, economic indicators, health statistics, environmental data, or any public datasets available through Data Commons. Applicable for querying population statistics, GDP figures, unemployment rates, disease prevalence, geographic entity resolution, and exploring relationships between statistical entities.

datacli

Clinical Decision Support

Generate professional clinical decision support (CDS) documents for pharmaceutical and clinical research settings, including patient cohort analyses (biomarker-stratified with outcomes) and treatment recommendation reports (evidence-based guidelines with decision algorithms). Supports GRADE evidence grading, statistical analysis (hazard ratios, survival curves, waterfall plots), biomarker integration, and regulatory compliance. Outputs publication-ready LaTeX/PDF format optimized for drug develo

developmentdocumentcli

Clickhouse Query

Run ClickHouse queries for analytics, metrics analysis, and event data exploration. Use when you need to query ClickHouse directly, analyze metrics, check event tracking data, or test query performance. Read-only by default.

datacli

Skill Information

Category:Technical
Allowed Tools:Read, Grep, Bash
Last Updated:1/26/2026