Window Function Builder

by dengineproblem



name: window-function-builder description: SQL window function expert. Use for analytical queries, OVER clauses, and ranking functions.

SQL Window Function Builder

An expert guide to SQL window functions and analytical queries for business analytics.

Core Function Categories

Function Overview

window_functions:
  ranking:
    - ROW_NUMBER(): "Unique sequential numbers"
    - RANK(): "Ranking with gaps for ties"
    - DENSE_RANK(): "Ranking without gaps"
    - NTILE(n): "Divide into n buckets"

  aggregate:
    - SUM(): "Running/cumulative totals"
    - AVG(): "Moving averages"
    - COUNT(): "Running counts"
    - MIN(): "Running minimum"
    - MAX(): "Running maximum"

  offset:
    - LAG(col, n): "Access previous row"
    - LEAD(col, n): "Access next row"
    - FIRST_VALUE(): "First value in frame"
    - LAST_VALUE(): "Last value in frame"
    - NTH_VALUE(col, n): "Nth value in frame"

  statistical:
    - PERCENT_RANK(): "Relative rank as percentage"
    - CUME_DIST(): "Cumulative distribution"
    - PERCENTILE_CONT(): "Interpolated percentile"
    - PERCENTILE_DISC(): "Discrete percentile"
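
A single query can mix all four categories; a minimal sketch against the orders table used throughout this document (column names are illustrative):

-- One function from each category, side by side
SELECT
    order_id,
    order_date,
    order_amount,
    ROW_NUMBER() OVER (ORDER BY order_date)      AS rn,            -- ranking
    SUM(order_amount) OVER (ORDER BY order_date) AS running_total, -- aggregate
    LAG(order_amount) OVER (ORDER BY order_date) AS prev_amount,   -- offset
    CUME_DIST() OVER (ORDER BY order_amount)     AS cume_dist      -- statistical
FROM orders;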

Basic Syntax

OVER Clause Structure

function_name(expression) OVER (
    [PARTITION BY partition_expression, ...]
    [ORDER BY sort_expression [ASC|DESC], ...]
    [frame_clause]
)
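
For instance, all three clauses in one call (assuming a daily_sales table with a region column; PARTITION BY restarts the window per group without collapsing rows):

SELECT
    region,
    date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        PARTITION BY region                       -- restart the window per region
        ORDER BY date                             -- order rows within each region
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- explicit 3-row frame
    ) AS region_3row_avg
FROM daily_sales;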

Frame Specifications

-- Frame clause syntax
ROWS | RANGE | GROUPS BETWEEN frame_start AND frame_end  -- GROUPS is covered under PostgreSQL below

-- Frame bounds
UNBOUNDED PRECEDING  -- From first row of partition
n PRECEDING          -- n rows before current
CURRENT ROW          -- Current row
n FOLLOWING          -- n rows after current
UNBOUNDED FOLLOWING  -- To last row of partition

-- Common frame patterns
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- Running total (note: the ORDER BY default is RANGE, not ROWS)
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW          -- 7-day rolling
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING          -- 3-row centered
RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW  -- 30-day range
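
ROWS and RANGE diverge when the ORDER BY key has ties: a RANGE frame extends to all peer rows of the current row, so tied rows share one result. A sketch of the difference:

-- With duplicate order_date values the two frame types disagree
SELECT
    order_date,
    order_amount,
    SUM(order_amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW   -- counts physical rows
    ) AS rows_total,
    SUM(order_amount) OVER (
        ORDER BY order_date
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- includes every tie on order_date
    ) AS range_total
FROM orders;
-- Rows sharing an order_date get distinct rows_total values but the same range_total.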

Ranking Functions

ROW_NUMBER, RANK, DENSE_RANK

-- Basic ranking comparison
SELECT
    employee_id,
    department,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
    RANK() OVER (ORDER BY salary DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees;

-- Results with tied salaries:
-- | employee_id | salary | row_num | rank | dense_rank |
-- |-------------|--------|---------|------|------------|
-- | 101         | 100000 | 1       | 1    | 1          |
-- | 102         | 100000 | 2       | 1    | 1          |  <- same salary
-- | 103         | 90000  | 3       | 3    | 2          |  <- note rank skips 2

Ranking Within Groups

-- Top 3 salaries per department
WITH ranked AS (
    SELECT
        employee_id,
        employee_name,
        department,
        salary,
        DENSE_RANK() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        ) AS dept_rank
    FROM employees
)
SELECT *
FROM ranked
WHERE dept_rank <= 3;

NTILE for Percentiles

-- Divide customers into quartiles by revenue
SELECT
    customer_id,
    customer_name,
    total_revenue,
    NTILE(4) OVER (ORDER BY total_revenue DESC) AS revenue_quartile,
    CASE NTILE(4) OVER (ORDER BY total_revenue DESC)  -- repeated; an alias can't be referenced in its own SELECT
        WHEN 1 THEN 'Top 25%'
        WHEN 2 THEN '25-50%'
        WHEN 3 THEN '50-75%'
        WHEN 4 THEN 'Bottom 25%'
    END AS segment
FROM customer_revenue;

Running Calculations

Running Totals

-- Running total of sales
SELECT
    order_date,
    order_amount,
    SUM(order_amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM orders;

-- Running total per customer
SELECT
    customer_id,
    order_date,
    order_amount,
    SUM(order_amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS customer_running_total
FROM orders;

Moving Averages

-- 7-day moving average
SELECT
    date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d,
    -- Also track the count for partial windows
    COUNT(*) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS days_in_window
FROM daily_sales;

-- Centered moving average (3 days before, current, 3 days after)
SELECT
    date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
    ) AS centered_avg_7d
FROM daily_sales;

Cumulative Percentage

-- Cumulative percentage of total
SELECT
    product_category,
    revenue,
    SUM(revenue) OVER (
        ORDER BY revenue DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- explicit: the default RANGE frame lumps tied revenues
    ) AS cumulative_revenue,
    ROUND(
        100.0 * SUM(revenue) OVER (
            ORDER BY revenue DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) /
        SUM(revenue) OVER (),
        2
    ) AS cumulative_pct
FROM category_sales
ORDER BY revenue DESC;

LAG and LEAD

Period-over-Period Comparison

-- Month-over-month growth
SELECT
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS mom_change,
    ROUND(
        100.0 * (revenue - LAG(revenue, 1) OVER (ORDER BY month)) /
        NULLIF(LAG(revenue, 1) OVER (ORDER BY month), 0),
        2
    ) AS mom_growth_pct
FROM monthly_revenue;

-- Year-over-year comparison (assumes exactly one row per day with no gaps; LAG(..., 365) misses leap years)
SELECT
    date,
    revenue,
    LAG(revenue, 365) OVER (ORDER BY date) AS yoy_revenue,
    revenue - LAG(revenue, 365) OVER (ORDER BY date) AS yoy_change
FROM daily_revenue;

Gap Analysis

-- Find gaps between orders
SELECT
    customer_id,
    order_date,
    LAG(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS previous_order_date,
    order_date - LAG(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS days_since_last_order
FROM orders;

-- Identify churned customers (no order > 90 days)
WITH order_gaps AS (
    SELECT
        customer_id,
        order_date,
        LEAD(order_date) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) AS next_order_date,
        LEAD(order_date) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) - order_date AS gap_days
    FROM orders
)
SELECT DISTINCT customer_id
FROM order_gaps
WHERE gap_days > 90
   OR (next_order_date IS NULL AND order_date < CURRENT_DATE - INTERVAL '90 days');

Lead for Future Values

-- Forecast vs actual comparison
SELECT
    forecast_date,
    predicted_value,
    LEAD(actual_value, 7) OVER (ORDER BY forecast_date) AS actual_7d_later,
    LEAD(actual_value, 7) OVER (ORDER BY forecast_date) - predicted_value AS forecast_error
FROM forecasts;

FIRST_VALUE and LAST_VALUE

First/Last in Group

-- First and last order per customer
SELECT
    customer_id,
    order_id,
    order_date,
    order_amount,
    FIRST_VALUE(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS first_order_date,
    LAST_VALUE(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_order_date
FROM orders;

-- Note: LAST_VALUE requires explicit frame to see all rows!
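
To see why, here is the same idea with the default frame; this is a sketch of the pitfall, not a useful query:

-- Pitfall: with the default frame, LAST_VALUE just echoes the current row
SELECT
    customer_id,
    order_date,
    LAST_VALUE(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date          -- default frame ends at the current row (and its peers)
    ) AS not_the_last_order_date     -- equals order_date itself
FROM orders;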

Baseline Comparison

-- Compare each day to first day of month
SELECT
    date,
    revenue,
    FIRST_VALUE(revenue) OVER (
        PARTITION BY DATE_TRUNC('month', date)
        ORDER BY date
    ) AS first_day_revenue,
    revenue - FIRST_VALUE(revenue) OVER (
        PARTITION BY DATE_TRUNC('month', date)
        ORDER BY date
    ) AS diff_from_first_day
FROM daily_revenue;

Statistical Functions

Percentile Calculations

-- Calculate percentiles
SELECT
    product_id,
    price,
    PERCENT_RANK() OVER (ORDER BY price) AS percent_rank,
    CUME_DIST() OVER (ORDER BY price) AS cumulative_dist,
    NTILE(100) OVER (ORDER BY price) AS percentile
FROM products;

-- Median calculation (50th percentile); PERCENTILE_CONT/DISC are ordered-set aggregates
-- (WITHIN GROUP), not OVER-clause window functions in PostgreSQL
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary_discrete,
    AVG(salary) AS mean_salary
FROM employees;

-- Multiple percentiles at once
SELECT
    department,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) AS median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) AS p75,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY salary) AS p90
FROM employees
GROUP BY department;

Business Analytics Patterns

Customer Lifecycle Analysis

-- Customer order sequence and lifecycle metrics
WITH customer_orders AS (
    SELECT
        customer_id,
        order_id,
        order_date,
        order_amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) AS order_number,
        FIRST_VALUE(order_date) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) AS first_order_date,
        SUM(order_amount) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS lifetime_value,
        LAG(order_date) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) AS previous_order_date
    FROM orders
)
SELECT
    customer_id,
    order_id,
    order_number,
    order_amount,
    lifetime_value,
    order_date - first_order_date AS days_since_first_order,
    order_date - previous_order_date AS days_since_last_order,
    CASE
        WHEN order_number = 1 THEN 'New'
        WHEN order_date - previous_order_date > 90 THEN 'Reactivated'
        ELSE 'Repeat'
    END AS customer_status
FROM customer_orders;

Cohort Analysis

-- Monthly cohort retention (assumes user_activity carries a precomputed first_activity_date per user)
WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', first_activity_date) AS cohort_month,
        DATE_TRUNC('month', activity_date) AS activity_month
    FROM user_activity
),
cohort_sizes AS (
    SELECT
        cohort_month,
        COUNT(DISTINCT user_id) AS cohort_size
    FROM user_cohorts
    GROUP BY cohort_month
),
monthly_activity AS (
    SELECT
        cohort_month,
        activity_month,
        COUNT(DISTINCT user_id) AS active_users,
        EXTRACT(YEAR FROM AGE(activity_month, cohort_month)) * 12
            + EXTRACT(MONTH FROM AGE(activity_month, cohort_month)) AS month_number
    FROM user_cohorts
    GROUP BY cohort_month, activity_month
)
SELECT
    ma.cohort_month,
    ma.month_number,
    cs.cohort_size,
    ma.active_users,
    ROUND(100.0 * ma.active_users / cs.cohort_size, 2) AS retention_rate
FROM monthly_activity ma
JOIN cohort_sizes cs ON ma.cohort_month = cs.cohort_month
ORDER BY ma.cohort_month, ma.month_number;

Sales Performance

-- Sales rep performance with rankings and targets
WITH sales_performance AS (
    SELECT
        sales_rep_id,
        rep_name,
        region,
        SUM(deal_amount) AS total_sales,
        COUNT(*) AS deal_count,
        AVG(deal_amount) AS avg_deal_size
    FROM sales_deals
    WHERE close_date >= DATE_TRUNC('quarter', CURRENT_DATE)
    GROUP BY sales_rep_id, rep_name, region
)
SELECT
    sales_rep_id,
    rep_name,
    region,
    total_sales,
    deal_count,
    avg_deal_size,
    RANK() OVER (ORDER BY total_sales DESC) AS overall_rank,
    RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS region_rank,
    total_sales - AVG(total_sales) OVER () AS vs_company_avg,
    total_sales - AVG(total_sales) OVER (PARTITION BY region) AS vs_region_avg,
    PERCENT_RANK() OVER (ORDER BY total_sales) AS percentile
FROM sales_performance;

Inventory Analysis

-- Inventory movement analysis
SELECT
    product_id,
    transaction_date,
    transaction_type,
    quantity,
    SUM(
        CASE WHEN transaction_type = 'IN' THEN quantity
             WHEN transaction_type = 'OUT' THEN -quantity
             ELSE 0
        END
    ) OVER (
        PARTITION BY product_id
        ORDER BY transaction_date, transaction_id
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_inventory,
    AVG(
        CASE WHEN transaction_type = 'OUT' THEN quantity END
    ) OVER (
        PARTITION BY product_id
        ORDER BY transaction_date
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW  -- last 30 transactions, not 30 calendar days
    ) AS avg_daily_usage_30d
FROM inventory_transactions;

Platform-Specific Features

PostgreSQL

-- FILTER clause with window functions
SELECT
    date,
    category,
    amount,
    SUM(amount) OVER (ORDER BY date) AS total_running,
    SUM(amount) FILTER (WHERE category = 'A') OVER (ORDER BY date) AS category_a_running
FROM transactions;

-- GROUPS frame type
SELECT
    date,
    amount,
    SUM(amount) OVER (
        ORDER BY date
        GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING  -- peer groups (ORDER BY ties), not physical rows; PostgreSQL 11+
    ) AS sum_adjacent_groups
FROM transactions;

SQL Server

-- STRING_AGG (SQL Server 2017+) is an aggregate function only: it does not
-- accept an OVER clause. To attach the per-group list to each row, aggregate
-- in a CROSS APPLY (or a joined subquery) instead:
SELECT DISTINCT
    e.department,
    a.employees
FROM employees AS e
CROSS APPLY (
    SELECT STRING_AGG(e2.employee_name, ', ')
               WITHIN GROUP (ORDER BY e2.employee_name) AS employees
    FROM employees AS e2
    WHERE e2.department = e.department
) AS a;

-- A frame that excludes the current row (ROWS frames supported since SQL Server 2012)
SELECT
    date,
    value,
    AVG(value) OVER (
        ORDER BY date
        ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS previous_week_avg
FROM daily_metrics;

BigQuery

-- QUALIFY clause (filter on window function results)
SELECT
    customer_id,
    order_date,
    order_amount
FROM orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY order_date DESC
) = 1;

-- Equivalent to CTE approach:
-- WITH ranked AS (
--     SELECT *, ROW_NUMBER() OVER (...) AS rn FROM orders
-- )
-- SELECT * FROM ranked WHERE rn = 1;

Snowflake

-- QUALIFY clause
SELECT *
FROM sales
QUALIFY DENSE_RANK() OVER (
    PARTITION BY region
    ORDER BY revenue DESC
) <= 5;

-- CONDITIONAL_TRUE_EVENT for sessionization
-- (compute the gap in a CTE first, since window functions cannot be nested)
WITH gaps AS (
    SELECT
        user_id,
        event_time,
        DATEDIFF(
            'minute',
            LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
            event_time
        ) > 30 AS is_new_session
    FROM user_events
)
SELECT
    user_id,
    event_time,
    CONDITIONAL_TRUE_EVENT(is_new_session) OVER (
        PARTITION BY user_id
        ORDER BY event_time
    ) AS session_id
FROM gaps;

Performance Optimization

Indexing Strategy

-- Composite index for window function
-- Index columns: PARTITION BY columns first, then ORDER BY columns
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date);

-- For running totals across all data
CREATE INDEX idx_orders_date_amount
ON orders (order_date, order_amount);

Query Optimization Tips

optimization_tips:
  - tip: "Add indexes on PARTITION BY + ORDER BY columns"
    reason: "Reduces sort operations"

  - tip: "Use CTEs to compute window once, reference multiple times"
    reason: "Avoids redundant calculations"

  - tip: "Limit rows before applying window functions when possible"
    reason: "Smaller dataset = faster windows"

  - tip: "Be explicit about frame clause"
    reason: "Prevents unexpected defaults"

  - tip: "Use QUALIFY when supported"
    reason: "Cleaner and often faster than subquery"

CTE Pattern for Clarity

-- Use CTEs for complex window queries
WITH
-- Step 1: Calculate raw metrics
daily_metrics AS (
    SELECT
        date,
        SUM(amount) AS daily_total
    FROM transactions
    GROUP BY date
),
-- Step 2: Add window calculations
with_windows AS (
    SELECT
        date,
        daily_total,
        SUM(daily_total) OVER (ORDER BY date) AS running_total,
        AVG(daily_total) OVER (
            ORDER BY date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS rolling_7d_avg,
        LAG(daily_total, 7) OVER (ORDER BY date) AS same_day_last_week
    FROM daily_metrics
)
-- Step 3: Final calculations
SELECT
    date,
    daily_total,
    running_total,
    rolling_7d_avg,
    daily_total - same_day_last_week AS wow_change
FROM with_windows
WHERE date >= CURRENT_DATE - INTERVAL '30 days';

Best Practices

  1. Be explicit about the frame clause: the default behavior can be surprising
  2. Use CTEs for complex queries: they improve readability and maintainability
  3. Create composite indexes on the PARTITION BY + ORDER BY columns
  4. Check NULL handling: LAG/LEAD return NULL for boundary rows (see the sketch after this list)
  5. Test edge cases: the first/last row, and single-row partitions
  6. Use QUALIFY where supported: it is cleaner than a subquery
  7. Mind DBMS differences: syntax and available functions vary across platforms
  8. Filter with WHERE first: reduce the row set before window functions are applied
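
For item 4, LAG and LEAD accept an optional third argument that substitutes for the boundary NULL; a minimal sketch against the monthly_revenue table used earlier:

-- Defaulting the boundary row: the first month reports 0 change instead of NULL
SELECT
    month,
    revenue,
    revenue - LAG(revenue, 1, revenue) OVER (ORDER BY month) AS mom_change,
    COALESCE(LAG(revenue) OVER (ORDER BY month), 0)          AS prev_or_zero
FROM monthly_revenue;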
