Analytics Engineer
by borghei
Expert analytics engineering covering data modeling, dbt development, data transformation, and semantic layer management.
Skill Details
Repository Files
1 file in this skill directory
name: analytics-engineer description: Expert analytics engineering covering data modeling, dbt development, data transformation, and semantic layer management. version: 1.0.0 author: Claude Skills category: data-analytics tags: [analytics-engineering, dbt, data-modeling, transformation, semantic-layer]
Analytics Engineer
Expert-level analytics engineering for scalable data transformation.
Core Competencies
- Data modeling
- dbt development
- SQL transformation
- Semantic layer design
- Data testing
- Documentation
- Performance optimization
- Pipeline orchestration
Analytics Engineering Stack
Modern Data Stack
SOURCES → INGESTION → WAREHOUSE → TRANSFORMATION → SEMANTIC → BI
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
APIs Fivetran Snowflake dbt Looker Tableau
DBs Airbyte BigQuery Dataform Transform PBI
Files Stitch Redshift Spark SQL dbt ML Metabase
Project Structure (dbt)
analytics/
├── dbt_project.yml
├── profiles.yml
├── models/
│ ├── staging/ # Raw → Cleaned
│ │ ├── stg_*.sql
│ │ └── _stg_*.yml
│ ├── intermediate/ # Business logic
│ │ ├── int_*.sql
│ │ └── _int_*.yml
│ └── marts/ # Final models
│ ├── core/
│ │ ├── dim_*.sql
│ │ └── fct_*.sql
│ ├── marketing/
│ └── finance/
├── macros/
├── tests/
├── seeds/
├── snapshots/
└── analyses/
Data Modeling
Dimensional Modeling
Star Schema:
┌──────────────┐
│ dim_date │
└──────┬───────┘
│
┌──────────────┐ ┌──────┴───────┐ ┌──────────────┐
│ dim_customer │────│ fct_orders │────│ dim_product │
└──────────────┘ └──────┬───────┘ └──────────────┘
│
┌──────┴───────┐
│ dim_store │
└──────────────┘
Dimension Table Pattern:
-- models/marts/core/dim_customer.sql
WITH customers AS (
SELECT * FROM {{ ref('stg_crm__customers') }}
),
addresses AS (
SELECT * FROM {{ ref('stg_crm__addresses') }}
),
customer_orders AS (
SELECT
customer_id,
MIN(order_date) AS first_order_date,
MAX(order_date) AS most_recent_order_date,
COUNT(*) AS lifetime_orders,
SUM(order_amount) AS lifetime_value
FROM {{ ref('stg_orders__orders') }}
GROUP BY customer_id
),
final AS (
SELECT
customers.customer_id,
customers.customer_name,
customers.email,
customers.created_at,
addresses.city,
addresses.state,
addresses.country,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
customer_orders.lifetime_orders,
customer_orders.lifetime_value,
CASE
WHEN customer_orders.lifetime_value >= 10000 THEN 'platinum'
WHEN customer_orders.lifetime_value >= 5000 THEN 'gold'
WHEN customer_orders.lifetime_value >= 1000 THEN 'silver'
ELSE 'bronze'
END AS customer_tier
FROM customers
LEFT JOIN addresses
ON customers.address_id = addresses.address_id
LEFT JOIN customer_orders
ON customers.customer_id = customer_orders.customer_id
)
SELECT * FROM final
Fact Table Pattern:
-- models/marts/core/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
partition_by={'field': 'order_date', 'data_type': 'date'},
cluster_by=['customer_id', 'product_id']
)
}}
WITH orders AS (
SELECT * FROM {{ ref('stg_orders__orders') }}
{% if is_incremental() %}
WHERE order_date >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
),
order_items AS (
SELECT * FROM {{ ref('stg_orders__order_items') }}
),
final AS (
SELECT
orders.order_id,
orders.order_date,
orders.customer_id,
order_items.product_id,
orders.store_id,
order_items.quantity,
order_items.unit_price,
order_items.quantity * order_items.unit_price AS line_total,
orders.discount_amount,
orders.tax_amount,
orders.shipping_amount,
orders.total_amount
FROM orders
INNER JOIN order_items
ON orders.order_id = order_items.order_id
)
SELECT * FROM final
Staging Layer
-- models/staging/crm/stg_crm__customers.sql
WITH source AS (
SELECT * FROM {{ source('crm', 'customers') }}
),
renamed AS (
SELECT
-- Primary key
id AS customer_id,
-- Strings
TRIM(LOWER(name)) AS customer_name,
TRIM(LOWER(email)) AS email,
-- Dates
created_at::timestamp AS created_at,
updated_at::timestamp AS updated_at,
-- Booleans
is_active::boolean AS is_active,
-- Metadata
_fivetran_synced AS _loaded_at
FROM source
WHERE _fivetran_deleted = false
)
SELECT * FROM renamed
Source Configuration
# models/staging/crm/_crm__sources.yml
version: 2
sources:
- name: crm
description: Customer relationship management system
database: raw
schema: crm
loader: fivetran
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: customers
description: Customer master data
columns:
- name: id
description: Primary key
tests:
- unique
- not_null
- name: email
tests:
- unique
Data Testing
Test Types
# models/marts/core/_core__models.yml
version: 2
models:
- name: dim_customer
description: Customer dimension table
columns:
- name: customer_id
description: Primary key
tests:
- unique
- not_null
- name: email
tests:
- unique
- not_null
- name: customer_tier
tests:
- accepted_values:
values: ['platinum', 'gold', 'silver', 'bronze']
- name: lifetime_value
tests:
- dbt_utils.expression_is_true:
expression: ">= 0"
- name: fct_orders
description: Order fact table
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- order_id
- product_id
columns:
- name: customer_id
tests:
- relationships:
to: ref('dim_customer')
field: customer_id
Custom Tests
-- tests/assert_positive_amounts.sql
{% test positive_amount(model, column_name) %}
SELECT
{{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} < 0
{% endtest %}
-- tests/generic/assert_row_count_equal.sql
{% test row_count_equal(model, compare_model) %}
WITH source_count AS (
SELECT COUNT(*) AS cnt FROM {{ model }}
),
compare_count AS (
SELECT COUNT(*) AS cnt FROM {{ ref(compare_model) }}
)
SELECT *
FROM source_count
CROSS JOIN compare_count
WHERE source_count.cnt != compare_count.cnt
{% endtest %}
Macros and DRY Patterns
Common Macros
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
({{ column_name }} / 100.0)::decimal(18,2)
{% endmacro %}
-- macros/pivot_values.sql
{% macro pivot_values(column_name, values, alias_prefix='') %}
{% for value in values %}
SUM(CASE WHEN {{ column_name }} = '{{ value }}' THEN 1 ELSE 0 END)
AS {{ alias_prefix }}{{ value | lower | replace(' ', '_') }}
{% if not loop.last %},{% endif %}
{% endfor %}
{% endmacro %}
Incremental Patterns
-- macros/incremental_filter.sql
{% macro get_incremental_filter(column_name, lookback_days=3) %}
{% if is_incremental() %}
WHERE {{ column_name }} >= (
SELECT DATEADD(day, -{{ lookback_days }}, MAX({{ column_name }}))
FROM {{ this }}
)
{% endif %}
{% endmacro %}
Semantic Layer
Metric Definitions
# models/marts/core/_core__metrics.yml
version: 2
metrics:
- name: revenue
label: Total Revenue
model: ref('fct_orders')
description: Sum of all order amounts
calculation_method: sum
expression: total_amount
timestamp: order_date
time_grains: [day, week, month, quarter, year]
dimensions:
- customer_tier
- product_category
- store_region
filters:
- field: is_cancelled
operator: '='
value: 'false'
- name: average_order_value
label: Average Order Value
model: ref('fct_orders')
description: Average order amount
calculation_method: average
expression: total_amount
timestamp: order_date
time_grains: [day, week, month]
- name: customer_count
label: Customer Count
model: ref('dim_customer')
calculation_method: count_distinct
expression: customer_id
Exposures
# models/exposures.yml
version: 2
exposures:
- name: executive_dashboard
type: dashboard
maturity: high
url: https://tableau.company.com/views/executive
description: Executive KPI dashboard
depends_on:
- ref('fct_orders')
- ref('dim_customer')
- ref('dim_product')
owner:
name: Analytics Team
email: analytics@company.com
- name: marketing_report
type: notebook
maturity: medium
url: https://databricks.company.com/notebooks/marketing
depends_on:
- ref('fct_marketing_events')
- ref('dim_campaign')
owner:
name: Marketing Analytics
email: marketing-analytics@company.com
Performance Optimization
Materialization Strategy
| Layer | Materialization | Reason |
|---|---|---|
| Staging | View | Raw data, no aggregation |
| Intermediate | Ephemeral/View | Business logic, referenced multiple times |
| Marts (small) | Table | Final models, query performance |
| Marts (large) | Incremental | Large fact tables, efficiency |
Query Optimization
-- Before: Expensive window function on full table
SELECT
order_id,
customer_id,
order_date,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS running_total
FROM orders;
-- After: Pre-aggregate then join
WITH daily_totals AS (
SELECT
customer_id,
order_date,
SUM(amount) AS daily_amount
FROM orders
GROUP BY customer_id, order_date
),
running_totals AS (
SELECT
customer_id,
order_date,
SUM(daily_amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS running_total
FROM daily_totals
)
SELECT
o.order_id,
o.customer_id,
o.order_date,
rt.running_total
FROM orders o
JOIN running_totals rt
ON o.customer_id = rt.customer_id
AND o.order_date = rt.order_date;
Clustering and Partitioning
{{
config(
materialized='incremental',
unique_key='event_id',
partition_by={
'field': 'event_date',
'data_type': 'date',
'granularity': 'day'
},
cluster_by=['user_id', 'event_type']
)
}}
CI/CD Pipeline
GitHub Actions
# .github/workflows/dbt.yml
name: dbt CI/CD
on:
pull_request:
branches: [main]
push:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Install dependencies
run: pip install dbt-snowflake
- name: dbt deps
run: dbt deps
- name: dbt compile
run: dbt compile --target ci
- name: dbt test
run: dbt test --target ci
deploy:
needs: test
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: dbt run
run: dbt run --target prod
- name: dbt test
run: dbt test --target prod
Slim CI
# Only run modified models and downstream
dbt run --select state:modified+ --defer --state ./target-base
dbt test --select state:modified+ --defer --state ./target-base
Documentation
Model Documentation
# models/marts/core/_core__models.yml
version: 2
models:
- name: fct_orders
description: |
Order fact table containing one row per order line item.
## Business Logic
- Orders with status 'cancelled' are excluded
- Amounts are in USD
- Tax is calculated at time of order
## Usage
```sql
SELECT * FROM {{ ref('fct_orders') }}
WHERE order_date >= '2024-01-01'
```
## Dependencies
- stg_orders__orders
- stg_orders__order_items
Generate Docs
# Generate and serve documentation
dbt docs generate
dbt docs serve --port 8080
Reference Materials
references/modeling_patterns.md- Data modeling best practicesreferences/dbt_style_guide.md- SQL and dbt conventionsreferences/testing_guide.md- Testing strategiesreferences/optimization.md- Performance tuning
Scripts
# Model impact analyzer
python scripts/impact_analyzer.py --model dim_customer
# Schema change detector
python scripts/schema_diff.py --source prod --target dev
# Documentation generator
python scripts/doc_generator.py --format markdown
# Data quality scorer
python scripts/quality_scorer.py --model fct_orders
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Clinical Decision Support
Generate professional clinical decision support (CDS) documents for pharmaceutical and clinical research settings, including patient cohort analyses (biomarker-stratified with outcomes) and treatment recommendation reports (evidence-based guidelines with decision algorithms). Supports GRADE evidence grading, statistical analysis (hazard ratios, survival curves, waterfall plots), biomarker integration, and regulatory compliance. Outputs publication-ready LaTeX/PDF format optimized for drug develo
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
