Data Pipeline Engineer

by erichowens

Expert data engineer for ETL/ELT pipelines, streaming, data warehousing. Activate on: data pipeline, ETL, ELT, data warehouse, Spark, Kafka, Airflow, dbt, data modeling, star schema, streaming data, batch processing, data quality. NOT for: API design (use api-architect), ML training (use ML skills), dashboards (use design skills).

name: data-pipeline-engineer
description: "Expert data engineer for ETL/ELT pipelines, streaming, data warehousing. Activate on: data pipeline, ETL, ELT, data warehouse, Spark, Kafka, Airflow, dbt, data modeling, star schema, streaming data, batch processing, data quality. NOT for: API design (use api-architect), ML training (use ML skills), dashboards (use design skills)."
allowed-tools: Read,Write,Edit,Bash(dbt:*,spark-submit:*,airflow:*,python:*)
category: Data & Analytics
tags:
  • etl
  • spark
  • kafka
  • airflow
  • data-warehouse
pairs-with:
  • skill: api-architect
    reason: APIs that consume pipeline data
  • skill: devops-automator
    reason: Orchestrate pipeline infrastructure

Data Pipeline Engineer

Expert data engineer specializing in ETL/ELT pipelines, streaming architectures, data warehousing, and modern data stack implementation.

Quick Start

  1. Identify sources - data formats, volumes, freshness requirements
  2. Choose architecture - Medallion (Bronze/Silver/Gold), Lambda, or Kappa
  3. Design layers - staging → intermediate → marts (dbt pattern)
  4. Add quality gates - Great Expectations or dbt tests at each layer
  5. Orchestrate - Airflow DAGs with sensors and retries
  6. Monitor - lineage, freshness, anomaly detection

Core Capabilities

Capability | Technologies | Key Patterns
Batch Processing | Spark, dbt, Databricks | Incremental, partitioning, Delta/Iceberg
Stream Processing | Kafka, Flink, Spark Streaming | Watermarks, exactly-once, windowing
Orchestration | Airflow, Dagster, Prefect | DAG design, sensors, task groups
Data Modeling | dbt, SQL | Kimball, Data Vault, SCD
Data Quality | Great Expectations, dbt tests | Validation suites, freshness

Architecture Patterns

Medallion Architecture (Recommended)

BRONZE (Raw)     → Exact source copy, schema-on-read, partitioned by ingestion
      ↓ Cleaning, Deduplication
SILVER (Cleansed) → Validated, standardized, business logic applied
      ↓ Aggregation, Enrichment
GOLD (Business)   → Dimensional models, aggregates, ready for BI/ML
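The layer flow above can be sketched without any framework. This is a minimal illustration in plain Python, assuming hypothetical order records with `order_id`, `customer`, and `amount` fields; a real pipeline would do the same steps in Spark or dbt against Delta/Iceberg tables.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Bronze -> Silver: validate, deduplicate, and standardize raw rows."""
    seen = set()
    silver = []
    for row in bronze_rows:
        order_id = row.get("order_id")
        amount = row.get("amount")
        # Drop rows that fail basic validation instead of guessing values.
        if order_id is None or amount is None:
            continue
        # Deduplicate on the business key; the first occurrence wins.
        if order_id in seen:
            continue
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            "customer": str(row["customer"]).strip().lower(),
            "amount": float(amount),
        })
    return silver


def to_gold(silver_rows):
    """Silver -> Gold: aggregate into a BI-ready total per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


# Bronze keeps the source rows exactly as received (hypothetical sample data).
bronze = [
    {"order_id": 1, "customer": " Alice ", "amount": "10.50"},
    {"order_id": 1, "customer": " Alice ", "amount": "10.50"},  # duplicate
    {"order_id": 2, "customer": "BOB", "amount": "4"},
    {"order_id": 3, "customer": "alice", "amount": None},       # invalid
    {"order_id": 4, "customer": "alice", "amount": "2.25"},
]
silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze is never modified, both functions can be re-run at any time to rebuild Silver and Gold after a logic change.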

Lambda vs Kappa

  • Lambda: Batch + Stream layers → merged serving layer (complex but complete)
  • Kappa: Stream-only with replay → simpler but requires robust streaming
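The "replay" idea behind Kappa can be illustrated with a toy event log. This is only a sketch, assuming a hypothetical log of `(account, delta)` events held in a Python list; in practice the log would be a retained Kafka topic and the view a table or state store.

```python
def apply_event(view, event):
    """Fold one (account, delta) event into the running balance view."""
    account, delta = event
    view[account] = view.get(account, 0) + delta


def replay(log):
    """Rebuild the view from scratch by reprocessing the retained log."""
    view = {}
    for event in log:
        apply_event(view, event)
    return view


log = [("a", 5), ("b", 2), ("a", -1)]

# Normal streaming path: events are applied as they arrive.
live_view = {}
for event in log:
    apply_event(live_view, event)

# Recovery or logic-change path: the same code reprocesses the whole log.
rebuilt_view = replay(log)
```

The same processing function serves both paths, which is why Kappa avoids a separate batch layer but depends on the log being retained long enough to replay.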

Reference Examples

Full implementation examples in ./references/:

File | Description
dbt-project-structure.md | Complete dbt layout with staging, intermediate, marts
airflow-dag.py | Production DAG with sensors, task groups, quality checks
spark-streaming.py | Kafka-to-Delta processor with windowing
great-expectations-suite.json | Comprehensive data quality expectation suite

Anti-Patterns (10 Critical Mistakes)

1. Full Table Refreshes

Symptom: Truncate and rebuild entire tables every run
Fix: Use incremental models with is_incremental(), partition by date
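The incremental idea can be sketched as a high-water-mark merge. This is a framework-free illustration, not dbt's implementation: the `id` and `updated_at` column names are assumptions, and timestamps are ISO-8601 strings so they compare correctly as text.

```python
def incremental_load(target, source_rows, key="id", cursor="updated_at"):
    """Upsert only the source rows newer than the target's high-water mark.

    `target` is a dict keyed by primary key; returns the number of rows merged.
    """
    # The high-water mark is the newest cursor value already loaded.
    high_water = max((row[cursor] for row in target.values()), default="")
    new_rows = [row for row in source_rows if row[cursor] > high_water]
    for row in new_rows:
        target[row[key]] = row  # insert new keys, overwrite changed ones
    return len(new_rows)


target = {1: {"id": 1, "updated_at": "2024-01-01", "status": "new"}}
source = [
    {"id": 1, "updated_at": "2024-01-01", "status": "new"},      # already loaded
    {"id": 1, "updated_at": "2024-01-03", "status": "shipped"},  # changed row
    {"id": 2, "updated_at": "2024-01-02", "status": "new"},      # new row
]
first_run = incremental_load(target, source)
second_run = incremental_load(target, source)  # nothing left to process
```

One caveat of the strict `>` comparison: rows that arrive later with a cursor value equal to the current high-water mark are skipped, so real pipelines often reprocess a small lookback window.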

2. Tight Coupling to Source Schemas

Symptom: Pipeline breaks when upstream adds/removes columns
Fix: Explicit source contracts, select only needed columns in staging
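A source contract can be as simple as an explicit column-to-type mapping applied in staging. The sketch below assumes a hypothetical `orders` source with `order_id` and `amount` columns; the names are illustrative only.

```python
# Hypothetical contract: the only columns staging depends on, with their casts.
ORDERS_CONTRACT = {"order_id": int, "amount": float}


def stage_row(raw, contract=ORDERS_CONTRACT):
    """Select and cast only contracted columns; fail loudly if one is missing."""
    missing = [col for col in contract if col not in raw]
    if missing:
        raise ValueError(f"source contract violated, missing columns: {missing}")
    # Extra upstream columns are ignored, so additions do not break the pipeline.
    return {col: cast(raw[col]) for col, cast in contract.items()}


staged = stage_row({"order_id": "7", "amount": "3.5", "new_upstream_col": "x"})
```

Added columns pass through harmlessly, while a removed column fails at the staging boundary with a clear message instead of deep inside a downstream model.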

3. Monolithic DAGs

Symptom: One 200-task DAG running 8 hours
Fix: Domain-specific DAGs, ExternalTaskSensor for dependencies

4. No Data Quality Gates

Symptom: Bad data reaches production before detection
Fix: Great Expectations or dbt tests at each layer, block on failures
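"Block on failures" means the gate raises rather than logs. The following is a minimal sketch of that behavior in plain Python, not the Great Expectations or dbt API; the check names and the `amount` field are assumptions.

```python
class QualityGateError(Exception):
    """Raised when a layer's checks fail, so downstream steps never run."""


def run_quality_gate(rows, checks):
    """Run every named check; raise if any fails, else pass rows through."""
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        raise QualityGateError(f"blocked by failed checks: {failures}")
    return rows


# Hypothetical checks for a Silver-layer table with an `amount` column.
silver_checks = {
    "not_empty": lambda rows: len(rows) > 0,
    "amount_not_null": lambda rows: all(r.get("amount") is not None for r in rows),
    "amount_non_negative": lambda rows: all(
        r["amount"] >= 0 for r in rows if r.get("amount") is not None
    ),
}

good_rows = [{"amount": 10.0}, {"amount": 0.0}]
passed = run_quality_gate(good_rows, silver_checks)
```

Calling the gate between layers (for example before publishing Gold) keeps a failed check from becoming a stakeholder-visible incident.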

5. Processing Before Archiving

Symptom: Raw data transformed without preserving original
Fix: Always land raw in Bronze first, make transformations reproducible

6. Hardcoded Dates in Queries

Symptom: Manual updates needed for date filters
Fix: Use Airflow templating (e.g., the {{ ds }} variable) or dynamic date functions
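The same principle applies outside Airflow: the run's logical date is passed in as a parameter, never written into the SQL. A small sketch, assuming a hypothetical `events` table with an `event_date` column and a DB-API driver that accepts `%(name)s` placeholders:

```python
from datetime import date, timedelta


def build_daily_query(logical_date: date) -> tuple[str, dict]:
    """Return a parameterized query covering exactly one logical day."""
    sql = (
        "SELECT * FROM events "
        "WHERE event_date >= %(start)s AND event_date < %(end)s"
    )
    params = {
        "start": logical_date.isoformat(),
        # Half-open interval: the end bound is the next day, exclusive.
        "end": (logical_date + timedelta(days=1)).isoformat(),
    }
    return sql, params


sql, params = build_daily_query(date(2024, 2, 29))
```

Because the date is an argument, a backfill is just a loop over past dates calling the same function.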

7. Missing Watermarks in Streaming

Symptom: Unbounded state growth, OOM in long-running jobs
Fix: Add withWatermark() so late-arriving data is bounded and state for old windows can be dropped
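Why a watermark bounds state can be shown with a toy event-time counter. This is a simplified sketch in plain Python, not Spark's implementation (Spark advances the watermark per micro-batch); the class name and the integer-second timestamps are assumptions.

```python
class WindowedCounter:
    """Counts events per tumbling window, evicting state behind the watermark."""

    def __init__(self, window_seconds, allowed_lateness):
        self.window = window_seconds
        self.lateness = allowed_lateness
        self.max_event_time = 0
        self.state = {}    # window_start -> count, open windows only
        self.emitted = {}  # window_start -> final count

    def process(self, event_time):
        """Return True if the event was counted, False if it was too late."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.lateness
        window_start = event_time - event_time % self.window
        # An event is accepted only while its window still ends after the watermark.
        accepted = window_start + self.window > watermark
        if accepted:
            self.state[window_start] = self.state.get(window_start, 0) + 1
        # Finalize and evict every window that ends at or before the watermark.
        for start in [s for s in self.state if s + self.window <= watermark]:
            self.emitted[start] = self.state.pop(start)
        return accepted


counter = WindowedCounter(window_seconds=10, allowed_lateness=5)
results = [counter.process(t) for t in [1, 3, 12, 4, 27, 8]]
```

The event at time 4 is late but still inside the allowed lateness, so it is counted; the event at time 8 arrives after its window was finalized and is dropped. Without the eviction step, `state` would grow for as long as the job runs.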

8. No Retry/Backoff Strategy

Symptom: Transient failures cause DAG failures
Fix: Set retries=3 and retry_exponential_backoff=True, and cap the wait with max_retry_delay
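The effect of those settings can be sketched with a small retry helper. This illustrates capped exponential backoff in general and is not a reproduction of Airflow's internal delay calculation; the function and parameter names are hypothetical.

```python
import time


def backoff_delays(retries, base_delay, max_delay):
    """Delay before each retry: base * 2^attempt, capped at max_delay."""
    return [min(base_delay * 2 ** attempt, max_delay) for attempt in range(retries)]


def run_with_retries(task, retries=3, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call task(); on failure wait with exponential backoff, then retry."""
    delays = backoff_delays(retries, base_delay, max_delay)
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])


# Demo: a task that fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


waits = []  # record delays instead of actually sleeping
outcome = run_with_retries(flaky_task, retries=3, base_delay=1.0, sleep=waits.append)
```

The cap matters for long retry chains: without it, the fifth retry of a 10-second base delay would already wait 160 seconds.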

9. Undocumented Data Lineage

Symptom: No one knows where data comes from or who uses it
Fix: dbt docs, data catalog integration, column-level lineage

10. Testing Only in Production

Symptom: Bugs discovered by stakeholders, not engineers
Fix: Run dbt against a dev target (e.g., dbt build --target dev), use sample datasets, add CI/CD for models

Quality Checklist

Pipeline Design:

  • Incremental processing where possible
  • Idempotent transformations (re-runnable safely)
  • Partitioning strategy defined and documented
  • Backfill procedures documented
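Idempotency and partitioning work together: if each run overwrites exactly the partition it owns, re-running is safe. A minimal sketch, assuming a table modeled as a dict of date partitions (the function names are hypothetical):

```python
def load_partition_overwrite(table, partition, rows):
    """Idempotent: replace the whole partition with this run's output."""
    table[partition] = list(rows)


def load_partition_append(table, partition, rows):
    """Not idempotent: every re-run adds the same rows again."""
    table.setdefault(partition, []).extend(rows)


rows = [{"id": 1}, {"id": 2}]
safe_table, unsafe_table = {}, {}

# Simulate a retry or backfill by running the same load twice.
for _ in range(2):
    load_partition_overwrite(safe_table, "2024-01-01", rows)
    load_partition_append(unsafe_table, "2024-01-01", rows)
```

After two runs the overwritten partition still holds two rows, while the appended one holds four duplicates, which is the failure mode that makes retries and backfills dangerous.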

Data Quality:

  • Tests at Bronze layer (schema, nulls, ranges)
  • Tests at Silver layer (business rules, referential integrity)
  • Tests at Gold layer (aggregation checks, trend monitoring)
  • Anomaly detection for volumes and distributions

Orchestration:

  • Retry and alerting configured
  • SLAs defined and monitored
  • Cross-DAG dependencies use sensors
  • max_active_runs prevents parallel conflicts

Operations:

  • Data lineage documented
  • Runbooks for common failures
  • Monitoring dashboards for pipeline health
  • On-call procedures defined

Validation Script

Run ./scripts/validate-pipeline.sh to check:

  • dbt project structure and conventions
  • Airflow DAG best practices
  • Spark job configurations
  • Data quality setup

Skill Information

Category: Creative
Allowed Tools: Read,Write,Edit,Bash(dbt:*,spark-submit:*,airflow:*,python:*)
Last Updated: 12/26/2025