---
name: principal-data-engineer
description: Expert-level guidance on data platform architecture, pipeline design patterns, and engineering rigor. Use when designing data platforms, reviewing Airflow DAGs, working with Polars/DuckDB/dbt, establishing data quality contracts, implementing composable data stacks, or architecting lakehouse solutions with Iceberg.
license: Proprietary. See parent repository LICENSE
---

Principal Data Engineer

Overview

This skill provides the strategic and technical depth expected of a Principal Data Engineer. It moves beyond "making it work" to "making it scale, endure, and deliver value." Use this skill for architectural decisions, high-stakes code reviews, and establishing robust engineering patterns.

Core Capabilities

1. Data Platform Architecture

Focus on the "-ilities": Scalability, Reliability, Maintainability, and Observability.

  • Design for Failure: Assume every component will fail. Build in retries, dead-letter queues, and circuit breakers from the start.
  • Idempotency: All pipelines must be re-runnable without side effects (see the sketch after this list).
  • Decoupling: Separate compute from storage; separate orchestration (Airflow) from execution (Spark/Snowflake/dbt).
  • Cost Awareness: Design schemas and compute usage (e.g., partition strategies) to minimize cost at scale.
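
A minimal sketch of an idempotent, partition-scoped load, assuming a DuckDB warehouse with an events table partitioned by event_date (all names are illustrative): each run deletes and reloads exactly one partition inside a transaction, so re-running for the same date cannot duplicate data.

```python
import duckdb


def load_partition(con: duckdb.DuckDBPyConnection, ds: str, source_path: str) -> None:
    """Delete-then-insert scoped to one logical date; safe to re-run."""
    con.execute("BEGIN TRANSACTION")
    # Drop any rows left behind by a previous (possibly partial) run.
    con.execute("DELETE FROM events WHERE event_date = ?", [ds])
    # Reload the partition from the raw files for that date. The path is
    # interpolated for brevity here; validate or allow-list it in real code.
    con.execute(
        f"INSERT INTO events SELECT * FROM read_parquet('{source_path}') "
        "WHERE event_date = ?",
        [ds],
    )
    con.execute("COMMIT")
```

Because the unit of work is one partition, a backfill is just a loop over dates, and a crashed run leaves no half-written state behind.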

2. Pipeline Engineering Standards

Enforce strict standards for Airflow and Python code.

  • No Top-Level Code: Strictly adhere to Airflow best practices to prevent scheduler overload.
  • Idempotency: All DAG tasks must be re-runnable without side effects or data duplication.
  • Atomic Tasks: Each task should do one thing. If it fails, it should be clear what failed.
  • Functional Patterns: Prefer clear inputs and outputs over shared global state.
  • TaskFlow API: Use Airflow 2.0+ decorator syntax (@dag, @task) for clarity and type safety; a sketch combining these standards follows this list.
  • Testing:
    • Unit: Test transform logic in isolation.
    • Integration: Test DAG integrity and component connectivity with DagBag.
    • Data Quality: Validate data "in-flight" (pre-condition/post-condition checks).
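
A minimal sketch combining these standards with the TaskFlow API (Airflow 2.4+ for the schedule argument; fetch_orders and upsert_partition are hypothetical helpers standing in for your source and sink):

```python
from datetime import datetime

from airflow.decorators import dag, task


def fetch_orders(interval_start):
    # Hypothetical source; replace with real extraction I/O.
    return [{"order_id": 1, "amount": 42.0}]


def upsert_partition(rows, key):
    # Hypothetical sink; replace with an idempotent partition overwrite.
    ...


@task
def extract(data_interval_start=None) -> list[dict]:
    # Airflow injects declared context variables like data_interval_start,
    # so extraction is scoped to the interval and backfills come for free.
    return fetch_orders(data_interval_start)


@task
def transform(rows: list[dict]) -> list[dict]:
    # Pure function with explicit inputs and outputs: unit-testable in isolation.
    return [r for r in rows if r.get("amount", 0) > 0]


@task
def load(rows: list[dict], data_interval_start=None) -> None:
    # Idempotent: replace this interval's partition instead of appending.
    upsert_partition(rows, key=data_interval_start)


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    load(transform(extract()))


orders_pipeline()
```

Nothing but definitions runs at module import time, and DAG integrity can be covered by a DagBag test that asserts import_errors is empty.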

Critical Anti-Patterns to Avoid:

  • ❌ Top-level code execution (runs on every scheduler loop)
  • ❌ Non-idempotent operations (append without deduplication)
  • ❌ Direct metadata database access (use Airflow's public API)
  • ❌ Hardcoded credentials (use Airflow Connections and Variables)
  • ❌ Excessive dynamic DAG generation (use dynamic task mapping instead, as sketched below)
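
For the last point, a minimal dynamic task mapping sketch (Airflow 2.3+; paths are illustrative): the DAG shape stays static and cheap to parse, while the fan-out happens at runtime.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def fan_out():
    @task
    def list_files() -> list[str]:
        # Work items are discovered at runtime, not at DAG parse time.
        return ["s3://bucket/a.parquet", "s3://bucket/b.parquet"]  # illustrative

    @task
    def process(path: str) -> None:
        print(f"processing {path}")

    # expand() creates one task instance per element of the upstream list,
    # replacing loops that stamp out DAGs or tasks on every scheduler parse.
    process.expand(path=list_files())


fan_out()
```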

See airflow-best-practices.md for comprehensive patterns and examples.

3. Data Quality & Observability

Quality is not an afterthought; it is a pipeline dependency.

  • Data Contracts: Use datacontract-cli with ODCS to define explicit contracts between producers and consumers.
  • Data Quality Checks: Use Soda for declarative data quality validation integrated into pipelines (see the sketch after this list).
  • SLA/SLO Monitoring: Alert not just on failure, but on lateness (missing SLAs).
  • Data Lineage: Ensure transformations are traceable from source to sink (OpenLineage, dbt docs).
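
A minimal in-pipeline quality gate with Soda Core, assuming a data source named warehouse defined in configuration.yml and an orders table (all names are illustrative): the scan fails the surrounding task when a check fails, so bad data never reaches consumers.

```python
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("warehouse")  # defined in configuration.yml
scan.add_configuration_yaml_file("configuration.yml")
# Declarative SodaCL checks: pre-/post-condition assertions on the data.
scan.add_sodacl_yaml_str(
    """
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
"""
)
scan.execute()
scan.assert_no_checks_fail()  # raises on failure, failing the pipeline task
```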

4. Composable Data Stack

Leverage the composable data stack: swap any component without rewriting the entire pipeline.

  • Ingestion: Prefer dlt for robust, schema-aware ELT. Use dlt init <source> <destination> to scaffold pipelines.
  • Processing: Default to DuckDB, Polars, and Apache Arrow for single-node processing (faster/cheaper than Spark for small/medium data).
  • Embedded OLAP: Use DuckDB for local development, testing, and file-based querying (S3/Parquet).
  • Portable Code: Use Ibis to decouple transformation logic from execution engines (run on DuckDB locally, Snowflake in prod); see the sketch after this list.
  • Open Table Format: Apache Iceberg for lakehouse architectures: schema evolution, time travel, and partition evolution.
  • Transformation: dbt for SQL-first transformations with built-in testing and documentation.
  • Data Contracts: datacontract-cli with Open Data Contract Standard (ODCS) for producer/consumer agreements.
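
A minimal end-to-end sketch of this composability, assuming dlt and Ibis with a local DuckDB file (pipeline, dataset, and table names are illustrative): dlt handles schema-aware ingestion, and the Ibis expression can later target Snowflake by swapping only the backend connection.

```python
import dlt
import ibis

# Ingestion: dlt infers and evolves the schema while loading into DuckDB.
pipeline = dlt.pipeline(
    pipeline_name="orders",  # default DuckDB file: orders.duckdb
    destination="duckdb",
    dataset_name="raw",
)
pipeline.run(
    [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 9.5}],
    table_name="orders",
)

# Transformation: Ibis expressions are engine-agnostic; only the connect()
# call changes between local DuckDB and a production warehouse.
con = ibis.duckdb.connect("orders.duckdb")
orders = con.table("orders", database="raw")
print(orders.aggregate(total=orders.amount.sum()).to_pandas())
```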

Usage Guidelines

When to Use

  • Architectural Reviews: "Review this proposed architecture for the new streaming platform."
  • Complex Debugging: "The scheduler is lagging, and tasks are getting stuck. Help diagnose."
  • Standard Setting: "Create a template for a standardized ingestion pipeline."

Key Questions to Ask

  • "Is this pipeline idempotent? What happens if I run it twice?"
  • "How do we backfill historical data with this design?"
  • "What is the recovery time objective (RTO) for this dataset?"

Resources

  • references/
  • scripts/
