name: bigquery-etl-core
description: The core skill for working within the bigquery-etl repository. Use this skill when understanding project structure, conventions, and common patterns. Works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.

BigQuery ETL Core

Composable: Foundation skill that works with the model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.
When to use: Understanding project structure, conventions, and common patterns, and finding schema descriptions during construction.

Project Overview

The bigquery-etl repository manages BigQuery table definitions, queries, and associated metadata for Mozilla. Like dbt, it pairs each query definition with its metadata and schema.

Each table/query typically consists of three files:

  • query.sql OR query.py - The query definition (SQL or Python)
  • metadata.yaml - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
  • schema.yaml - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

Note: Most tables use query.sql (~95%). Use query.py for API calls, multi-project queries, or complex Python operations. See query-writer skill for details.

🚨 REQUIRED READING - Start Here

When starting work in bigquery-etl, READ these foundational references:

  1. Naming Conventions: READ references/naming_conventions.md

    • Table naming patterns
    • Dataset organization
    • Version suffix conventions
  2. Dataset Organization: READ references/dataset_naming_conventions.md

    • Common dataset suffixes (_derived, _stable, _live)
    • When to use each dataset type
    • Dataset naming rules
  3. Schema Resources: READ references/discovery_resources.md

    • Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
    • Priority order for schema lookup during construction
    • Common mozfun UDFs
  4. Privacy Guidelines: READ references/privacy_guidelines.md

    • Data handling requirements
    • PII considerations
    • Workgroup access patterns

Directory Structure

sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml

See assets/directory_structure_example.txt for detailed examples.

Key principles:

  • Always flat: sql/{project}/{dataset}/{table_name}/
  • Never use subdirectories within table directories
  • Table names always include version suffix (_v1, _v2, etc.)

Schema & Description Resources for Construction

Finding Schema Descriptions

Priority order for schema lookup during construction:

  1. Local files first: Check existing schema.yaml and metadata.yaml files under sql/{project}/{dataset}/{table_name}/

    • Most reliable and up-to-date source
    • Contains field descriptions written by table owners
  2. Glean Dictionary: For _live and _stable tables

  3. ProbeInfo API: For Glean metric metadata

    • Endpoints: https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics
    • Provides metric definitions and descriptions programmatically
    • Use for validating metric references in queries
  4. DataHub MCP: Only as last resort

    • MUST READ references/datahub_best_practices.md BEFORE any DataHub queries
    • Use for schema lookup when not available in local files or Glean Dictionary
    • Extract ONLY necessary fields (column names, types, descriptions)
    • Use for downstream impact analysis when modifying tables

See references/discovery_resources.md for:

  • Detailed guidance on each schema source
  • ProbeInfo API endpoints and usage patterns
  • Glean Dictionary URL patterns for different products
  • DataHub MCP best practices for construction
  • Common mozfun UDFs
  • Key documentation links

Naming Conventions

Table Names:

  • Use snake_case with version suffix: clients_daily_event_v1
  • Common suffixes: _daily, _hourly, _aggregates, _summary

Field Names:

  • Use snake_case: submission_date, client_id, n_total_events
  • Prefix counts with n_: n_events, n_sessions
  • Standard Mozilla fields: submission_date, client_id, sample_id, normalized_channel, normalized_country_code, app_version
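
A minimal sketch of these field conventions in practice (the table and columns are hypothetical, chosen only to illustrate the naming rules above):

```sql
-- Hypothetical example: snake_case names, counts prefixed with n_,
-- table name carrying a _v1 version suffix.
SELECT
  submission_date,
  client_id,
  sample_id,
  normalized_channel,
  COUNT(*) AS n_events,
  COUNT(DISTINCT document_id) AS n_pings
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_events_v1`  -- hypothetical table
GROUP BY
  submission_date,
  client_id,
  sample_id,
  normalized_channel
```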

See references/naming_conventions.md for:

  • Complete naming patterns and conventions
  • Reserved/common patterns to avoid
  • BigQuery project naming conventions

Dataset Organization

See references/dataset_naming_conventions.md for:

  • Dataset naming patterns by suffix (_derived, _external, etc.)
  • Common dataset prefixes by product/source
  • Table versioning patterns
  • Incremental vs full refresh query patterns

Privacy & Data Handling

Mozilla follows strict data privacy policies:

  • No PII in derived tables
  • Use client-level identifiers (client_id) not individual identifiers
  • Respect data retention policies (~2 years for client-level data)
  • Label client-level tables with table_type: client_level in metadata.yaml

See references/privacy_guidelines.md for:

  • Key principles from Mozilla's data platform
  • Geo IP lookup and user agent parsing policies
  • Best practices for data handling
  • Deletion request support
  • Sample ID usage for sampling
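
To make the sampling point concrete: sample_id is an integer bucket from 0 to 99 derived from client_id, so filtering on a single value yields roughly a 1% sample of clients, which is useful when developing or validating client-level queries. A hedged sketch (the table and columns are assumptions):

```sql
-- Sketch: develop against ~1% of clients; drop the sample_id filter for production runs.
SELECT
  submission_date,
  client_id,
  COUNT(*) AS n_events
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_events_v1`  -- hypothetical
WHERE
  submission_date = @submission_date
  AND sample_id = 0  -- one of 100 buckets => ~1% of clients
GROUP BY
  submission_date,
  client_id
```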

BigQuery & Mozilla Conventions

Partitioning & Clustering

  • Most tables use day partitioning on submission_date
  • Clustering improves query performance for filtered/joined fields
  • See metadata-manager skill for detailed partitioning and clustering configuration
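
In practice, a scheduled query usually filters the partition column with the run-date parameter so each run reads and writes a single day. A sketch (the table is hypothetical; the partitioning configuration itself lives in metadata.yaml, see the metadata-manager skill):

```sql
-- Sketch: read only the partition for the scheduled run date.
SELECT
  submission_date,
  normalized_channel,
  COUNT(DISTINCT client_id) AS n_clients
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_daily_v1`  -- hypothetical day-partitioned table
WHERE
  submission_date = @submission_date
GROUP BY
  submission_date,
  normalized_channel
```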

Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:

  • mozfun.map.get_key() - Extract values from key-value maps
  • mozfun.norm.truncate_version() - Normalize version strings
  • mozfun.stats.mode_last() - Statistical mode calculation

UDF source code lives in the sql/mozfun/ directory.
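
A short sketch of how these functions appear in a query (the table, columns, and experiment slug are illustrative assumptions):

```sql
SELECT
  client_id,
  -- Pull a single value out of a key/value map column (ARRAY<STRUCT<key, value>>).
  mozfun.map.get_key(experiments, 'example-experiment-slug') AS experiment_branch,
  -- Keep only the major component of a dotted version string.
  mozfun.norm.truncate_version(app_version, 'major') AS app_major_version
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_clients_v1`  -- hypothetical
WHERE
  submission_date = @submission_date
```

mozfun.stats.mode_last() is typically paired with ARRAY_AGG() in a grouped query, for example mozfun.stats.mode_last(ARRAY_AGG(normalized_channel)) to pick each client's most common channel.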

Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

Key concepts:

  • Metric types: counter, boolean, string, event, etc.
  • Pings: Collections of metrics (e.g., baseline, events, metrics)
  • Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

Common Glean datasets in BigQuery:

  • Pattern: {app_id}.{ping_name} (e.g., org_mozilla_fenix.baseline)
  • All have auto-generated schemas based on metric definitions
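
For instance, a derived query over a Glean ping usually reads the user-facing view, derives submission_date from submission_timestamp, and reaches into the client_info record. A sketch (the aggregation is illustrative and exact fields vary by ping):

```sql
-- Sketch of querying a Glean ping table.
SELECT
  DATE(submission_timestamp) AS submission_date,
  client_info.client_id AS client_id,
  normalized_channel,
  COUNT(*) AS n_baseline_pings
FROM
  `moz-fx-data-shared-prod.org_mozilla_fenix.baseline`
WHERE
  DATE(submission_timestamp) = @submission_date
GROUP BY
  submission_date,
  client_id,
  normalized_channel
```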

See references/glean_overview.md for:

  • What is Glean and how it differs from Firefox Desktop Telemetry
  • Glean SDK and metric type details
  • Common Glean datasets in BigQuery
  • When to use Glean Dictionary

bigquery-etl CLI Commands

See references/bqetl_cli_commands.md for:

  • Key bqetl CLI commands for query creation, validation, schema updates
  • How to find the right DAG for scheduling
  • Backfill creation commands

Best Practices

General principles:

  • Always include field descriptions in schema.yaml (see metadata-manager skill)
  • Add header comments explaining query purpose (see query-writer skill)
  • Reference bug/ticket numbers for context
  • Document any data exclusions or filtering logic
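
A minimal sketch of what this looks like at the top of a query (the header wording, bug number, and table are hypothetical; the repository's canonical layout is in assets/query_structure_example.sql, referenced below):

```sql
-- Daily per-client event counts for Product X (hypothetical example).
-- Context: bug 0000000 (placeholder reference).
-- Exclusions: rows with a NULL client_id are dropped because they cannot be
-- attributed to a client.
SELECT
  submission_date,
  client_id,
  COUNT(*) AS n_events
FROM
  `moz-fx-data-shared-prod.example_derived.events_v1`  -- hypothetical
WHERE
  submission_date = @submission_date
  AND client_id IS NOT NULL
GROUP BY
  submission_date,
  client_id
```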

See assets/query_structure_example.sql for standard query structure.

Version migration:

  • Create new _v2 table when making breaking schema changes
  • Keep _v1 running during migration period
  • Update views to point to new version
  • Coordinate with downstream consumers before deprecating old version
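
Concretely, pointing the user-facing view at the new version is often a small change like the following (a hypothetical sketch; names are placeholders and, in this repository, view definitions typically live in per-view view.sql files):

```sql
-- Hypothetical view definition: route consumers to the new table version.
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.example_dataset.clients_summary`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.example_dataset_derived.clients_summary_v2`  -- previously ..._v1
```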

For detailed best practices, see the references listed under Bundled Resources below.

Integration with Other Skills

bigquery-etl-core serves as the foundation skill that other skills build upon:

Works with model-requirements

  • Provides naming conventions for new tables and datasets
  • Supplies common field naming patterns for requirements gathering
  • Offers privacy guidelines for data model planning

Works with query-writer

  • Provides project structure and naming conventions
  • Supplies common patterns and mozfun UDF references
  • Offers schema description lookup guidance for construction

Works with metadata-manager

  • Provides DAG naming patterns and scheduling conventions
  • Supplies partitioning and clustering best practices
  • Offers ownership and labeling patterns

Works with sql-test-generator

  • Provides test structure and fixture naming conventions
  • Supplies common table patterns for test creation
  • Offers query parameter conventions

Works with bigconfig-generator

  • Provides table naming conventions for Bigeye monitoring configuration
  • Supplies dataset organization patterns
  • Offers field naming standards for data quality checks

This skill is always available and does not need to be explicitly invoked; it provides foundational knowledge that other skills reference.

Reference Examples

Real query examples in the repository:

  • Simple query: sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql
  • Aggregation with GROUP BY: sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql
  • Complex query with CTEs: sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql
  • Python ETL (INFORMATION_SCHEMA): sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py
  • Python ETL (External API): sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py

For more examples, explore the sql/moz-fx-data-shared-prod/ directory.

Bundled Resources

References

  • references/discovery_resources.md - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
  • references/naming_conventions.md - Complete naming patterns for tables, fields, and projects
  • references/dataset_naming_conventions.md - Dataset organization and versioning patterns
  • references/privacy_guidelines.md - Mozilla data privacy policies and best practices
  • references/glean_overview.md - Glean SDK concepts and BigQuery dataset structures
  • references/bqetl_cli_commands.md - Key CLI commands and DAG discovery

DataHub Usage (CRITICAL for Token Efficiency)

BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:

  • READ references/datahub_best_practices.md - Comprehensive token optimization strategies
  • Follow priority order: local files → documentation → DataHub (only as last resort)
  • Use search-first patterns and extract minimal fields from responses

Assets

  • assets/query_structure_example.sql - Standard query.sql structure with common patterns
  • assets/directory_structure_example.txt - File organization examples
