
---
name: schema-mapper
description: Database schema understanding and relationship mapping. Use when exploring unfamiliar databases, documenting table relationships, identifying join paths, or generating ERD documentation for existing schemas.
---

Schema Mapper

Quick Start

Automatically discover, document, and visualize database schemas, including tables, columns, relationships, and join paths. Essential for understanding unfamiliar databases or creating documentation.

Context Requirements

Before mapping the schema, I need:

  1. Database Access: Connection details or schema export
  2. Scope: Which tables/schemas to map (or all)
  3. Documentation Goal: What you need (ERD, join paths, data dictionary, etc.)
  4. Known Relationships (optional): Explicit foreign keys or implicit relationships

Context Gathering

For Database Access:

"I can map your schema from:

Option 1 - Direct Database Connection:

connection_details = {
    'host': 'your-db.example.com',
    'database': 'production',
    'user': 'readonly_user',
    'password': '***',
    'type': 'postgresql'  # or mysql, snowflake, bigquery, redshift
}

Option 2 - Schema Export:

-- PostgreSQL:
SELECT table_schema, table_name, column_name, data_type, 
       is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;

-- Also export constraints:
SELECT * FROM information_schema.table_constraints;
SELECT * FROM information_schema.key_column_usage;

Option 3 - dbt Project: Share your dbt_project.yml and models/ directory, and I'll extract the schema from your dbt docs (see the sketch after these options)

Option 4 - Existing Documentation: Share any ERDs, data dictionaries, or schema docs you have"
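For Option 3, here is a minimal sketch of pulling model and column metadata out of a compiled dbt manifest. The target/manifest.json path and the prior `dbt docs generate` (or `dbt compile`) run are assumptions, and data types only appear for columns documented in YAML:

import json

# Sketch: read model/column metadata from a compiled dbt manifest.
# Assumes `dbt docs generate` has produced target/manifest.json.
with open('target/manifest.json') as f:
    manifest = json.load(f)

for node_id, node in manifest['nodes'].items():
    if node['resource_type'] != 'model':
        continue
    print(f"{node['schema']}.{node['name']}")
    for col_name, col in node.get('columns', {}).items():
        print(f"  - {col_name}: {col.get('data_type') or 'undocumented'}")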

For Scope:

"Should I map:

  • All tables in the database?
  • Specific schema (e.g., 'public', 'analytics', 'staging')?
  • Specific tables (list the important ones)?
  • Tables matching pattern (e.g., all 'fct_' and 'dim_' tables)?"

For Documentation Goal:

"What do you need from this schema mapping?

Common Goals:

  1. Visual ERD - See table relationships graphically
  2. Join Path Finder - How to join Table A to Table B
  3. Data Dictionary - Complete catalog of tables and columns
  4. Lineage Map - Understand data flow through tables
  5. Quick Reference - Cheat sheet for common joins

Which would be most useful?"

For Relationships:

"I'll auto-detect relationships from:

  • Foreign key constraints
  • Column name patterns (id, user_id, etc.)
  • Naming conventions

If you have implicit relationships (not enforced by FK constraints), let me know:

  • Example: 'orders.customer_email relates to customers.email but no FK exists'"
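If the user confirms implicit relationships like this, they can be registered up front in the same dict shape the inference step (Step 3) produces and merged into the detected list later. A sketch, using the hypothetical email example above:

# User-confirmed relationships with no FK constraint in the database.
# Same shape as the auto-detected ones, so they merge cleanly later:
#   relationships += manual_relationships
manual_relationships = [
    {
        'type': 'user_confirmed',
        'from_table': 'orders',
        'from_column': 'customer_email',  # hypothetical example columns
        'to_table': 'customers',
        'to_column': 'email',
        'confidence': 'high'
    },
]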

Workflow

Step 1: Connect and Discover Schema

import pandas as pd
from sqlalchemy import create_engine, inspect

# Build a SQLAlchemy URL from the connection details gathered earlier
# (the 'type' value doubles as the dialect name for common databases)
connection_string = (
    f"{connection_details['type']}://{connection_details['user']}:"
    f"{connection_details['password']}@{connection_details['host']}"
    f"/{connection_details['database']}"
)

# Connect to database
engine = create_engine(connection_string)
inspector = inspect(engine)

# Get all schemas
schemas = inspector.get_schema_names()
print(f"šŸ“š Schemas found: {schemas}")

# Get tables in target schema
tables = inspector.get_table_names(schema='public')
print(f"šŸ“Š Tables in 'public': {len(tables)}")
for table in sorted(tables):
    print(f"  - {table}")

Checkpoint: "Found {N} tables in schema. Does this look right? Any tables missing or unexpected?"

Step 2: Extract Table Metadata

def extract_table_metadata(engine, schema, table_name):
    """Extract comprehensive metadata for a single table"""
    inspector = inspect(engine)
    
    # Get columns
    columns = inspector.get_columns(table_name, schema=schema)
    
    # Get primary keys
    pk = inspector.get_pk_constraint(table_name, schema=schema)
    pk_columns = pk.get('constrained_columns', [])
    
    # Get foreign keys
    fks = inspector.get_foreign_keys(table_name, schema=schema)
    
    # Get indexes
    indexes = inspector.get_indexes(table_name, schema=schema)
    
    # Get unique constraints
    unique = inspector.get_unique_constraints(table_name, schema=schema)
    
    return {
        'table_name': table_name,
        'schema': schema,
        'columns': columns,
        'primary_keys': pk_columns,
        'foreign_keys': fks,
        'indexes': indexes,
        'unique_constraints': unique
    }

# Extract metadata for all tables
schema_metadata = {}
for table in tables:
    schema_metadata[table] = extract_table_metadata(engine, 'public', table)
    print(f"āœ“ Extracted metadata for {table}")

Step 3: Infer Relationships

def infer_relationships(schema_metadata):
    """
    Infer relationships from:
    1. Explicit foreign keys
    2. Column naming patterns (user_id → users.id)
    3. Common patterns (created_by → users.id)
    """
    relationships = []
    
    for table_name, metadata in schema_metadata.items():
        # Explicit FKs
        for fk in metadata['foreign_keys']:
            relationships.append({
                'type': 'explicit_fk',
                'from_table': table_name,
                'from_column': fk['constrained_columns'][0],
                'to_table': fk['referred_table'],
                'to_column': fk['referred_columns'][0],
                'confidence': 'high'
            })
        
        # Inferred from naming patterns
        for col in metadata['columns']:
            col_name = col['name']
            
            # Pattern: table_id → table.id
            if col_name.endswith('_id') and col_name != 'id':
                potential_table = col_name[:-3] + 's'  # user_id → users (naive pluralization)
                if potential_table in schema_metadata:
                    relationships.append({
                        'type': 'inferred_naming',
                        'from_table': table_name,
                        'from_column': col_name,
                        'to_table': potential_table,
                        'to_column': 'id',
                        'confidence': 'medium'
                    })
    
    return relationships

relationships = infer_relationships(schema_metadata)
print(f"šŸ”— Found {len(relationships)} relationships")

Step 4: Generate Data Dictionary

def generate_data_dictionary(schema_metadata):
    """Create comprehensive data dictionary"""
    
    dictionary = []
    
    for table_name, metadata in schema_metadata.items():
        for col in metadata['columns']:
            dictionary.append({
                'table': table_name,
                'column': col['name'],
                'type': str(col['type']),
                'nullable': col['nullable'],
                'default': col.get('default'),
                'is_pk': col['name'] in metadata['primary_keys'],
                'is_fk': any(col['name'] in fk['constrained_columns'] 
                           for fk in metadata['foreign_keys'])
            })
    
    df_dict = pd.DataFrame(dictionary)
    return df_dict

data_dict = generate_data_dictionary(schema_metadata)
print("\nšŸ“– Data Dictionary:")
print(data_dict.head(20))

# Save to CSV
data_dict.to_csv('data_dictionary.csv', index=False)

Step 5: Find Join Paths

def find_join_path(from_table, to_table, relationships):
    """
    Find how to join from_table to to_table
    Using BFS to find shortest path
    """
    from collections import deque
    
    # Build adjacency graph; add each relationship in both directions
    # so paths can be found regardless of which side holds the FK
    graph = {}
    for rel in relationships:
        graph.setdefault(rel['from_table'], []).append({
            'to_table': rel['to_table'],
            'from_col': rel['from_column'],
            'to_col': rel['to_column'],
            'type': rel['type']
        })
        graph.setdefault(rel['to_table'], []).append({
            'to_table': rel['from_table'],
            'from_col': rel['to_column'],
            'to_col': rel['from_column'],
            'type': rel['type']
        })
    
    # BFS to find path
    queue = deque([(from_table, [])])
    visited = {from_table}
    
    while queue:
        current_table, path = queue.popleft()
        
        if current_table == to_table:
            return path
        
        if current_table in graph:
            for edge in graph[current_table]:
                if edge['to_table'] not in visited:
                    visited.add(edge['to_table'])
                    new_path = path + [edge]
                    queue.append((edge['to_table'], new_path))
    
    return None  # No path found

# Example: How to join orders to customers?
path = find_join_path('orders', 'customers', relationships)

if path:
    print("\nšŸ›¤ļø  Join Path: orders → customers")
    for i, step in enumerate(path, 1):
        # The left side of each join is the previous step's table
        left_table = path[i - 2]['to_table'] if i > 1 else 'orders'
        print(f"  Step {i}: JOIN {step['to_table']} "
              f"ON {left_table}.{step['from_col']} = "
              f"{step['to_table']}.{step['to_col']}")
else:
    print("āŒ No join path found")
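A found path can also be rendered into a complete, copy-pasteable statement. A sketch, with SELECT * standing in for whatever columns are actually needed:

def path_to_sql(from_table, path):
    """Render a join path into a runnable SQL skeleton."""
    lines = [f"SELECT *\nFROM {from_table}"]
    left = from_table
    for step in path:
        lines.append(
            f"JOIN {step['to_table']} "
            f"ON {left}.{step['from_col']} = {step['to_table']}.{step['to_col']}"
        )
        left = step['to_table']
    return "\n".join(lines)

if path:
    print(path_to_sql('orders', path))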

Step 6: Generate ERD

def generate_erd_mermaid(schema_metadata, relationships):
    """Generate Mermaid ERD diagram"""
    
    mermaid = ["erDiagram"]
    
    # Add tables with columns
    for table_name, metadata in schema_metadata.items():
        mermaid.append(f"    {table_name.upper()} {{")
        
        for col in metadata['columns'][:10]:  # Limit to key columns
            # Mermaid attribute types can't contain spaces, so normalize
            # things like TIMESTAMP WITHOUT TIME ZONE
            col_type = str(col['type']).replace(' ', '_')[:20]
            keys = []
            if col['name'] in metadata['primary_keys']:
                keys.append("PK")
            if any(col['name'] in fk['constrained_columns']
                   for fk in metadata['foreign_keys']):
                keys.append("FK")
            key_marker = f" {', '.join(keys)}" if keys else ""  # Mermaid key lists are comma-separated
            
            mermaid.append(f"        {col_type} {col['name']}{key_marker}")
        
        mermaid.append("    }")
    
    # Add relationships: one referred row relates to many referencing rows,
    # so the referred (FK target) table goes on the ||-- side
    for rel in relationships:
        if rel['type'] == 'explicit_fk':
            mermaid.append(
                f"    {rel['to_table'].upper()} ||--o{{ "
                f"{rel['from_table'].upper()} : {rel['from_column']}"
            )
    
    return "\n".join(mermaid)

erd = generate_erd_mermaid(schema_metadata, relationships)
print("\nšŸ“Š ERD (Mermaid format):")
print(erd)

# Save to file
with open('schema_erd.mmd', 'w') as f:
    f.write(erd)
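To view the diagram without installing anything, the Mermaid source can be wrapped in a small HTML page. A sketch in which the jsDelivr CDN URL and the @10 version pin are assumptions (pasting the .mmd contents into the Mermaid Live Editor also works):

# Wrap the Mermaid source in a minimal HTML page for browser viewing
html = f"""<!DOCTYPE html>
<html><body>
<pre class="mermaid">
{erd}
</pre>
<script type="module">
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
  mermaid.initialize({{ startOnLoad: true }});
</script>
</body></html>"""

with open('schema_erd.html', 'w') as f:
    f.write(html)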

Step 7: Generate Quick Reference Guide

def generate_quick_reference(schema_metadata, relationships):
    """Create quick reference for common joins"""
    
    ref = []
    ref.append("# Schema Quick Reference\n")
    
    # Table summary
    ref.append("## Tables Overview\n")
    for table_name, metadata in sorted(schema_metadata.items()):
        row_count = f"(~{metadata['row_count']} rows)" if 'row_count' in metadata else ""
        ref.append(f"- **{table_name}** {row_count}")
        ref.append(f"  - Primary Key: {', '.join(metadata['primary_keys']) or 'None'}")
        ref.append(f"  - Columns: {len(metadata['columns'])}\n")
    
    # Common joins
    ref.append("\n## Common Join Patterns\n")
    
    # Group relationships by from_table
    joins_by_table = {}
    for rel in relationships:
        if rel['from_table'] not in joins_by_table:
            joins_by_table[rel['from_table']] = []
        joins_by_table[rel['from_table']].append(rel)
    
    for table, rels in sorted(joins_by_table.items()):
        ref.append(f"\n### From {table}:\n")
        for rel in rels:
            ref.append(
                f"```sql\n"
                f"JOIN {rel['to_table']} ON {table}.{rel['from_column']} = " +
                f"{rel['to_table']}.{rel['to_column']}\n"
                f"```\n"
            )
    
    return "\n".join(ref)

quick_ref = generate_quick_reference(schema_metadata, relationships)
print(quick_ref)

# Save to markdown
with open('schema_quick_reference.md', 'w') as f:
    f.write(quick_ref)
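The report template below also lists relationship_graph.json as a deliverable; a minimal sketch of writing it:

import json

# Machine-readable export of the detected and inferred relationships
with open('relationship_graph.json', 'w') as f:
    json.dump({'relationships': relationships}, f, indent=2)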

Context Validation

Before proceeding, verify:

  • Database connection works or schema export is complete
  • Target tables/schemas are clearly defined
  • Have permissions to query INFORMATION_SCHEMA
  • Understand which relationships are explicit vs inferred
  • Know what documentation format is needed

Output Template

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SCHEMA MAPPING REPORT
Database: production_db
Schema: public
Generated: 2025-01-11
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

šŸ“Š SCHEMA OVERVIEW
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Total Tables: 42
Total Columns: 387
Total Relationships: 56
  - Explicit (FK): 38
  - Inferred: 18

šŸ—‚ļø  TABLE CATEGORIES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Fact Tables (fct_*): 8
Dimension Tables (dim_*): 12
Staging Tables (stg_*): 15
Raw Tables (raw_*): 7

šŸ“‹ KEY TABLES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. users (dim_users)
   - PK: id
   - Columns: 15
   - Referenced by: orders, sessions, events

2. orders (fct_orders)
   - PK: id
   - FK: user_id → users.id, product_id → products.id
   - Columns: 23

3. products (dim_products)
   - PK: id
   - Columns: 12
   - Referenced by: orders, cart_items

šŸ”— RELATIONSHIP MAP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

orders → users (user_id)
orders → products (product_id)  
sessions → users (user_id)
events → sessions (session_id)
cart_items → products (product_id)

šŸ›¤ļø  COMMON JOIN PATHS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

users → orders:
  JOIN orders ON users.id = orders.user_id

orders → products:
  JOIN products ON orders.product_id = products.id

users → events (via sessions):
  JOIN sessions ON users.id = sessions.user_id
  JOIN events ON sessions.id = events.session_id

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
FILES GENERATED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

āœ“ schema_erd.mmd (Mermaid ERD)
āœ“ data_dictionary.csv (All columns)
āœ“ schema_quick_reference.md (Join guide)
āœ“ relationship_graph.json (Machine-readable)

Common Scenarios

Scenario 1: "New to the database, need overview"

→ Full schema map with ERD and data dictionary
→ Highlight fact and dimension tables
→ Show most common join patterns

Scenario 2: "How do I join Table A to Table B?"

→ Use find_join_path to show exact SQL
→ If no direct path, show multi-hop joins
→ Validate path makes business sense

Scenario 3: "Document schema for new team members"

→ Generate comprehensive quick reference
→ Include business context for key tables
→ Create visual ERD for overview

Scenario 4: "Find all tables related to users"

→ Traverse relationship graph from users table
→ Show both direct and indirect relationships
→ Categorize by relationship type (see the sketch below)
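A sketch of that traversal, reusing the relationships list from Step 3 and treating edges as undirected (the depth limit of 2 is an arbitrary choice):

from collections import deque

def related_tables(start, relationships, max_depth=2):
    """Find tables reachable from `start`, with their hop distance."""
    neighbors = {}
    for rel in relationships:
        neighbors.setdefault(rel['from_table'], set()).add(rel['to_table'])
        neighbors.setdefault(rel['to_table'], set()).add(rel['from_table'])

    found = {}  # table -> depth (1 = direct, >1 = indirect)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        table, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for nxt in neighbors.get(table, set()):
            if nxt not in seen:
                seen.add(nxt)
                found[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return found

print(related_tables('users', relationships))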

Scenario 5: "Validate schema against expectations"

→ Compare actual schema to documented schema
→ Flag missing tables, columns, relationships
→ Identify schema drift (see the sketch below)
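A sketch of the comparison, assuming the documented schema is supplied as a plain {table: [column, ...]} dict:

def diff_schema(expected, schema_metadata):
    """Flag gaps between a documented schema and the live one."""
    actual = {t: {c['name'] for c in m['columns']}
              for t, m in schema_metadata.items()}

    for table, cols in expected.items():
        if table not in actual:
            print(f"āŒ Missing table: {table}")
            continue
        for col in set(cols) - actual[table]:
            print(f"āŒ {table}: missing column {col}")
    for table in set(actual) - set(expected):
        print(f"āš ļø  Undocumented table (possible schema drift): {table}")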

Handling Missing Context

User provides database but no context: "I'll map the entire schema and present an overview. You can then tell me which areas to explore in depth."

User has partial access (read-only): "I'll work with INFORMATION_SCHEMA queries, which don't require write permissions. This will show structure but not data samples."

User has dbt project instead of database: "Great! I'll extract schema from your dbt models and documentation. This often has better business context than raw database schema."

User wants specific tables only: "I'll focus on those tables and map their direct relationships. Let me know if you want me to expand to related tables."

Foreign keys not enforced: "I'll infer relationships from column naming patterns. Review the 'inferred' relationships and confirm which are correct."

Advanced Options

After basic schema mapping, offer:

Lineage Tracking: "Want me to trace data lineage? I can show how data flows from raw tables through transformations to final fact tables."

Schema Comparison: "I can compare schemas across environments (dev vs prod) or over time to identify changes."

Query Pattern Analysis: "If you provide query logs, I can identify most-used join patterns and optimize documentation around them."

dbt Integration: "I can generate dbt schema.yml files from this mapping to document your models."

Auto-Documentation: "I can set up automated schema documentation that updates when schema changes."

Performance Insights: "Based on table sizes and join patterns, I can identify potential performance bottlenecks."
