Data Analysis

by ScientiaCapital


Executive-grade data analysis with pandas/polars and McKinsey-quality visualizations. Use when analyzing data, building dashboards, creating investor presentations, or calculating SaaS metrics.


---
name: data-analysis
description: "Executive-grade data analysis with pandas/polars and McKinsey-quality visualizations. Use when analyzing data, building dashboards, creating investor presentations, or calculating SaaS metrics."
---

<quick_start> Universal data loader:

```python
df = load_data("file.csv")  # Supports CSV, Excel, JSON, Parquet, PDF, PPTX
```

SaaS metrics:

```python
metrics = calculate_saas_metrics(df)  # MRR, ARR, LTV, CAC, churn
retention = cohort_retention_analysis(df)  # Retention matrix
```

McKinsey-style charts: Action titles ("Q4 Revenue Exceeded Target by 23%"), not descriptive titles </quick_start>

<success_criteria> Analysis is successful when:

  • Data loaded and cleaned (dropna, dedup, type conversion)
  • Metrics calculated correctly (MRR, ARR, LTV:CAC, churn, cohort retention)
  • Charts follow McKinsey principles: action titles, data-ink ratio >80%, one message per chart
  • Executive colors used (#003366 primary, #2E7D32 positive, #C62828 negative)
  • Streamlit dashboard runs without errors
  • NO OPENAI: Use Claude for narrative generation if needed </success_criteria>
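
The first two criteria can be enforced with a small guard function; a minimal sketch, assuming the cleaned frame has `date` and `revenue` columns (the column names and the `validate_clean` helper are illustrative, not part of the skill):

```python
import pandas as pd

def validate_clean(df: pd.DataFrame) -> None:
    """Hypothetical guard: fail fast if the cleaning criteria are not met."""
    assert not df.duplicated().any(), "duplicate rows remain"
    assert pd.api.types.is_datetime64_any_dtype(df["date"]), "date not parsed to datetime"
    assert df["revenue"].notna().all(), "revenue has missing values"

# Passes silently on a frame that meets the criteria
clean = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-02-01"]),
    "revenue": [100.0, 120.0],
})
validate_clean(clean)
```

Calling it right after `prepare_for_analysis` catches schema regressions before any metric is computed.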

<core_content> Executive-grade data analysis for VC, PE, C-suite presentations using pandas, polars, Plotly, Altair, and Streamlit.

Quick Reference

| Task | Tools | Output |
|------|-------|--------|
| Data ingestion | pandas, polars, pdfplumber, python-pptx | DataFrame |
| Wrangling | pandas/polars transforms | Clean dataset |
| Analysis | numpy, scipy, statsmodels | Insights |
| Visualization | Plotly, Altair, Seaborn | Charts |
| Dashboards | Streamlit, DuckDB | Interactive apps |
| Presentations | Plotly export, PDF generation | Investor-ready |

Data Ingestion Patterns

Universal Data Loader

```python
import pandas as pd
from pathlib import Path

def load_data(file_path: str) -> pd.DataFrame:
    """Load data from any common format."""
    path = Path(file_path)
    suffix = path.suffix.lower()

    loaders = {
        '.csv': pd.read_csv,
        '.xlsx': lambda p: pd.read_excel(p, engine='openpyxl'),
        '.xls': lambda p: pd.read_excel(p, engine='xlrd'),
        '.json': pd.read_json,
        '.parquet': pd.read_parquet,
        '.sql': lambda p: pd.read_sql(p.read_text(), conn),  # requires an open DB connection `conn` in scope
        '.md': lambda p: parse_markdown_tables(p),  # markdown-table parser defined elsewhere in this skill
        '.pdf': lambda p: extract_pdf_tables(p),    # defined below
        '.pptx': lambda p: extract_pptx_tables(p),  # defined below
    }

    if suffix not in loaders:
        raise ValueError(f"Unsupported format: {suffix}")

    return loaders[suffix](path)
```

PDF Table Extraction

```python
import pdfplumber

def extract_pdf_tables(pdf_path: str) -> pd.DataFrame:
    """Extract tables from PDF using pdfplumber."""
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                if table and len(table) > 1:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    all_tables.append(df)

    return pd.concat(all_tables, ignore_index=True) if all_tables else pd.DataFrame()
```

PowerPoint Data Extraction

```python
from pptx import Presentation

def extract_pptx_tables(pptx_path: str) -> pd.DataFrame:
    """Extract all tables from a PowerPoint deck into one DataFrame."""
    prs = Presentation(pptx_path)
    tables = []

    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.has_table:
                table = shape.table
                data = [[cell.text for cell in row.cells] for row in table.rows]
                if len(data) > 1:
                    tables.append(pd.DataFrame(data[1:], columns=data[0]))

    # Concatenate like the PDF helper so load_data's return type holds
    return pd.concat(tables, ignore_index=True) if tables else pd.DataFrame()
```

Data Wrangling Patterns

Polars for Performance (often several times faster than pandas on large datasets)

```python
import polars as pl

# Lazy evaluation for large datasets
df = (
    pl.scan_csv("large_file.csv")
    .filter(pl.col("revenue") > 0)
    .with_columns([
        (pl.col("revenue") / pl.col("customers")).alias("arpu"),
        pl.col("date").str.to_date().alias("date_parsed"),
    ])
    .group_by("segment")
    .agg([
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("customers").mean().alias("avg_customers"),
    ])
    .collect()
)
```

Common Transformations

```python
def prepare_for_analysis(df: pd.DataFrame) -> pd.DataFrame:
    """Standard data prep pipeline."""
    return (df
        .dropna(subset=['key_column'])
        .drop_duplicates()
        .assign(
            date=lambda x: pd.to_datetime(x['date']),
            revenue=lambda x: pd.to_numeric(x['revenue'], errors='coerce'),
            month=lambda x: x['date'].dt.to_period('M'),
        )
        .sort_values('date')
        .reset_index(drop=True)
    )
```

SaaS Metrics Calculations

Core Metrics

```python
def calculate_saas_metrics(df: pd.DataFrame) -> dict:
    """Calculate key SaaS metrics for investor reporting.

    Expects columns: month, mrr, status, customer_id, is_new,
    sales_cost, marketing_cost.
    """

    # MRR / ARR (assumes monthly granularity)
    mrr = df.groupby('month')['mrr'].sum()
    arr = mrr.iloc[-1] * 12

    # Growth rates
    mrr_growth = mrr.pct_change().iloc[-1]

    # Churn (revenue churn: churned MRR over total MRR)
    churned = df[df['status'] == 'churned']['mrr'].sum()
    total_mrr = df['mrr'].sum()
    churn_rate = churned / total_mrr if total_mrr > 0 else 0

    # CAC & LTV
    total_sales_marketing = df['sales_cost'].sum() + df['marketing_cost'].sum()
    new_customers = df[df['is_new']]['customer_id'].nunique()
    cac = total_sales_marketing / new_customers if new_customers > 0 else 0

    avg_revenue_per_customer = df.groupby('customer_id')['mrr'].mean().mean()
    avg_lifespan_months = 1 / churn_rate if churn_rate > 0 else 36  # assume 36 months if no churn observed
    ltv = avg_revenue_per_customer * avg_lifespan_months

    ltv_cac_ratio = ltv / cac if cac > 0 else 0
    cac_payback_months = cac / avg_revenue_per_customer if avg_revenue_per_customer > 0 else 0

    return {
        'mrr': mrr.iloc[-1],
        'arr': arr,
        'mrr_growth': mrr_growth,
        'churn_rate': churn_rate,
        'cac': cac,
        'ltv': ltv,
        'ltv_cac_ratio': ltv_cac_ratio,
        'cac_payback_months': cac_payback_months,
    }
```
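
The LTV/CAC arithmetic above is easy to sanity-check on round numbers; the figures below are illustrative, not benchmarks:

```python
# Hypothetical inputs: 2% monthly churn, $250 ARPU, $3,000 blended CAC
monthly_churn = 0.02
arpu = 250.0
cac = 3_000.0

avg_lifespan_months = 1 / monthly_churn   # 50 months
ltv = arpu * avg_lifespan_months          # $12,500
ltv_cac = ltv / cac                       # ~4.2x, above the 3x bar
payback_months = cac / arpu               # 12 months to recover CAC

print(f"LTV ${ltv:,.0f}, LTV:CAC {ltv_cac:.1f}x, payback {payback_months:.0f} mo")
```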

Cohort Analysis

```python
def cohort_retention_analysis(df: pd.DataFrame) -> pd.DataFrame:
    """Build cohort retention matrix for investor reporting."""
    df = df.copy()  # avoid mutating the caller's frame

    # Assign cohort (first purchase month)
    df['cohort'] = df.groupby('customer_id')['date'].transform('min').dt.to_period('M')
    df['period'] = df['date'].dt.to_period('M')
    df['cohort_age'] = (df['period'] - df['cohort']).apply(lambda x: x.n)

    # Build retention matrix
    cohort_data = df.groupby(['cohort', 'cohort_age']).agg({
        'customer_id': 'nunique',
        'revenue': 'sum'
    }).reset_index()

    # Pivot for visualization
    cohort_counts = cohort_data.pivot(
        index='cohort',
        columns='cohort_age',
        values='customer_id'
    )

    # Calculate retention percentages
    cohort_sizes = cohort_counts.iloc[:, 0]
    retention = cohort_counts.divide(cohort_sizes, axis=0) * 100

    return retention
```
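
The final divide step can be verified on a toy matrix; the cohort counts below are made up for illustration:

```python
import pandas as pd

# Hypothetical cohort counts: rows = acquisition cohort, columns = months since acquisition
cohort_counts = pd.DataFrame(
    {0: [100, 80], 1: [70, 60], 2: [55, float("nan")]},
    index=pd.PeriodIndex(["2024-01", "2024-02"], freq="M"),
)

# Divide every column by month-0 size, row-wise, to get retention percentages
retention = cohort_counts.divide(cohort_counts.iloc[:, 0], axis=0) * 100
# Jan cohort: 100%, 70%, 55%; Feb cohort: 100%, 75%, NaN (not yet observed)
```

The NaN in the youngest cohort's last column is expected: that month has not happened yet, which is why the heatmap annotation code below skips NaN cells.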

Executive Visualization

McKinsey/BCG Chart Principles

```yaml
mckinsey_style:
  colors:
    primary: "#003366"      # Deep blue
    accent: "#0066CC"       # Bright blue
    positive: "#2E7D32"     # Green
    negative: "#C62828"     # Red
    neutral: "#757575"      # Gray

  typography:
    title: "Georgia, serif"
    body: "Arial, sans-serif"
    size_title: 18
    size_body: 12

  principles:
    - "One message per chart"
    - "Action title (not descriptive)"
    - "Data-ink ratio > 80%"
    - "Remove chartjunk"
    - "Label directly on chart"
```

Plotly Executive Charts

```python
import plotly.express as px
import plotly.graph_objects as go

EXEC_COLORS = {
    'primary': '#003366',
    'secondary': '#0066CC',
    'positive': '#2E7D32',
    'negative': '#C62828',
    'neutral': '#757575',
}

def exec_line_chart(df, x, y, title):
    """McKinsey-style line chart."""
    fig = px.line(df, x=x, y=y)

    fig.update_layout(
        title=dict(
            text=f"<b>{title}</b>",
            font=dict(size=18, family="Georgia"),
            x=0,
        ),
        font=dict(family="Arial", size=12),
        plot_bgcolor='white',
        xaxis=dict(showgrid=False, showline=True, linecolor='black'),
        yaxis=dict(showgrid=True, gridcolor='#E0E0E0', showline=True, linecolor='black'),
        margin=dict(l=60, r=40, t=60, b=40),
    )

    fig.update_traces(line=dict(color=EXEC_COLORS['primary'], width=3))

    return fig

def exec_waterfall(values, labels, title):
    """Waterfall chart for revenue/cost breakdown."""
    fig = go.Figure(go.Waterfall(
        orientation="v",
        measure=["relative"] * (len(values) - 1) + ["total"],
        x=labels,
        y=values,
        connector=dict(line=dict(color="rgb(63, 63, 63)")),
        increasing=dict(marker=dict(color=EXEC_COLORS['positive'])),
        decreasing=dict(marker=dict(color=EXEC_COLORS['negative'])),
        totals=dict(marker=dict(color=EXEC_COLORS['primary'])),
    ))

    fig.update_layout(
        title=dict(text=f"<b>{title}</b>", font=dict(size=18, family="Georgia")),
        font=dict(family="Arial", size=12),
        plot_bgcolor='white',
        showlegend=False,
    )

    return fig
```
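
`exec_waterfall` treats every value but the last as a relative step, so pass the bridge components followed by the ending total. A typical MRR bridge, with hypothetical numbers, can be assembled like this:

```python
# Hypothetical quarterly MRR bridge (all figures illustrative)
components = {
    "New business": 120_000,
    "Expansion": 45_000,
    "Contraction": -15_000,
    "Churn": -30_000,
}

labels = list(components) + ["Net new MRR"]
values = list(components.values()) + [sum(components.values())]

# fig = exec_waterfall(values, labels, "New Business Drove Most of Net New MRR")
```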

Cohort Heatmap

```python
import plotly.express as px

def cohort_heatmap(retention_df, title="Customer Retention by Cohort"):
    """Publication-quality cohort retention heatmap."""
    fig = px.imshow(
        retention_df.values,
        labels=dict(x="Months Since Acquisition", y="Cohort", color="Retention %"),
        x=list(retention_df.columns),
        y=[str(c) for c in retention_df.index],
        color_continuous_scale='Blues',
        aspect='auto',
    )

    # Add text annotations, skipping cells not yet observed
    for i, row in enumerate(retention_df.values):
        for j, val in enumerate(row):
            if not pd.isna(val):
                fig.add_annotation(
                    x=j, y=i,
                    text=f"{val:.0f}%",
                    showarrow=False,
                    font=dict(color='white' if val > 50 else 'black', size=10)
                )

    fig.update_layout(
        title=dict(text=f"<b>{title}</b>", font=dict(size=18, family="Georgia")),
        font=dict(family="Arial", size=12),
    )

    return fig
```

Streamlit Dashboard Template

```python
import streamlit as st
import pandas as pd
import plotly.express as px

st.set_page_config(page_title="Executive Dashboard", layout="wide")

# Assumes data and metrics are computed with the helpers above, e.g.:
#   df = load_data("data.csv")
#   m = calculate_saas_metrics(df)
#   mrr, arr, churn = m['mrr'], m['arr'], m['churn_rate']
#   mrr_growth, ltv_cac = m['mrr_growth'], m['ltv_cac_ratio']
#   arr_growth and churn_delta are period-over-period deltas computed separately

# Custom CSS for executive styling
st.markdown("""
<style>
    .metric-card {
        background: linear-gradient(135deg, #003366, #0066CC);
        padding: 20px;
        border-radius: 10px;
        color: white;
    }
    .stMetric label { font-family: Georgia, serif; }
</style>
""", unsafe_allow_html=True)

# Header
st.title("Executive Dashboard")
st.markdown("---")

# KPI Row
col1, col2, col3, col4 = st.columns(4)

with col1:
    st.metric("MRR", f"${mrr:,.0f}", f"{mrr_growth:+.1%}")
with col2:
    st.metric("ARR", f"${arr:,.0f}", f"{arr_growth:+.1%}")
with col3:
    st.metric("LTV:CAC", f"{ltv_cac:.1f}x", delta_color="normal")
with col4:
    st.metric("Churn", f"{churn:.1%}", f"{churn_delta:+.1%}", delta_color="inverse")

# Charts Row
st.markdown("## Revenue Trend")
st.plotly_chart(exec_line_chart(df, 'month', 'revenue', 'MRR Growth Exceeds Target'), use_container_width=True)

# Cohort Analysis
st.markdown("## Cohort Retention")
st.plotly_chart(cohort_heatmap(retention_df), use_container_width=True)
```

Investor Presentation Patterns

Pitch Deck Metrics Sequence

```yaml
investor_metrics_flow:
  1_unit_economics:
    charts: ["CAC vs LTV bar", "LTV:CAC trend line"]
    key_message: "3x+ LTV:CAC proves efficient growth"

  2_mrr_waterfall:
    charts: ["MRR waterfall (new, expansion, churn, contraction)"]
    key_message: "Net revenue retention > 100%"

  3_cohort_retention:
    charts: ["Cohort heatmap", "Revenue retention curve"]
    key_message: "Strong retention = compounding value"

  4_growth_efficiency:
    charts: ["Magic Number", "CAC payback period"]
    key_message: "Efficient growth engine"

  5_projections:
    charts: ["ARR projection with scenarios"]
    key_message: "Clear path to $X ARR"
```
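
Two metrics named in the flow above, net revenue retention and the SaaS Magic Number, reduce to one-line formulas. A sketch using their common definitions, with illustrative inputs (the variable names are mine):

```python
# Net revenue retention: existing-customer MRR movement only, no new business
starting_mrr, expansion, contraction, churned = 500_000, 40_000, 10_000, 15_000
nrr = (starting_mrr + expansion - contraction - churned) / starting_mrr
# 1.03 -> 103%: the installed base grows on its own

# SaaS Magic Number: annualized net new revenue per $1 of prior-quarter S&M spend
q_rev_now, q_rev_prior, sm_prior = 1_650_000, 1_500_000, 400_000
magic_number = (q_rev_now - q_rev_prior) * 4 / sm_prior
# 1.5 -> above the ~0.75 level commonly read as efficient growth
```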

Action Titles (McKinsey Style)

Bad (Descriptive) → Good (Action)

❌ "Revenue by Quarter"
✅ "Q4 Revenue Exceeded Target by 23%"

❌ "Customer Acquisition Cost"
✅ "CAC Decreased 40% While Maintaining Quality"

❌ "Cohort Analysis"
✅ "90-Day Retention Improved to 85%, Up From 72%"

❌ "Market Size"
✅ "TAM of $4.2B with Clear Path to $500M SAM"

Quick Commands

```python
# Load and analyze any file
df = load_data("data.csv")
metrics = calculate_saas_metrics(df)
retention = cohort_retention_analysis(df)

# Generate executive charts
fig = exec_line_chart(df, 'month', 'mrr', 'MRR Growth Accelerating')
fig.write_html("mrr_chart.html")
fig.write_image("mrr_chart.png", scale=2)  # static export requires the kaleido package

# Run Streamlit dashboard
# streamlit run dashboard.py
```

Integration Notes

  • Pairs with: revenue-ops-skill (metrics), pricing-strategy-skill (modeling)
  • Stack: Python 3.11+, pandas, polars, plotly, altair, streamlit
  • Projects: coperniq-forge (ROI calculators), thetaroom (trading analysis)
  • NO OPENAI: Use Claude for narrative generation

Reference Files

  • reference/chart-gallery.md - 20+ chart templates with code
  • reference/saas-metrics.md - Complete SaaS KPI definitions
  • reference/streamlit-patterns.md - Production dashboard patterns
  • reference/data-wrangling.md - Format-specific extraction guides

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) creating new spreadsheets with formulas and formatting, (2) reading or analyzing data, (3) modifying existing spreadsheets while preserving formulas, (4) data analysis and visualization in spreadsheets, or (5) recalculating formulas.


Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.



Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis


Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.


Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.


Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.


Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.


Anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.



Skill Information

Category: Data
Last Updated: 1/19/2026