Data Analyst
by SankaiAI
A data analyst skill that guides Claude through structured, professional data analysis workflows. Use this skill when the user requests data analysis work including analyze, query, dashboard, metrics, EDA, cohort, funnel, or A/B testing.
---
name: data-analyst
description: A data analyst skill that guides Claude through structured, professional data analysis workflows. Use this skill when the user requests data analysis work including analyze, query, dashboard, metrics, EDA, cohort, funnel, or A/B testing.
---
# CrushData AI - Data Analyst Skill
A data analyst intelligence skill that guides you through structured, professional data analysis workflows.
## How to Use This Skill

When the user requests data analysis work (analyze, query, dashboard, metrics, EDA, cohort, funnel, A/B test), follow this workflow:

### Step 1: Discovery Protocol (MANDATORY)
Before writing any code, ask the user:
## Discovery Questions
1. **Business Context**
- What business question should this analysis answer?
- Who is the audience? (Executive, Analyst, Engineer)
- What action will this analysis inform?
2. **Data Context**
- Which tables/databases contain the relevant data?
- What time range should I analyze?
- Any known data quality issues?
3. **Metric Definitions**
- How does YOUR company define the key metrics?
- Any filters to apply? (exclude test users, internal accounts?)
- What timezone should I use for dates?
4. **Script Organization**
- Save all analysis scripts in an `analysis/` folder.
- Create this folder if it does not exist.
5. **Python Environment**
- Check for `venv` or `.venv`. If missing, `python3 -m venv venv`.
- Install dependencies in venv (`venv/bin/pip install`).
- Run scripts using venv python (`venv/bin/python`).
6. **Reports**
- Save all profiling, validation, and findings to `reports/` folder.
- Create this folder if it does not exist.
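Steps 4-6 above (folders and environment) can be sketched in Python. This is a minimal illustration, assuming the current working directory is the project root:

```python
from pathlib import Path

# Create the standard skill folders if missing (steps 4 and 6 above)
for folder in ("analysis", "reports"):
    Path(folder).mkdir(exist_ok=True)

# Check for an existing virtual environment (step 5 above)
venv_dir = next((p for p in (Path("venv"), Path(".venv")) if p.is_dir()), None)
if venv_dir is None:
    print("No venv found - create one with: python3 -m venv venv")
else:
    print(f"Using virtual environment at {venv_dir}/")
```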
### 1b. Secure Data Access

Credentials are stored in the project's `.env` file - never hardcoded.

- **Check Connections**: Run `npx crushdataai connections` first.
- **Missing Data?**: If the data source is not listed, INSTRUCT the user to run `npx crushdataai connect`.
- **Discover Schema**: Use `npx crushdataai schema <connection> [table]` to see available tables and columns.
- **Get Code**: ALWAYS use `npx crushdataai snippet <name>` to get loading code (uses env vars).
- **Load .env**: Scripts use `os.environ["VAR"]`. Ensure `python-dotenv` is installed and `.env` is loaded.
- **Security**: DO NOT ask the user to copy/move files to `data/`. Treat connected data as read-only.
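The `.env` loading convention can be sketched as follows. This is a simplified stand-in for `python-dotenv`'s `load_dotenv`, with hypothetical variable names; the demo `.env` file is written here purely for illustration:

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Read KEY=VALUE lines into os.environ (simplified load_dotenv stand-in)."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

Path(".env").write_text("DB_HOST=localhost\nDB_NAME=analytics\n")  # demo only
load_env()
print(os.environ["DB_HOST"])  # credentials come from .env, never hardcoded
```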
### Step 2: Search Relevant Domains

Use `search.py` to gather comprehensive information:

```
python3 .claude/skills/data-analyst/scripts/search.py "<query>" --domain <domain> [-n 3]
```
Available domains:
| Domain | Use Case |
|---|---|
| `workflow` | Step-by-step analysis process |
| `metric` | Metric definitions and formulas |
| `chart` | Visualization recommendations |
| `cleaning` | Data quality patterns |
| `sql` | SQL patterns (window functions, cohorts) |
| `python` | pandas/polars code snippets |
| `database` | PostgreSQL, BigQuery, Snowflake tips |
| `report` | Dashboard UX guidelines |
| `validation` | Common mistakes to avoid |
Industry-specific search:

```
python3 .claude/skills/data-analyst/scripts/search.py "<query>" --industry saas|ecommerce|finance|marketing
```
Recommended search order:
1. `workflow` - Get the step-by-step process for this analysis type
2. `metric` or `--industry` - Get relevant metric definitions
3. `sql` or `python` - Get code patterns for implementation
4. `chart` - Get visualization recommendations
5. `validation` - Check for common mistakes to avoid
### Step 3: Data Profiling (MANDATORY Before Analysis)
Before any analysis, run profiling:
Python:

```python
# df is the loaded DataFrame
print(f"Shape: {df.shape}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"Sample:\n{df.sample(5)}")
```
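The snippet above assumes `df` is already loaded; a self-contained version with toy data (column names are illustrative):

```python
import pandas as pd

# Toy stand-in for the real dataset
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"]),
    "revenue": [10.0, None, 25.0, 40.0],
})

print(f"Shape: {df.shape}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"Sample:\n{df.head()}")
```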
SQL:

```sql
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT user_id) AS unique_users,
    MIN(date) AS min_date,
    MAX(date) AS max_date
FROM table;
```
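The same profiling query can be run against a toy in-memory table, using `sqlite3` as a stand-in for the real warehouse (table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (user_id INTEGER, date TEXT);
    INSERT INTO events VALUES (1, '2024-01-01'), (1, '2024-01-02'), (2, '2024-01-03');
""")
total_rows, unique_users, min_date, max_date = con.execute("""
    SELECT COUNT(*), COUNT(DISTINCT user_id), MIN(date), MAX(date) FROM events
""").fetchone()
print(f"I found {total_rows} rows, {unique_users} unique users, "
      f"date range from {min_date} to {max_date}.")
```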
Report findings to user before proceeding:
"I found X rows, Y unique users, date range from A to B. Does this match your expectation?"
### Step 3b: Data Cleaning & Transformation (ETL)
Address data quality issues found in Step 3:
- Cleaning: Handle missing values, remove invalid duplicates, fix types.
- Transformation: Standardize categories, parse dates, normalize text.
- Feature Engineering: Create calculated columns needed for metrics.
Tip: Save cleaning scripts to the `etl/` folder. Save processed data to `data/processed/`.
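A minimal pandas sketch of the cleaning steps above, using hypothetical columns with typical issues (an exact duplicate row, inconsistent category casing, and an unparseable date):

```python
import pandas as pd

# Hypothetical raw data with issues found during profiling
raw = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "plan": ["Pro", "Pro", "pro ", "Free"],
    "signup": ["2024-01-01", "2024-01-01", "2024-01-05", "not a date"],
})

clean = (
    raw.drop_duplicates()  # remove exact duplicate rows
       .assign(
           plan=lambda d: d["plan"].str.strip().str.lower(),  # standardize categories
           signup=lambda d: pd.to_datetime(d["signup"], errors="coerce"),  # fix types
       )
)
print(clean)
```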
### Step 4: Execute Analysis with Validation
Before JOINs:
- Run on 100 rows first
- Check: Did row count change unexpectedly?
- Ask: "The join produced X rows from Y. Expected?"
Before Aggregations:
- Check for duplicates that could inflate sums
- Verify granularity: "Is this one row per user per day?"
- Ask: "Total = $X. Does this seem reasonable?"
Before Delivery:
- Sanity check order of magnitude
- Compare to benchmark or prior period
- Present for user validation before finalizing
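The join check above can be sketched in pandas with toy frames; note both the fan-out from a duplicate key and the unmatched user that a plain inner join would silently drop:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3]})
orders = pd.DataFrame({"user_id": [1, 1, 2, 4], "amount": [10, 20, 30, 40]})

before = len(users)
joined = users.merge(orders, on="user_id", how="left", indicator=True)

# Row count changed (3 -> 4): user 1 fanned out, user 3 matched nothing
print(f"The join produced {len(joined)} rows from {before}. Expected?")
print(joined["_merge"].value_counts())
```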
## Workflow Reference
| Analysis Type | Search Command |
|---|---|
| EDA | --domain workflow query "exploratory data analysis" |
| Dashboard | --domain workflow query "dashboard creation" |
| A/B Test | --domain workflow query "ab test" |
| Cohort | --domain workflow query "cohort analysis" |
| Funnel | --domain workflow query "funnel analysis" |
| Time Series | --domain workflow query "time series" |
| Segmentation | --domain workflow query "customer segmentation" |
| Data Cleaning | --domain workflow query "data cleaning" |
## Common Rules

- **Always ask before assuming** - Metric definitions vary by company
- **Profile data first** - Never aggregate without understanding the data
- **Validate results** - Check totals, compare to benchmarks
- **Document assumptions** - State what filters and definitions you used
- **Show your work** - Explain the logic behind complex queries
### Step 5: Generate Dashboard Output
After analysis, save results for dashboard visualization:
1. Search for the best chart type:

   ```
   python3 .claude/skills/data-analyst/scripts/search.py "<metric description>" --domain chart
   ```

2. Create the dashboard JSON:

   ```python
   from pathlib import Path
   import json
   from datetime import datetime

   Path("reports/dashboards").mkdir(parents=True, exist_ok=True)

   dashboard = {
       "metadata": {
           "title": "Analysis Dashboard",
           "generatedAt": datetime.now().isoformat(),
           "dataRange": f"{start_date} to {end_date}"
       },
       "kpis": [
           {"id": "kpi-1", "label": "Total Revenue", "value": "$50,000",
            "trend": "+12%", "trendDirection": "up"}
       ],
       "charts": [
           {
               "id": "chart-1",
               "type": "line",  # line, bar, pie, area, scatter, donut, table
               "title": "Monthly Trend",
               "data": {
                   "labels": ["Jan", "Feb", "Mar"],
                   "datasets": [{"label": "Revenue", "values": [10000, 15000, 25000]}]
               }
           }
       ]
   }

   with open("reports/dashboards/dashboard.json", "w") as f:
       json.dump(dashboard, f, indent=2)
   ```
#### 5b. Making Charts Refreshable (Recommended)

To allow the user to refresh data directly from the dashboard:

- Include a `query` object in the chart definition.
- Run `npx crushdataai connections` to list available connection names (this is secure - no passwords shown).
- Set `connection` to one of the listed names.
- Set `sql` to the query used to generate the data.

SECURITY: Never read `.env` directly to find connection names. Always use `npx crushdataai connections`.

```json
"query": {
  "connection": "my_postgres_db",
  "sql": "SELECT date, revenue FROM sales WHERE..."
}
```
Database Specifics:

- **SQL/Databases**: Provide the full SQL query.
- **Shopify**: Provide the resource name (e.g. `orders`).
- **CSV**: Provide the connection name. `sql` is ignored but required (set to `"default"`).
- **MongoDB**: Provide the collection name in the `sql` field.
Script-Based Refresh (for Python-aggregated charts):

If your chart requires Python aggregation (e.g., grouping, custom calculations), use `script` instead of `connection`:

```json
"query": {
  "script": "analysis/my_dashboard_script.py"
}
```

When the user clicks Refresh, the CLI will re-run your Python script. The script should update the dashboard JSON file with fresh aggregated data.

TIP: Use `script` for Shopify/API charts that need aggregation. Use `connection` + `sql` only for SQL databases where the query returns pre-formatted chart data.
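A refresh script along these lines might look like the following sketch. The dashboard path matches Step 5; the seed file and the stand-in values are for illustration only (a real script would re-run the aggregation):

```python
import json
from pathlib import Path

path = Path("reports/dashboards/dashboard.json")
path.parent.mkdir(parents=True, exist_ok=True)
if not path.exists():  # seed a minimal dashboard purely for this demo
    path.write_text(json.dumps(
        {"charts": [{"id": "chart-1",
                     "data": {"datasets": [{"label": "Revenue", "values": []}]}}]}))

# Refresh: re-aggregate (stand-in values here) and rewrite the JSON in place
dashboard = json.loads(path.read_text())
dashboard["charts"][0]["data"]["datasets"][0]["values"] = [12000, 16000, 27000]
path.write_text(json.dumps(dashboard, indent=2))
```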
Finally, tell the user: "Dashboard ready! Run `npx crushdataai dashboard` to view."
## Pre-Delivery Checklist
Before presenting final results:
- Confirmed business question is answered
- Data was profiled and validated
- Metric definitions match user's expectations
- Sanity checks pass (order of magnitude, trends, etc.)
- Visualizations follow best practices (search `--domain chart`)
- Assumptions and filters are documented
## Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Use when Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
