Analyze Data
by dtbuchholz
name: analyze-data
description: Analyze a dataset with parallel EDA agents and generate insights. Creates distribution analysis, missing data reports, correlation analysis, outlier detection, and visualizations.
Perform comprehensive data analysis using parallel specialist agents. Generates insights, visualizations, and recommendations.
When This Skill Applies
- User provides a dataset path (CSV, Parquet, JSON)
- User asks to analyze or explore data
- User wants to understand data quality or distributions
Data Path
The user should provide a path to the data file. If not provided:
- Look for data files:
find . -name "*.csv" -o -name "*.parquet" -o -name "*.json" | head -10
- Ask user: "Which dataset would you like to analyze?"
Workflow
Step 1: Initial Data Load and Profile
Load the data and generate a quick profile:
import pandas as pd
import numpy as np
# Load data (detect format)
df = pd.read_csv('[data_path]') # or read_parquet, read_json
# Quick profile
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"Sample:\n{df.head()}")
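The comment in the snippet above mentions format detection, but the code only calls read_csv. A minimal sketch of suffix-based dispatch follows; `load_dataset` and `READERS` are hypothetical names introduced here for illustration, not part of the skill, and the Parquet branch assumes pyarrow or fastparquet is installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file suffix to the pandas reader for that format.
READERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,  # assumes pyarrow or fastparquet is available
    ".json": pd.read_json,
}


def load_dataset(path: str) -> pd.DataFrame:
    """Load a CSV, Parquet, or JSON file based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return READERS[suffix](path)
```

Calling `load_dataset('[data_path]')` in place of the `pd.read_csv` line would leave the rest of the profile unchanged.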
Step 2: Ask Analysis Questions
Use AskUserQuestion to clarify:
- Analysis goal: "What question are you trying to answer with this data?"
  - Exploratory analysis (understand the data)
  - Predictive modeling (predict a target)
  - Statistical testing (compare groups)
  - Time series analysis (forecast trends)
- Target variable (if applicable): "Which column is the target/outcome you want to predict or analyze?"
- Key dimensions: "Which columns represent important groups or segments to analyze?"
Step 3: Launch Parallel Analysis Agents
CRITICAL: Launch ALL agents in a SINGLE message.
Task (model: haiku, subagent_type: general-purpose): "DISTRIBUTION ANALYSIS
Analyze distributions for this dataset:
- Load: [data_path]
- For each numeric column: compute mean, median, std, skewness, kurtosis
- Check normality (Shapiro-Wilk for n<5000, else D'Agostino)
- Identify heavily skewed columns (|skew| > 1)
- For each categorical column: value counts, cardinality
Output:
- Table of distribution stats
- List of columns needing transformation
- Anomalies found"
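For reference, the checks the distribution agent is asked to run could be sketched roughly as below. This is an illustration under assumptions, not the code the agent will necessarily produce: `distribution_summary` and `categorical_cardinality` are hypothetical helpers, the minimum sample size of 8 is a choice made here, and SciPy is assumed to be installed.

```python
import numpy as np
import pandas as pd
from scipy import stats


def distribution_summary(df: pd.DataFrame, skew_threshold: float = 1.0) -> pd.DataFrame:
    """Summary stats and a normality p-value for each numeric column."""
    rows = {}
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        if len(s) < 8:
            p_value = np.nan  # assumption: skip the test on very small samples
        elif len(s) < 5000:
            _, p_value = stats.shapiro(s)  # Shapiro-Wilk
        else:
            _, p_value = stats.normaltest(s)  # D'Agostino-Pearson K^2
        skew = s.skew()  # bias-corrected sample skewness
        rows[col] = {
            "mean": s.mean(),
            "median": s.median(),
            "std": s.std(),
            "skew": skew,
            "kurtosis": s.kurt(),  # excess kurtosis (normal = 0)
            "normality_p": p_value,
            "needs_transform": bool(abs(skew) > skew_threshold),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


def categorical_cardinality(df: pd.DataFrame) -> pd.Series:
    """Number of distinct values in each non-numeric, non-datetime column."""
    return df.select_dtypes(exclude=["number", "datetime"]).nunique()
```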
Task (model: haiku, subagent_type: general-purpose): "MISSING DATA ANALYSIS
Analyze missing data patterns:
- Load: [data_path]
- Missing count and percentage per column
- Missing data patterns (MCAR, MAR, MNAR indicators)
- Correlations between missingness
- Columns with >50% missing (candidates for dropping)
Output:
- Missing data summary table
- Pattern analysis
- Imputation recommendations"
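The per-column counts, the >50% rule, and the missingness correlations from the prompt above can be computed with plain pandas. The sketch below assumes a non-empty DataFrame; `missing_summary` and `missingness_correlation` are hypothetical helper names. Note that correlated missingness is only a hint toward MAR or MNAR and does not prove either mechanism.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame, drop_threshold: float = 0.5) -> pd.DataFrame:
    """Missing count and percentage per column, flagging drop candidates."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": n_missing,
        "missing_pct": n_missing / len(df) * 100,
    })
    summary["drop_candidate"] = summary["missing_pct"] > drop_threshold * 100
    return summary.sort_values("missing_pct", ascending=False)


def missingness_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Correlation between the 0/1 missingness indicators of each column."""
    indicators = df.isnull().astype(int)
    # Columns that are never (or always) missing have zero variance,
    # so their correlation is undefined; leave them out.
    indicators = indicators.loc[:, indicators.nunique() > 1]
    return indicators.corr()
```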
Task (model: haiku, subagent_type: general-purpose): "CORRELATION ANALYSIS
Analyze relationships:
- Load: [data_path]
- Pearson correlations for numeric columns
- High correlations (|r| > 0.7) - multicollinearity risks
- Target correlations if target specified: [target]
- Cramér's V for categorical associations
Output:
- Top 10 correlations
- Multicollinearity warnings
- Feature importance ranking (if target)"
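As a rough illustration of the two association measures named in the prompt above, the sketch below lists Pearson pairs with |r| above the threshold and computes Cramér's V from a chi-squared contingency test. The function names are hypothetical, SciPy is assumed to be installed, and the uncorrected form of Cramér's V (no bias or Yates correction) is an assumption made here.

```python
import numpy as np
import pandas as pd
from scipy import stats


def high_correlations(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Pairs of numeric columns whose |Pearson r| exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    # Keep only the upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=np.abs, ascending=False)


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    if k == 0:
        return float("nan")  # one variable is constant
    return float(np.sqrt(chi2 / (n * k)))
```

V ranges from 0 (no association) to 1 (one variable fully determines the other), which makes it comparable in spirit to |r| for categorical pairs.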
Task (model: haiku, subagent_type: general-purpose): "OUTLIER ANALYSIS
Detect outliers:
- Load: [data_path]
- IQR method for each numeric column
- Z-score method (|z| > 3)
- Isolation Forest for multivariate outliers
- Business logic outliers (negative prices, future dates, etc.)
Output:
- Outlier counts per column
- Most extreme values
- Recommended handling"
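The two univariate rules in the prompt above could be sketched as follows. This covers only the IQR and z-score checks; Isolation Forest and business-logic rules are left out. `outlier_counts` is a hypothetical helper, and the 1.5 × IQR fence is the conventional default assumed here, since the prompt does not state a multiplier.

```python
import pandas as pd


def outlier_counts(df: pd.DataFrame, iqr_k: float = 1.5, z_thresh: float = 3.0) -> pd.DataFrame:
    """Count IQR-rule and z-score outliers for each numeric column."""
    rows = {}
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        iqr_mask = (s < q1 - iqr_k * iqr) | (s > q3 + iqr_k * iqr)
        std = s.std()
        # A constant column has std 0; treat it as having no z-score outliers.
        z = (s - s.mean()) / std if std > 0 else s * 0.0
        rows[col] = {
            "iqr_outliers": int(iqr_mask.sum()),
            "z_outliers": int((z.abs() > z_thresh).sum()),
            "min": s.min(),
            "max": s.max(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

The two rules can disagree: the z-score uses the mean and standard deviation, which extreme values themselves inflate, so on small samples it may miss points the IQR rule catches.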
Task (model: sonnet, subagent_type: general-purpose): "VISUALIZATION GENERATION
Create key visualizations:
- Load: [data_path]
- Distribution plots for top numeric columns
- Correlation heatmap
- Target distribution (if applicable)
- Time trends (if datetime columns exist)
- Category breakdowns
Save plots to: ./analysis_output/
Use: matplotlib, seaborn
Output: List of generated plot files"
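A minimal sketch of what the visualization agent's output might look like, restricted to histograms and a correlation heatmap. It uses matplotlib only (seaborn's heatmap would be a natural substitute), selects the non-interactive Agg backend so it runs without a display, and assumes column names are safe to use in file names. `save_basic_plots` is a hypothetical helper.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend; no display required
import matplotlib.pyplot as plt
import pandas as pd


def save_basic_plots(df: pd.DataFrame, out_dir: str = "analysis_output", max_cols: int = 6) -> list:
    """Save one histogram per numeric column plus a correlation heatmap."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    numeric = df.select_dtypes(include="number")
    saved = []

    for col in numeric.columns[:max_cols]:
        fig, ax = plt.subplots(figsize=(6, 4))
        ax.hist(numeric[col].dropna(), bins=30)
        ax.set_title(f"Distribution of {col}")
        ax.set_xlabel(str(col))
        ax.set_ylabel("Count")
        path = out / f"dist_{col}.png"
        fig.savefig(path, dpi=120, bbox_inches="tight")
        plt.close(fig)
        saved.append(str(path))

    if numeric.shape[1] >= 2:
        corr = numeric.corr()
        fig, ax = plt.subplots(figsize=(6, 5))
        image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
        ax.set_xticks(range(len(corr.columns)))
        ax.set_xticklabels(corr.columns, rotation=90)
        ax.set_yticks(range(len(corr.columns)))
        ax.set_yticklabels(corr.columns)
        fig.colorbar(image, ax=ax, label="Pearson r")
        path = out / "correlation_heatmap.png"
        fig.savefig(path, dpi=120, bbox_inches="tight")
        plt.close(fig)
        saved.append(str(path))

    return saved
```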
Step 4: If Predictive Modeling Requested
Launch additional modeling agents:
Task (model: sonnet, subagent_type: general-purpose): "BASELINE MODELING
Build baseline models:
- Load: [data_path]
- Target: [target]
- Train/test split (80/20, stratified if classification)
- Baseline: DummyClassifier/DummyRegressor
- Simple model: LogisticRegression or LinearRegression
- Tree model: RandomForestClassifier/Regressor
Report:
- Baseline performance
- Simple model performance
- Feature importances from tree model
- Recommended next steps"
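For a classification target, the baseline comparison described in the prompt above might look like the sketch below. It assumes scikit-learn is installed and that X is already numeric with no missing values; `baseline_classification` is a hypothetical helper, and accuracy is an assumed metric since the prompt does not name one. A regression target would swap in DummyRegressor, LinearRegression, and RandomForestRegressor with a regression metric.

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def baseline_classification(X, y, seed: int = 0):
    """Fit dummy, logistic, and random-forest models on an 80/20 stratified split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "dummy": DummyClassifier(strategy="most_frequent"),
        "logistic": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    # Impurity-based importances from the fitted forest; they sum to 1.
    importances = models["random_forest"].feature_importances_
    return scores, importances
```

The dummy score is the number to beat: a model that does not clearly exceed it has learned little beyond the class balance.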
Task (model: haiku, subagent_type: general-purpose): "FEATURE ENGINEERING SUGGESTIONS
Based on data profile, suggest features:
- Log transforms for skewed numerics
- Binning strategies
- Interaction terms
- Date feature extraction
- Encoding strategies for categoricals
- Aggregation features if hierarchical data
Output: Prioritized list of feature engineering ideas"
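Two of the suggestions listed above, log transforms and date feature extraction, are mechanical enough to sketch. The example below is illustrative only: `add_basic_features` is a hypothetical helper, it applies log1p only to non-negative, right-skewed columns (an assumption, since a log transform does not help left skew), and the new column names are made up for this example.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame, skew_threshold: float = 1.0) -> pd.DataFrame:
    """Add log1p versions of right-skewed numerics and simple date parts."""
    out = df.copy()
    for col in df.select_dtypes(include="number").columns:
        s = df[col]
        # log1p is defined for values >= 0 here and compresses long right tails.
        if s.min() >= 0 and s.skew() > skew_threshold:
            out[f"log1p_{col}"] = np.log1p(s)
    for col in df.select_dtypes(include="datetime").columns:
        out[f"{col}_year"] = df[col].dt.year
        out[f"{col}_month"] = df[col].dt.month
        out[f"{col}_dayofweek"] = df[col].dt.dayofweek  # Monday = 0
    return out
```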
Step 5: Synthesize Results
Collect all agent outputs and create a unified report:
# Data Analysis Report: [Dataset Name]
**Generated:** [Date] **Dataset:** [Path] **Shape:** [Rows] x [Columns]
## Executive Summary
[2-3 key findings with metrics]
## Data Quality
### Missing Data
[From missing data agent]
### Outliers
[From outlier agent]
### Data Types
[Column type summary]
## Key Distributions
[Distribution insights + plots]
## Relationships
### Correlations
[Top correlations, multicollinearity warnings]
### Target Analysis (if applicable)
[Target distribution, key predictors]
## Visualizations
[Links/embeds to generated plots]
## Recommendations
### Data Cleaning
1. [Specific action]
2. [Specific action]
### Feature Engineering
1. [Specific suggestion]
2. [Specific suggestion]
### Modeling (if applicable)
- Baseline performance: [metric]
- Recommended approach: [algorithm]
- Key features: [list]
## Next Steps
1. [Action item]
2. [Action item]
Step 6: Save Outputs
mkdir -p analysis_output
Save:
- analysis_output/report.md - Full analysis report
- analysis_output/data_profile.json - Structured data profile
- analysis_output/*.png - Visualizations
- analysis_output/notebook.ipynb - Reproducible notebook (optional)
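The skill does not define the layout of data_profile.json. One possible shape, mirroring the quick profile from Step 1, is sketched below; the keys and the `write_data_profile` helper are assumptions made for illustration.

```python
import json
from pathlib import Path

import pandas as pd


def write_data_profile(df: pd.DataFrame, out_dir: str = "analysis_output") -> Path:
    """Write a small JSON profile (shape, dtypes, missing counts) and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    profile = {
        "rows": int(df.shape[0]),
        "columns": int(df.shape[1]),
        "dtypes": {str(col): str(dtype) for col, dtype in df.dtypes.items()},
        # Cast NumPy integers to int so the json module can serialize them.
        "missing": {str(col): int(n) for col, n in df.isnull().sum().items()},
    }
    path = out / "data_profile.json"
    path.write_text(json.dumps(profile, indent=2))
    return path
```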
Output
Provide the user with:
- Executive summary (3-5 bullet points)
- Path to full report
- Key visualizations inline
- Recommended next steps
Tips
- For large files (>100MB), use polars or duckdb instead of pandas
- For notebooks, use NotebookEdit to create reproducible analysis
- Reference the data-science skill for methodology details
