Data Analysis
by pjordan
Statistical analysis and data exploration workflows for effective data-driven insights. Use when analyzing datasets, performing exploratory data analysis (EDA), statistical hypothesis testing, data visualization planning, finding patterns and insights, or preparing analysis reports. Triggers include "analyze data", "data analysis", "statistics", "EDA", "exploratory analysis", "data patterns", or when working with datasets requiring statistical insights.
Skill Details
Repository Files
2 files in this skill directory
name: data-analysis description: Statistical analysis and data exploration workflows for effective data-driven insights. Use when analyzing datasets, performing exploratory data analysis (EDA), statistical hypothesis testing, data visualization planning, finding patterns and insights, or preparing analysis reports. Triggers include "analyze data", "data analysis", "statistics", "EDA", "exploratory analysis", "data patterns", or when working with datasets requiring statistical insights.
Data Analysis Skill
Statistical analysis and data exploration workflows for effective data-driven insights.
Overview
This skill provides Claude with a systematic approach to data analysis, including exploratory data analysis (EDA), statistical testing, visualization recommendations, and insight generation.
When to Use
- Analyzing datasets
- Exploratory data analysis
- Statistical hypothesis testing
- Data visualization planning
- Finding patterns and insights
- Preparing analysis reports
Prerequisites
- Basic statistics knowledge (mean, median, standard deviation, etc.)
- Understanding of data types (categorical, numerical)
- Familiarity with common data issues (missing values, outliers)
Instructions
Analysis Workflow
-
Understand the Data
- What is the source?
- What questions need answering?
- What's the business context?
-
Initial Exploration
- Number of rows and columns
- Data types of each column
- Missing value patterns
- Basic statistics (min, max, mean, etc.)
-
Data Quality Check
- Missing values: count and pattern
- Outliers: identify using IQR or z-scores
- Duplicates: check for repeated rows
- Inconsistencies: data type issues, invalid values
-
Descriptive Statistics
- Central tendency (mean, median, mode)
- Spread (standard deviation, range)
- Distribution shape (skewness, kurtosis)
- Percentiles (quartiles, deciles)
-
Relationships
- Correlations between variables
- Group comparisons
- Trends over time
- Categorical associations
-
Visualization Planning
- Choose appropriate chart types
- Identify key variables to visualize
- Consider audience and purpose
-
Insights & Recommendations
- Summarize key findings
- Answer original questions
- Suggest next steps
Statistical Tests Decision Tree
Comparing Two Groups:
- Numerical outcome → T-test or Mann-Whitney U
- Categorical outcome → Chi-square or Fisher's exact
Comparing Multiple Groups:
- Numerical outcome → ANOVA or Kruskal-Wallis
- Categorical outcome → Chi-square
Relationships:
- Two numerical variables → Pearson or Spearman correlation
- Numerical outcome with predictors → Regression analysis
Visualization Guidelines
Distributions:
- Histogram: single numerical variable
- Box plot: compare distributions across groups
- Density plot: smooth distribution view
Comparisons:
- Bar chart: categorical comparisons
- Box plot: numerical across categories
- Violin plot: distribution + comparison
Relationships:
- Scatter plot: two numerical variables
- Line chart: trends over time
- Heatmap: correlation matrix
Compositions:
- Pie chart: parts of a whole (use sparingly)
- Stacked bar: composition over categories
- Treemap: hierarchical composition
Examples
Example 1: Sales Data Analysis
Data:
date,product,sales,region
2024-01-01,A,150,North
2024-01-01,B,200,South
2024-01-02,A,180,North
...
Analysis Approach:
-
Initial Questions:
- Which products sell best?
- Are there regional differences?
- What are the trends over time?
-
Exploration:
# Shape and types print(df.shape) # (365, 4) print(df.dtypes) # Missing values print(df.isnull().sum()) # Basic statistics print(df['sales'].describe()) -
Key Metrics:
- Total sales by product
- Average sales per region
- Monthly sales trends
- Sales distribution
-
Visualizations:
- Line chart: sales trend over time
- Bar chart: total sales by product
- Grouped bar: sales by product and region
- Box plot: sales distribution by region
-
Statistical Tests:
- ANOVA: Do regions differ significantly?
- Correlation: Relationship between date and sales
- T-test: Product A vs B sales comparison
-
Insights:
- "Product B outsells Product A by 25% on average"
- "North region shows 15% higher sales than South"
- "Sales peak in Q4, suggesting seasonal pattern"
Example 2: Customer Survey Analysis
Data:
satisfaction,age,subscription_length
5,34,12
3,45,3
4,28,18
...
Analysis:
-
Data Quality:
- Check satisfaction scale (1-5)
- Verify age ranges (18-100)
- Identify missing values
-
Descriptive Stats:
- Mean satisfaction: 3.8
- Age distribution: mostly 25-45
- Subscription length: median 12 months
-
Relationships:
- Correlation: satisfaction vs. subscription_length (r = 0.6)
- Age groups: No significant difference in satisfaction
-
Segmentation:
- New users (<6 months): satisfaction = 3.2
- Established users (6-24 months): satisfaction = 4.1
- Long-term users (>24 months): satisfaction = 4.3
-
Visualization:
- Histogram: satisfaction score distribution
- Scatter: age vs satisfaction (colored by subscription length)
- Box plot: satisfaction across subscription length groups
-
Key Findings:
- "Satisfaction strongly correlates with subscription length"
- "New users are 28% less satisfied than established users"
- "Focus retention efforts on first 6 months"
Best Practices
Do's
- Always start with data quality checks
- Use appropriate statistical tests
- Consider practical significance, not just statistical
- Visualize before and during analysis
- Document assumptions
- Provide context for findings
Don'ts
- Don't assume causation from correlation
- Don't ignore missing data patterns
- Don't over-interpret small effects
- Don't use inappropriate tests
- Don't cherry-pick results
- Don't forget to check assumptions
Common Pitfalls
Simpson's Paradox
A trend in aggregated data can reverse when data is split into groups. Always check subgroups.
Multiple Testing
Running many tests increases false positive risk. Adjust significance levels accordingly.
Outliers
Outliers can dramatically affect results. Always identify and consider their impact.
Sample Size
Small samples may not generalize. Report sample sizes and confidence intervals.
Output Format
Structure analysis as:
## Executive Summary
[Brief overview of key findings]
## Data Overview
- Rows: X
- Columns: Y
- Time period: Z
- Key variables: ...
## Data Quality
[Missing values, outliers, issues]
## Descriptive Statistics
[Central tendency, spread, distribution]
## Key Relationships
[Correlations, comparisons, trends]
## Visualizations
[Recommended charts and what they show]
## Statistical Tests
[Tests performed and results]
## Insights
[Bullet points of key findings]
## Recommendations
[Action items based on analysis]
## Limitations
[Data quality issues, assumptions, caveats]
Tools and Resources
Python libraries:
- pandas: Data manipulation
- numpy: Numerical operations
- matplotlib/seaborn: Visualization
- scipy: Statistical tests
- statsmodels: Advanced statistics
R packages:
- dplyr: Data manipulation
- ggplot2: Visualization
- tidyr: Data tidying
- stats: Statistical tests
Limitations
- Can't execute code (provide code suggestions only)
- Can't see actual data (work from descriptions)
- Statistical advice is general (not specialized consulting)
- Can't create visualizations (can recommend them)
Related Skills
- Data Cleaning (preprocessing focus)
- Statistical Modeling (advanced techniques)
- Data Visualization (chart design)
Author
Claude Toolkit Community
Version
- Created: 2025-01-18
- Last Updated: 2025-01-18
- Version: 1.0.0
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
Xlsx
Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.
