Data Analysis

by NeverSight


Analyze datasets using Python with pandas, numpy, and visualization libraries. Generates statistical summaries, identifies patterns, creates charts, and provides insights. Use when analyzing CSV/Excel files, exploring data, creating visualizations, or when users mention data analysis, statistics, charts, or datasets.

Skill Details


name: data-analysis
description: Analyze datasets using Python with pandas, numpy, and visualization libraries. Generates statistical summaries, identifies patterns, creates charts, and provides insights. Use when analyzing CSV/Excel files, exploring data, creating visualizations, or when users mention data analysis, statistics, charts, or datasets.
license: MIT
metadata:
  author: agent-skills-demo
  version: "1.0"
  category: data-science
  compatibility: Requires Python 3.8+ with pandas, numpy, matplotlib, and seaborn

Data Analysis Skill

When to Use This Skill

Use this skill when:

  • Analyzing datasets (CSV, Excel, JSON)
  • Performing statistical analysis
  • Creating data visualizations
  • Identifying trends and patterns
  • Cleaning and preprocessing data
  • Users mention "analyze data", "statistics", "charts", "trends", or "insights"

Analysis Process

1. Data Loading & Initial Exploration

Load the data:

import pandas as pd
import numpy as np

# CSV files
df = pd.read_csv('data.csv')

# Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON files
df = pd.read_json('data.json')

# From database
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:pass@localhost/db')
df = pd.read_sql('SELECT * FROM table', engine)

Initial exploration:

# Basic information
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# First look at data
print("\nFirst 5 rows:")
print(df.head())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Basic statistics
print("\nDescriptive statistics:")
print(df.describe())

2. Data Cleaning

Handle missing values:

# Check missing data patterns
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print("Missing data percentage:")
print(missing_pct[missing_pct > 0])

# Drop columns with too many missing values
df = df.drop(columns=missing_pct[missing_pct > 50].index)

# Fill missing values (assignment form; chained fillna(inplace=True) is deprecated in modern pandas)
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])

# Or drop rows with missing values
df = df.dropna()

Handle duplicates:

# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

# Remove duplicates
df = df.drop_duplicates()

# Deduplicate by key column, keeping the first occurrence
df = df.drop_duplicates(subset=['id'], keep='first')

Data type conversions:

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Convert to category (saves memory)
df['category'] = df['category'].astype('category')

3. Statistical Analysis

See references/STATISTICS.md for detailed formulas.

Descriptive statistics:

# Central tendency
mean = df['column'].mean()
median = df['column'].median()
mode = df['column'].mode()[0]

# Dispersion
std = df['column'].std()
variance = df['column'].var()
range_val = df['column'].max() - df['column'].min()
iqr = df['column'].quantile(0.75) - df['column'].quantile(0.25)

# Distribution
skewness = df['column'].skew()
kurtosis = df['column'].kurtosis()

print(f"""
Statistics for {column}:
  Mean: {mean:.2f}
  Median: {median:.2f}
  Std Dev: {std:.2f}
  Range: {range_val:.2f}
  IQR: {iqr:.2f}
  Skewness: {skewness:.2f}
""")

Correlation analysis:

# Correlation matrix (numeric columns only)
numeric_columns = df.select_dtypes(include=np.number).columns
correlation = df[numeric_columns].corr()
print(correlation)

# Find strong correlations
strong_corr = correlation[(correlation > 0.7) | (correlation < -0.7)]
strong_corr = strong_corr[strong_corr != 1.0].stack()
print("\nStrong correlations:")
print(strong_corr)

Group analysis:

# Group by categorical variable
grouped = df.groupby('category').agg({
    'sales': ['sum', 'mean', 'count'],
    'profit': ['sum', 'mean'],
    'quantity': 'sum'
})
print(grouped)

# Multiple grouping
df.groupby(['region', 'category'])['sales'].sum().unstack()

4. Data Visualization

Distribution plots:

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Histogram
plt.figure()
df['column'].hist(bins=30, edgecolor='black')
plt.title('Distribution of Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.savefig('histogram.png', dpi=300, bbox_inches='tight')
plt.close()

# Box plot
plt.figure()
df.boxplot(column='value', by='category')
plt.title('Value by Category')
plt.suptitle('')  # Remove default title
plt.savefig('boxplot.png', dpi=300, bbox_inches='tight')
plt.close()

# Violin plot
plt.figure()
sns.violinplot(data=df, x='category', y='value')
plt.title('Value Distribution by Category')
plt.savefig('violin.png', dpi=300, bbox_inches='tight')
plt.close()

Relationship plots:

# Scatter plot
plt.figure()
plt.scatter(df['x'], df['y'], alpha=0.5)
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.title('X vs Y')
plt.savefig('scatter.png', dpi=300, bbox_inches='tight')
plt.close()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.close()

# Pair plot for multiple variables
sns.pairplot(df[['var1', 'var2', 'var3', 'category']], hue='category')
plt.savefig('pairplot.png', dpi=300, bbox_inches='tight')
plt.close()

Time series plots:

# Line plot
plt.figure()
df.set_index('date')['value'].plot()
plt.title('Value Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.savefig('timeseries.png', dpi=300, bbox_inches='tight')
plt.close()

# Multiple time series
df.pivot(index='date', columns='category', values='value').plot()
plt.title('Values by Category Over Time')
plt.legend(title='Category')
plt.savefig('timeseries_multi.png', dpi=300, bbox_inches='tight')
plt.close()

Categorical plots:

# Bar plot
category_counts = df['category'].value_counts()
plt.figure()
category_counts.plot(kind='bar')
plt.title('Count by Category')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.savefig('barplot.png', dpi=300, bbox_inches='tight')
plt.close()

# Stacked bar plot
df.groupby(['region', 'category'])['sales'].sum().unstack().plot(kind='bar', stacked=True)
plt.title('Sales by Region and Category')
plt.savefig('stacked_bar.png', dpi=300, bbox_inches='tight')
plt.close()

5. Advanced Analysis

Trend detection:

from scipy import stats

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['x'], df['y'])
print(f"Trend: slope={slope:.4f}, R²={r_value**2:.4f}, p={p_value:.4f}")

# Moving average
df['ma_7'] = df['value'].rolling(window=7).mean()
df['ma_30'] = df['value'].rolling(window=30).mean()

Outlier detection:

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['column']))
outliers = df[z_scores > 3]
print(f"Outliers detected: {len(outliers)}")

# IQR method
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column'] < Q1 - 1.5*IQR) | (df['column'] > Q3 + 1.5*IQR)]
print(f"Outliers by IQR: {len(outliers)}")

Statistical tests:

from scipy import stats

# T-test (compare two groups)
group1 = df[df['category'] == 'A']['value']
group2 = df[df['category'] == 'B']['value']
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"T-test: t={t_stat:.4f}, p={p_value:.4f}")

# ANOVA (compare multiple groups)
groups = [df[df['category'] == cat]['value'] for cat in df['category'].unique()]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F={f_stat:.4f}, p={p_value:.4f}")

# Chi-square test (categorical variables)
contingency_table = pd.crosstab(df['category1'], df['category2'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square: χ²={chi2:.4f}, p={p_value:.4f}")

6. Generate Report

Use the analysis script:

python scripts/analyze.py --file data.csv --output report.html
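The `scripts/analyze.py` file is not included here, so the sketch below is a hypothetical minimal version showing what such a script might do (the `--file`/`--output` flags mirror the command above; the function name `generate_report` is an assumption):

```python
# Hypothetical sketch of scripts/analyze.py: load a CSV and write a
# small HTML report with row/column counts, missing values, and stats.
import argparse
import sys

import pandas as pd


def generate_report(df: pd.DataFrame) -> str:
    """Build a minimal HTML report from a DataFrame."""
    parts = [
        "<h1>Data Analysis Report</h1>",
        f"<p>Rows: {len(df):,} | Columns: {len(df.columns)}</p>",
        "<h2>Missing values</h2>",
        df.isnull().sum().to_frame("missing").to_html(),
        "<h2>Descriptive statistics</h2>",
        df.describe().to_html(),
    ]
    return "\n".join(parts)


def main() -> None:
    parser = argparse.ArgumentParser(description="Generate an HTML analysis report")
    parser.add_argument("--file", required=True, help="Input CSV file")
    parser.add_argument("--output", required=True, help="Output HTML file")
    args = parser.parse_args()
    df = pd.read_csv(args.file)
    with open(args.output, "w") as f:
        f.write(generate_report(df))


if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

A real report script would add charts and findings; this only captures the loading/summarizing skeleton.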

Create summary:

summary = f"""
# Data Analysis Report

## Dataset Overview
- Rows: {len(df):,}
- Columns: {len(df.columns)}
- Date range: {df['date'].min()} to {df['date'].max()}

## Key Findings

### 1. [Finding Title]
[Description of finding]

### 2. [Finding Title]
[Description of finding]

## Statistical Summary
{df.describe().to_markdown()}

## Recommendations
1. [Recommendation based on analysis]
2. [Recommendation based on analysis]
"""

with open('analysis_report.md', 'w') as f:
    f.write(summary)

Best Practices

Memory Optimization

# Read large files in chunks
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    processed_chunk = chunk[chunk['value'] > 0]  # replace with your own per-chunk processing
    chunks.append(processed_chunk)
df = pd.concat(chunks)

# Optimize data types
df['int_col'] = df['int_col'].astype('int32')  # Instead of int64
df['float_col'] = df['float_col'].astype('float32')  # Instead of float64

Performance Tips

# Use vectorized operations instead of loops
# Bad
result = []
for value in df['column']:
    result.append(value * 2)

# Good
result = df['column'] * 2

# Use .query() for filtering
df_filtered = df.query('age > 30 and city == "NYC"')

# Use .loc for setting values
df.loc[df['age'] > 30, 'category'] = 'senior'

Reproducibility

# Set random seed
np.random.seed(42)

# Save processed data
df.to_csv('processed_data.csv', index=False)
df.to_parquet('processed_data.parquet')  # Better for large datasets

# Export analysis objects for later reuse
import pickle
with open('analysis_results.pkl', 'wb') as f:
    pickle.dump({'stats': summary_stats, 'model': fitted_model}, f)  # placeholders for your computed results

Common Analysis Types

Sales Analysis

# Total sales by period
sales_by_month = df.groupby(df['date'].dt.to_period('M'))['sales'].sum()

# Top products
top_products = df.groupby('product')['sales'].sum().sort_values(ascending=False).head(10)

# Growth rate
df['growth_rate'] = df['sales'].pct_change() * 100

Customer Analysis

# Customer segmentation
df['segment'] = pd.cut(df['total_purchases'], 
                       bins=[0, 100, 500, float('inf')],
                       labels=['Low', 'Medium', 'High'])

# Retention analysis
cohort = df.groupby(['cohort_month', 'purchase_month']).size()

Performance Analysis

# Year-over-year comparison
df['year'] = df['date'].dt.year
yoy = df.groupby('year')['metric'].sum()
yoy_growth = yoy.pct_change() * 100

Error Handling

import sys

try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print("Error: File not found")
    sys.exit(1)
except pd.errors.EmptyDataError:
    print("Error: File is empty")
    sys.exit(1)
except Exception as e:
    print(f"Error loading data: {e}")
    sys.exit(1)

# Validate data
assert not df.empty, "DataFrame is empty"
assert 'required_column' in df.columns, "Missing required column"
assert df['date'].dtype == 'datetime64[ns]', "Date column not in datetime format"

Output Guidelines

Always provide:

  1. Summary: High-level findings in plain language
  2. Statistics: Key numbers and metrics
  3. Visualizations: Charts that support findings
  4. Insights: Actionable conclusions
  5. Recommendations: Next steps based on analysis
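The five-part checklist above can be sketched as a small helper that assembles the pieces into one Markdown report (the function name `build_report` and its parameters are assumptions, not part of this skill's scripts):

```python
# Sketch: assemble summary, statistics, charts, insights, and
# recommendations into a single Markdown report string.
import pandas as pd


def build_report(df, summary, insights, recommendations, chart_paths):
    """Combine the five required outputs into one Markdown document."""
    stats_block = "    " + df.describe().to_string().replace("\n", "\n    ")
    lines = ["# Analysis Report", "", "## Summary", summary, ""]
    lines += ["## Statistics", "", stats_block, ""]
    lines += ["## Visualizations"]
    lines += [f"![chart]({path})" for path in chart_paths]
    lines += ["", "## Insights"]
    lines += [f"- {item}" for item in insights]
    lines += ["", "## Recommendations"]
    lines += [f"{n}. {rec}" for n, rec in enumerate(recommendations, 1)]
    return "\n".join(lines)
```

Usage: `build_report(df, "Sales grew steadily.", ["Q4 drives most revenue"], ["Collect more regional data"], ["histogram.png"])` returns a Markdown string ready to write to disk.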

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Use when Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) creating new spreadsheets with formulas and formatting, (2) reading or analyzing data, (3) modifying existing spreadsheets while preserving formulas, (4) data analysis and visualization in spreadsheets, or (5) recalculating formulas.


Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.


Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis


Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.


Team Composition Analysis

This skill should be used when the user asks to "plan team structure", "determine hiring needs", "design org chart", "calculate compensation", "plan equity allocation", or requests organizational design and headcount planning for a startup.


Startup Financial Modeling

This skill should be used when the user asks to "create financial projections", "build a financial model", "forecast revenue", "calculate burn rate", "estimate runway", "model cash flow", or requests 3-5 year financial planning for a startup.


Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.


Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.


Startup Metrics Framework

This skill should be used when the user asks about "key startup metrics", "SaaS metrics", "CAC and LTV", "unit economics", "burn multiple", "rule of 40", "marketplace metrics", or requests guidance on tracking and optimizing business performance metrics.


Skill Information

Category: Creative
License: MIT
Version: 1.0
Last Updated: 1/28/2026