Data Analysis
by NeverSight
Analyze datasets using Python with pandas, numpy, and visualization libraries. Generates statistical summaries, identifies patterns, creates charts, and provides insights. Use when analyzing CSV/Excel files, exploring data, creating visualizations, or when users mention data analysis, statistics, charts, or datasets.
Skill Details
name: data-analysis
description: Analyze datasets using Python with pandas, numpy, and visualization libraries. Generates statistical summaries, identifies patterns, creates charts, and provides insights. Use when analyzing CSV/Excel files, exploring data, creating visualizations, or when users mention data analysis, statistics, charts, or datasets.
license: MIT
metadata:
  author: agent-skills-demo
  version: "1.0"
  category: data-science
  compatibility: Requires Python 3.8+ with pandas, numpy, matplotlib, and seaborn
Data Analysis Skill
When to Use This Skill
Use this skill when:
- Analyzing datasets (CSV, Excel, JSON)
- Performing statistical analysis
- Creating data visualizations
- Identifying trends and patterns
- Cleaning and preprocessing data
- Users mention "analyze data", "statistics", "charts", "trends", or "insights"
Analysis Process
1. Data Loading & Initial Exploration
Load the data:
import pandas as pd
import numpy as np
# CSV files
df = pd.read_csv('data.csv')
# Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON files
df = pd.read_json('data.json')
# From database
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:pass@localhost/db')
df = pd.read_sql('SELECT * FROM table', engine)
Initial exploration:
# Basic information
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# First look at data
print("\nFirst 5 rows:")
print(df.head())
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
# Basic statistics
print("\nDescriptive statistics:")
print(df.describe())
2. Data Cleaning
Handle missing values:
# Check missing data patterns
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print("Missing data percentage:")
print(missing_pct[missing_pct > 0])
# Drop columns with too many missing values
df = df.drop(columns=missing_pct[missing_pct > 50].index)
# Fill missing values (assign back; inplace=True on a selected column
# triggers chained-assignment warnings in modern pandas)
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])
# Or drop rows with missing values
df = df.dropna()
Handle duplicates:
# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
# Remove duplicates
df = df.drop_duplicates()
# Keep only the first occurrence per key column
df = df.drop_duplicates(subset=['id'], keep='first')
Data type conversions:
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Convert to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Convert to category (saves memory)
df['category'] = df['category'].astype('category')
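The memory savings from the category dtype are easy to verify; a quick check, assuming 'category' is a low-cardinality string column:
# Compare object vs. category storage for the same column
before = df['category'].astype('object').memory_usage(deep=True)
after = df['category'].astype('category').memory_usage(deep=True)
print(f"object: {before / 1024:.1f} KB, category: {after / 1024:.1f} KB")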
3. Statistical Analysis
See references/STATISTICS.md for detailed formulas.
Descriptive statistics:
# Central tendency
mean = df['column'].mean()
median = df['column'].median()
mode = df['column'].mode()[0]
# Dispersion
std = df['column'].std()
variance = df['column'].var()
range_val = df['column'].max() - df['column'].min()
iqr = df['column'].quantile(0.75) - df['column'].quantile(0.25)
# Distribution
skewness = df['column'].skew()
kurtosis = df['column'].kurtosis()
print(f"""
Statistics for {column}:
Mean: {mean:.2f}
Median: {median:.2f}
Std Dev: {std:.2f}
Range: {range_val:.2f}
IQR: {iqr:.2f}
Skewness: {skewness:.2f}
""")
Correlation analysis:
# Correlation matrix (numeric columns only)
numeric_columns = df.select_dtypes(include='number').columns
correlation = df[numeric_columns].corr()
print(correlation)
# Find strong correlations (each pair appears twice in the stacked output)
strong_corr = correlation[(correlation > 0.7) | (correlation < -0.7)]
strong_corr = strong_corr[strong_corr != 1.0].stack()
print("\nStrong correlations:")
print(strong_corr)
Group analysis:
# Group by categorical variable
grouped = df.groupby('category').agg({
    'sales': ['sum', 'mean', 'count'],
    'profit': ['sum', 'mean'],
    'quantity': 'sum'
})
print(grouped)
# Multiple grouping
df.groupby(['region', 'category'])['sales'].sum().unstack()
4. Data Visualization
Distribution plots:
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
# Histogram
plt.figure()
df['column'].hist(bins=30, edgecolor='black')
plt.title('Distribution of Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.savefig('histogram.png', dpi=300, bbox_inches='tight')
plt.close()
# Box plot
plt.figure()
df.boxplot(column='value', by='category')
plt.title('Value by Category')
plt.suptitle('') # Remove default title
plt.savefig('boxplot.png', dpi=300, bbox_inches='tight')
plt.close()
# Violin plot
plt.figure()
sns.violinplot(data=df, x='category', y='value')
plt.title('Value Distribution by Category')
plt.savefig('violin.png', dpi=300, bbox_inches='tight')
plt.close()
Relationship plots:
# Scatter plot
plt.figure()
plt.scatter(df['x'], df['y'], alpha=0.5)
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.title('X vs Y')
plt.savefig('scatter.png', dpi=300, bbox_inches='tight')
plt.close()
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.close()
# Pair plot for multiple variables
sns.pairplot(df[['var1', 'var2', 'var3', 'category']], hue='category')
plt.savefig('pairplot.png', dpi=300, bbox_inches='tight')
plt.close()
Time series plots:
# Line plot
plt.figure()
df.set_index('date')['value'].plot()
plt.title('Value Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.savefig('timeseries.png', dpi=300, bbox_inches='tight')
plt.close()
# Multiple time series
df.pivot(index='date', columns='category', values='value').plot()
plt.title('Values by Category Over Time')
plt.legend(title='Category')
plt.savefig('timeseries_multi.png', dpi=300, bbox_inches='tight')
plt.close()
Categorical plots:
# Bar plot
category_counts = df['category'].value_counts()
plt.figure()
category_counts.plot(kind='bar')
plt.title('Count by Category')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.savefig('barplot.png', dpi=300, bbox_inches='tight')
plt.close()
# Stacked bar plot
df.groupby(['region', 'category'])['sales'].sum().unstack().plot(kind='bar', stacked=True)
plt.title('Sales by Region and Category')
plt.savefig('stacked_bar.png', dpi=300, bbox_inches='tight')
plt.close()
5. Advanced Analysis
Trend detection:
from scipy import stats
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['x'], df['y'])
print(f"Trend: slope={slope:.4f}, R²={r_value**2:.4f}, p={p_value:.4f}")
# Moving average
df['ma_7'] = df['value'].rolling(window=7).mean()
df['ma_30'] = df['value'].rolling(window=30).mean()
Outlier detection:
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['column']))
outliers = df[z_scores > 3]
print(f"Outliers detected: {len(outliers)}")
# IQR method
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column'] < Q1 - 1.5*IQR) | (df['column'] > Q3 + 1.5*IQR)]
print(f"Outliers by IQR: {len(outliers)}")
Statistical tests:
from scipy import stats
# T-test (compare two groups)
group1 = df[df['category'] == 'A']['value']
group2 = df[df['category'] == 'B']['value']
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"T-test: t={t_stat:.4f}, p={p_value:.4f}")
# ANOVA (compare multiple groups)
groups = [df[df['category'] == cat]['value'] for cat in df['category'].unique()]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F={f_stat:.4f}, p={p_value:.4f}")
# Chi-square test (categorical variables)
contingency_table = pd.crosstab(df['category1'], df['category2'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square: χ²={chi2:.4f}, p={p_value:.4f}")
6. Generate Report
Use the analysis script:
python scripts/analyze.py --file data.csv --output report.html
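The internals of scripts/analyze.py are not shown here; as a rough, hypothetical sketch of a script exposing that interface (the real script may differ):
# Hypothetical sketch only; not the actual scripts/analyze.py
import argparse
import pandas as pd

parser = argparse.ArgumentParser(description='Generate an HTML analysis report')
parser.add_argument('--file', required=True, help='input CSV file')
parser.add_argument('--output', default='report.html', help='output HTML path')
args = parser.parse_args()

df = pd.read_csv(args.file)
df.describe().to_html(args.output)  # stand-in for the full report logic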
Create summary:
summary = f"""
# Data Analysis Report
## Dataset Overview
- Rows: {len(df):,}
- Columns: {len(df.columns)}
- Date range: {df['date'].min()} to {df['date'].max()}
## Key Findings
### 1. [Finding Title]
[Description of finding]
### 2. [Finding Title]
[Description of finding]
## Statistical Summary
{df.describe().to_markdown()}
## Recommendations
1. [Recommendation based on analysis]
2. [Recommendation based on analysis]
"""
with open('analysis_report.md', 'w') as f:
    f.write(summary)
Best Practices
Memory Optimization
# Read large files in chunks
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    processed_chunk = process(chunk)  # process() stands in for your own per-chunk logic
    chunks.append(processed_chunk)
df = pd.concat(chunks, ignore_index=True)
# Optimize data types
df['int_col'] = df['int_col'].astype('int32') # Instead of int64
df['float_col'] = df['float_col'].astype('float32') # Instead of float64
Performance Tips
# Use vectorized operations instead of loops
# Bad
result = []
for value in df['column']:
    result.append(value * 2)
# Good
result = df['column'] * 2
# Use .query() for filtering
df_filtered = df.query('age > 30 and city == "NYC"')
# Use .loc for setting values
df.loc[df['age'] > 30, 'category'] = 'senior'
Reproducibility
# Set random seed
np.random.seed(42)
# Save processed data
df.to_csv('processed_data.csv', index=False)
df.to_parquet('processed_data.parquet') # Better for large datasets
# Export analysis
import pickle
with open('analysis_results.pkl', 'wb') as f:
    # 'summary_stats' and 'model' stand in for your own result objects
    pickle.dump({'stats': summary_stats, 'model': model}, f)
Common Analysis Types
Sales Analysis
# Total sales by period
sales_by_month = df.groupby(df['date'].dt.to_period('M'))['sales'].sum()
# Top products
top_products = df.groupby('product')['sales'].sum().sort_values(ascending=False).head(10)
# Growth rate (assumes rows are sorted by date)
df['growth_rate'] = df['sales'].pct_change() * 100
Customer Analysis
# Customer segmentation
df['segment'] = pd.cut(df['total_purchases'],
                       bins=[0, 100, 500, float('inf')],
                       labels=['Low', 'Medium', 'High'])
# Retention analysis
cohort = df.groupby(['cohort_month', 'purchase_month']).size()
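The cohort_month column above is typically derived from each customer's first purchase. A minimal sketch, assuming the data has customer_id and date columns:
# Label each purchase with its month and the customer's cohort (first-purchase month)
df['purchase_month'] = df['date'].dt.to_period('M')
df['cohort_month'] = df.groupby('customer_id')['date'].transform('min').dt.to_period('M')
# Distinct active customers per cohort per month, normalized by cohort size
cohort = df.groupby(['cohort_month', 'purchase_month'])['customer_id'].nunique().unstack(fill_value=0)
cohort_size = df.groupby('cohort_month')['customer_id'].nunique()
retention = cohort.div(cohort_size, axis=0)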
Performance Analysis
# Year-over-year comparison
df['year'] = df['date'].dt.year
yoy = df.groupby('year')['metric'].sum()
yoy_growth = yoy.pct_change() * 100
Error Handling
import sys

try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print("Error: File not found")
    sys.exit(1)
except pd.errors.EmptyDataError:
    print("Error: File is empty")
    sys.exit(1)
except Exception as e:
    print(f"Error loading data: {e}")
    sys.exit(1)
# Validate data
assert not df.empty, "DataFrame is empty"
assert 'required_column' in df.columns, "Missing required column"
assert df['date'].dtype == 'datetime64[ns]', "Date column not in datetime format"
Output Guidelines
Always provide (a minimal assembly sketch follows this list):
- Summary: High-level findings in plain language
- Statistics: Key numbers and metrics
- Visualizations: Charts that support findings
- Insights: Actionable conclusions
- Recommendations: Next steps based on analysis
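A minimal sketch of assembling those five pieces into a single markdown deliverable (the findings text here is illustrative):
# Assemble the five deliverables into one markdown report
sections = {
    'Summary': 'Plain-language overview of what the data shows.',
    'Statistics': df.describe().to_markdown(),
    'Visualizations': '![Distribution](histogram.png)',
    'Insights': 'Actionable conclusions drawn from the statistics and charts.',
    'Recommendations': '1. Next step one\n2. Next step two',
}
report = '\n\n'.join(f'## {title}\n\n{body}' for title, body in sections.items())
with open('analysis_report.md', 'w') as f:
    f.write(report)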
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Analyzing and visualizing data in spreadsheets, or (5) Recalculating formulas.
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis.
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Team Composition Analysis
This skill should be used when the user asks to "plan team structure", "determine hiring needs", "design org chart", "calculate compensation", "plan equity allocation", or requests organizational design and headcount planning for a startup.
Startup Financial Modeling
This skill should be used when the user asks to "create financial projections", "build a financial model", "forecast revenue", "calculate burn rate", "estimate runway", "model cash flow", or requests 3-5 year financial planning for a startup.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Startup Metrics Framework
This skill should be used when the user asks about "key startup metrics", "SaaS metrics", "CAC and LTV", "unit economics", "burn multiple", "rule of 40", "marketplace metrics", or requests guidance on tracking and optimizing business performance metrics.
