---
name: data-exploration
description: Explore and analyze pilot data sets to uncover patterns, anomalies, and initial insights. Use when performing ad-hoc data investigations, validating data quality, or preparing exploratory visualizations for hypothesis generation.
---
Data Exploration
Overview
The data-exploration skill enables systematic investigation of datasets to uncover patterns, validate quality, and surface insights that inform decision-making. This skill is used when you need to understand a new dataset, validate data quality before analysis, or generate initial findings for hypothesis formation.
Quick Start
Use this skill when you need to:
- Understand a new dataset: Get an initial feel for its structure, quality, and contents
- Validate data quality: Check for completeness, accuracy, and consistency
- Find patterns: Identify trends, correlations, or anomalies
- Generate insights: Surface findings that could inform strategic decisions
- Prepare for hypothesis testing: Establish baseline understanding before formal analysis
Workflow Decision Tree
```
New Dataset → Data Profiling → Quality Assessment → Pattern Discovery → Insight Generation
                    ↓                   ↓                   ↓                    ↓
          [Statistical Summary]  [Quality Report]    [Visual Analysis]   [Intelligence Brief]
```
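The sections below define a helper for each stage. A minimal end-to-end sketch of the pipeline, assuming those function names and an illustrative CSV path, looks like this:

```python
import pandas as pd

# Illustrative pipeline run: profile_dataset, assess_data_quality,
# explore_patterns_visual, and detect_anomalies are defined in the
# sections that follow. The CSV path is an example, not a real file.
df = pd.read_csv("service_desk_tickets_2025.csv")

profile = profile_dataset(df)             # Data Profiling
quality = assess_data_quality(df)         # Quality Assessment
explore_patterns_visual(df, "./outputs")  # Pattern Discovery (saves plots)
anomalies = detect_anomalies(df)          # Anomaly Detection

print(f"{profile['shape'][0]} rows, {len(quality['issues'])} quality issues, "
      f"{len(anomalies)} anomalies flagged")
```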
Data Profiling
Step 1: Initial Assessment
When you receive a new dataset, start with basic profiling:
```python
import pandas as pd
import numpy as np

def profile_dataset(df: pd.DataFrame) -> dict:
    """Generate initial dataset profile."""
    profile = {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtypes": df.dtypes.to_dict(),
        "memory_usage": df.memory_usage(deep=True).sum() / 1024**2,  # MB
        "missing_values": df.isnull().sum().to_dict(),
        "duplicate_rows": df.duplicated().sum()
    }
    return profile
```
Output: Dataset profile with shape, columns, types, memory usage, missing values
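For example, a quick profile of a CSV export might look like this (the file path is illustrative):

```python
df = pd.read_csv("tickets.csv")  # illustrative path
profile = profile_dataset(df)

print(f"Shape: {profile['shape']}")
print(f"Duplicate rows: {profile['duplicate_rows']}")
print(f"Memory usage: {profile['memory_usage']:.1f} MB")
```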
Step 2: Statistical Summary
Generate descriptive statistics for numerical and categorical columns:
```python
def generate_statistical_summary(df: pd.DataFrame) -> dict:
    """Create statistical summary of dataset."""
    summary = {}

    # Numerical columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        summary['numerical'] = df[numeric_cols].describe().to_dict()

    # Categorical columns
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    if len(categorical_cols) > 0:
        summary['categorical'] = {
            col: {
                'unique_values': df[col].nunique(),
                'top_5_values': df[col].value_counts().head(5).to_dict(),
                'missing': df[col].isnull().sum()
            }
            for col in categorical_cols
        }

    # Temporal columns
    datetime_cols = df.select_dtypes(include=['datetime64']).columns
    if len(datetime_cols) > 0:
        summary['temporal'] = {
            col: {
                'min': df[col].min(),
                'max': df[col].max(),
                'range_days': (df[col].max() - df[col].min()).days
            }
            for col in datetime_cols
        }

    return summary
```
Output: Statistical summary organized by column type
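A typical call, continuing the example above:

```python
summary = generate_statistical_summary(df)

# Quick look at the categorical columns, if any were found
if 'categorical' in summary:
    for col, stats in summary['categorical'].items():
        print(f"{col}: {stats['unique_values']} unique values, {stats['missing']} missing")
```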
Quality Assessment
Data Quality Checks
Apply systematic quality checks to identify issues:
```python
def assess_data_quality(df: pd.DataFrame) -> dict:
    """Assess data quality and identify issues."""
    quality_report = {
        "completeness": {},
        "consistency": {},
        "accuracy": {},
        "issues": []
    }

    # Completeness: Missing values
    missing_pct = (df.isnull().sum() / len(df) * 100).to_dict()
    quality_report["completeness"] = {
        col: f"{pct:.2f}%"
        for col, pct in missing_pct.items() if pct > 0
    }
    # Flag columns with >20% missing
    for col, pct in missing_pct.items():
        if pct > 20:
            quality_report["issues"].append({
                "type": "completeness",
                "severity": "high" if pct > 50 else "medium",
                "column": col,
                "message": f"{col} has {pct:.2f}% missing values"
            })

    # Consistency: Duplicate rows
    dup_count = df.duplicated().sum()
    if dup_count > 0:
        dup_pct = dup_count / len(df) * 100
        quality_report["consistency"]["duplicate_rows"] = f"{dup_count} ({dup_pct:.2f}%)"
        if dup_pct > 5:
            quality_report["issues"].append({
                "type": "consistency",
                "severity": "high" if dup_pct > 10 else "medium",
                "message": f"{dup_count} duplicate rows found ({dup_pct:.2f}%)"
            })

    # Consistency: Numeric values stored as text
    for col in df.select_dtypes(include=['object']).columns:
        numeric_pct = pd.to_numeric(df[col], errors='coerce').notna().mean()
        if numeric_pct > 0.8:  # If >80% of values parse as numeric
            quality_report["issues"].append({
                "type": "consistency",
                "severity": "low",
                "column": col,
                "message": f"{col} appears to contain numeric values but is stored as text"
            })

    # Accuracy: Outliers in numerical columns (3x IQR rule)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 3 * IQR) | (df[col] > Q3 + 3 * IQR)][col]
        if len(outliers) > 0:
            outlier_pct = len(outliers) / len(df) * 100
            quality_report["accuracy"][col] = f"{len(outliers)} outliers ({outlier_pct:.2f}%)"
            if outlier_pct > 5:
                quality_report["issues"].append({
                    "type": "accuracy",
                    "severity": "medium" if outlier_pct < 10 else "high",
                    "column": col,
                    "message": f"{col} has {len(outliers)} potential outliers ({outlier_pct:.2f}%)"
                })

    return quality_report
```
Output: Quality report with completeness, consistency, accuracy issues
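To act on the report, iterate over the flagged issues in severity order (a minimal sketch):

```python
report = assess_data_quality(df)

# Print issues, most severe first
severity_order = {"high": 0, "medium": 1, "low": 2}
for issue in sorted(report["issues"], key=lambda i: severity_order[i["severity"]]):
    print(f"[{issue['severity'].upper()}] {issue['message']}")
```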
Duke Data Governance Alignment
When working with Duke data, ensure alignment with data governance standards (a minimal metadata sketch follows the checklist):
Check:
- Data classification identified (Public, Internal, Confidential, Restricted)
- Privacy requirements assessed (PII, FERPA, HIPAA)
- Data retention policy noted
- Access controls documented
- Data lineage tracked
Reference: references/duke_data_governance.md
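A lightweight way to keep this checklist with the exploration outputs is a small metadata record. The field names below are illustrative, not a Duke governance standard:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class GovernanceRecord:
    """Illustrative governance metadata captured alongside an exploration."""
    classification: str                                 # Public / Internal / Confidential / Restricted
    privacy_flags: list = field(default_factory=list)  # e.g. ["PII", "FERPA"]
    retention_policy: str = "unspecified"
    access_controls: str = "unspecified"
    data_lineage: list = field(default_factory=list)   # source systems / files

record = GovernanceRecord(
    classification="Internal",
    privacy_flags=["PII"],
    data_lineage=["service_desk_tickets_2025.csv"],
)
print(asdict(record))
```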
Pattern Discovery
Visual Analysis
Create exploratory visualizations to identify patterns:
```python
import os
import matplotlib.pyplot as plt
import seaborn as sns

def explore_patterns_visual(df: pd.DataFrame, output_dir: str = "./outputs"):
    """Generate exploratory visualizations."""
    os.makedirs(output_dir, exist_ok=True)

    # Distribution plots for numerical columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        fig, axes = plt.subplots(len(numeric_cols), 1, figsize=(10, 3 * len(numeric_cols)))
        if len(numeric_cols) == 1:
            axes = [axes]
        for idx, col in enumerate(numeric_cols):
            axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black')
            axes[idx].set_title(f'Distribution of {col}')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
        plt.tight_layout()
        plt.savefig(f"{output_dir}/distributions.png")
        plt.close()

    # Correlation heatmap
    if len(numeric_cols) > 1:
        plt.figure(figsize=(10, 8))
        correlation = df[numeric_cols].corr()
        sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
        plt.title('Correlation Heatmap')
        plt.tight_layout()
        plt.savefig(f"{output_dir}/correlation_heatmap.png")
        plt.close()

    # Time series if datetime column exists
    datetime_cols = df.select_dtypes(include=['datetime64']).columns
    if len(datetime_cols) > 0 and len(numeric_cols) > 0:
        for date_col in datetime_cols:
            for num_col in numeric_cols[:3]:  # First 3 numeric columns
                plt.figure(figsize=(12, 4))
                df.sort_values(date_col).plot(x=date_col, y=num_col, ax=plt.gca())
                plt.title(f'{num_col} over time')
                plt.tight_layout()
                plt.savefig(f"{output_dir}/timeseries_{num_col}.png")
                plt.close()
```
Output: Distribution plots, correlation heatmap, time series visualizations
Anomaly Detection
Identify unusual patterns or outliers:
```python
from sklearn.ensemble import IsolationForest

def detect_anomalies(df: pd.DataFrame, contamination: float = 0.1) -> pd.DataFrame:
    """Detect anomalies using Isolation Forest."""
    # Select numerical columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) == 0:
        return pd.DataFrame()

    # Prepare data
    X = df[numeric_cols].fillna(df[numeric_cols].median())

    # Detect anomalies
    clf = IsolationForest(contamination=contamination, random_state=42)
    anomaly_labels = clf.fit_predict(X)

    # Add anomaly flag to dataframe
    df_with_anomalies = df.copy()
    df_with_anomalies['is_anomaly'] = anomaly_labels == -1
    return df_with_anomalies[df_with_anomalies['is_anomaly']]
```
Output: Dataframe containing detected anomalies
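For example, to flag unusual tickets and write them out for manual review (the contamination rate and output path are illustrative):

```python
anomalies = detect_anomalies(df, contamination=0.05)
print(f"{len(anomalies)} rows flagged as anomalous")
anomalies.to_csv("./outputs/anomalies_for_review.csv", index=False)
```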
Insight Generation
Generate Intelligence Brief
Synthesize findings into actionable intelligence:
Template Structure:
```markdown
# Data Exploration Report: [Dataset Name]

**Date**: [YYYY-MM-DD]
**Analyst**: Mayuri - Captain of Intelligence
**Dataset**: [description]

## Executive Summary
[2-3 sentences: Key findings and recommendations]

## Dataset Overview
- **Rows**: [count]
- **Columns**: [count]
- **Time Period**: [date range if applicable]
- **Data Quality Score**: [High/Medium/Low]

## Key Findings

### 1. [Finding Title]
**Observation**: [What was found]
**Evidence**: [Supporting statistics/visualizations]
**Implication**: [What this means for Duke]

### 2. [Finding Title]
...

## Data Quality Assessment
- **Completeness**: [summary of missing values]
- **Consistency**: [summary of duplicates, format issues]
- **Accuracy**: [summary of outliers, validation results]

### Issues Requiring Attention
- [ ] **High Priority**: [Issue description and recommended action]
- [ ] **Medium Priority**: [Issue description and recommended action]

## Patterns & Trends
- [Pattern 1]: [Description with supporting data]
- [Pattern 2]: [Description with supporting data]

## Anomalies Detected
- [Anomaly 1]: [Description, potential cause, recommended follow-up]
- [Anomaly 2]: [Description, potential cause, recommended follow-up]

## Recommendations
1. [Action item with rationale]
2. [Action item with rationale]
3. [Action item with rationale]

## Next Steps
- [ ] [Follow-up analysis needed]
- [ ] [Data quality improvements]
- [ ] [Stakeholder review required]

## Appendix
- Data Governance Classification: [Public/Internal/Confidential/Restricted]
- Privacy Considerations: [PII/FERPA/HIPAA notes]
- Data Sources: [List with access dates]
```
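As a minimal sketch, parts of the brief (here, the Dataset Overview section) can be pre-filled from the profiling and quality outputs; the quality-score heuristic is illustrative:

```python
from datetime import date

def render_overview(profile: dict, quality: dict) -> str:
    """Illustrative helper: pre-fill the Dataset Overview section of the brief."""
    high_issues = [i for i in quality["issues"] if i["severity"] == "high"]
    score = "Low" if high_issues else ("Medium" if quality["issues"] else "High")
    return "\n".join([
        f"**Date**: {date.today().isoformat()}",
        "",
        "## Dataset Overview",
        f"- **Rows**: {profile['shape'][0]}",
        f"- **Columns**: {profile['shape'][1]}",
        f"- **Data Quality Score**: {score}",
    ])
```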
Duke-Specific Considerations
Alignment with 2026 Goals
When exploring data, always consider alignment with Duke's 2026 goals:
- **Goal: Improve student service response times by 30%**
  - Data to explore: Service desk tickets, response times, resolution times
  - Look for: Patterns in request types, peak times, staff utilization
- **Goal: Enhance research computing capabilities**
  - Data to explore: HPC usage, grant activity, researcher feedback
  - Look for: Utilization trends, capacity constraints, service gaps
OIT Priority Alignment
Connect findings to OIT priorities:
- Infrastructure modernization
- Security posture improvement
- User experience enhancement
- Operational efficiency
Reference: references/duke_2026_goals.md
Common Use Cases
Use Case 1: Pilot Data Validation
Scenario: Strategy Captain requests validation of data for AI service desk pilot
Workflow:
- Load service desk ticket data
- Profile dataset (shape, columns, types)
- Assess quality (missing values, duplicates, consistency)
- Explore patterns (common issues, peak times, resolution patterns)
- Detect anomalies (unusual tickets, outlier response times)
- Generate intelligence brief for Strategy Captain
Use Case 2: Hypothesis Support
Scenario: Need data to support hypothesis about student service improvement
Workflow:
- Load relevant datasets (tickets, surveys, usage logs)
- Calculate baseline metrics (current response times, satisfaction scores; see the sketch after this list)
- Identify patterns (what's working, what's not)
- Find correlations (factors affecting response time)
- Generate findings to inform hypothesis testing
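A minimal sketch of the baseline and correlation steps, assuming hypothetical column names (response_hours, satisfaction_score) in the ticket data:

```python
def baseline_metrics(df: pd.DataFrame) -> dict:
    """Illustrative baseline metrics; the column names are assumptions, not a real schema."""
    return {
        "median_response_hours": df["response_hours"].median(),
        "p90_response_hours": df["response_hours"].quantile(0.90),
        "mean_satisfaction": df["satisfaction_score"].mean(),
    }

# Correlation between response time and satisfaction (illustrative column names)
correlation = df[["response_hours", "satisfaction_score"]].corr().loc[
    "response_hours", "satisfaction_score"
]
```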
Use Case 3: Early Warning Detection
Scenario: Monitor ongoing pilot for data quality degradation
Workflow:
- Establish baseline data quality metrics
- Set monitoring thresholds
- Regular quality checks (daily/weekly)
- Detect anomalies and trends
- Generate early warning signal if thresholds crossed
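A minimal sketch of the threshold check described above; the baseline values and thresholds are illustrative, not Duke standards:

```python
def check_quality_thresholds(df: pd.DataFrame, baseline: dict,
                             max_missing_pct: float = 10.0,
                             max_row_drop_pct: float = 20.0) -> list:
    """Illustrative early-warning check against baseline metrics."""
    warnings = []

    # Completeness: average missing-value percentage across columns
    missing_pct = df.isnull().mean().mean() * 100
    if missing_pct > max_missing_pct:
        warnings.append(f"Missing values at {missing_pct:.1f}% (threshold {max_missing_pct}%)")

    # Volume: row count dropping well below the established baseline
    min_rows = baseline["row_count"] * (1 - max_row_drop_pct / 100)
    if len(df) < min_rows:
        warnings.append(f"Row count {len(df)} is more than {max_row_drop_pct}% below baseline {baseline['row_count']}")

    return warnings

# Example: compare this week's extract against the baseline set at pilot start
alerts = check_quality_thresholds(df, baseline={"row_count": 15000})
```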
Tools & Libraries
Required Python Libraries
```bash
pip install pandas numpy matplotlib seaborn scikit-learn
```
Recommended Tools
- Pandas: Data manipulation and analysis
- NumPy: Numerical operations
- Matplotlib/Seaborn: Visualization
- Scikit-learn: Anomaly detection, statistical analysis
- Jupyter: Interactive exploration
Resources
scripts/
- explore_dataset.py: Main exploration script
- quality_checks.py: Data quality assessment utilities
- anomaly_detection.py: Anomaly detection implementations
references/
- duke_data_governance.md: Duke data governance standards
- duke_2026_goals.md: 2026 strategic goals
- privacy_requirements.md: Privacy and compliance requirements
- statistical_methods.md: Reference for statistical techniques
assets/
- exploration_template.ipynb: Jupyter notebook template
- quality_report_template.md: Quality report template
- intelligence_brief_template.md: Brief template
Best Practices
- Start Broad, Then Focus: Begin with overall profiling, then drill into interesting areas
- Document Assumptions: Note any assumptions made during exploration
- Visualize Early: Create visualizations to spot patterns humans can see but stats might miss
- Check Quality First: Don't analyze bad data - validate quality before deep analysis
- Align with Goals: Always connect findings back to Duke's strategic objectives
- Maintain Provenance: Log data sources, access dates, and transformations applied
- Privacy First: Be mindful of PII and sensitive data throughout exploration
- Iterate: Exploration is iterative - new findings lead to new questions
Decision Log Integration
Every data exploration should generate a decision log entry:
```python
from agents.mayuri_intelligence import MayuriIntelligence

mayuri = MayuriIntelligence()

# After exploration
mayuri.log_decision(
    decision_type="data_exploration_completed",
    inputs={
        "dataset": "service_desk_tickets_2025.csv",
        "rows": 15234,
        "date_range": "2025-01-01 to 2025-12-31"
    },
    outputs={
        "quality_score": "Medium",
        "issues_found": 3,
        "patterns_identified": 5,
        "brief_path": "./outputs/intelligence/BRIEF-20260105-143022.md"
    },
    rationale="Explored service desk data to establish baseline for AI chatbot pilot",
    metadata={
        "alignment": "2026 Goal: Improve student service response times",
        "stakeholder": "Strategy Captain (Kisuke)"
    }
)
```
Remember
- Quality over speed: Thorough exploration prevents bad decisions later
- Context matters: Duke's environment, goals, and constraints shape what's important
- Communicate findings: Insights are useless if not shared in actionable format
- Stay curious: The best insights often come from unexpected patterns
- Respect privacy: Duke data governance and privacy standards are non-negotiable