Data Science
by PetriLahdelma
Expert data science guidance for analytics, data processing, visualization, statistical analysis, machine learning, and AI integration. Use when analyzing data, building ML models, creating visualizations, processing datasets, conducting A/B tests, optimizing metrics, or integrating AI features. Includes Python (pandas, scikit-learn), data pipelines, and model deployment.
name: data-science
description: Expert data science guidance for analytics, data processing, visualization, statistical analysis, machine learning, and AI integration. Use when analyzing data, building ML models, creating visualizations, processing datasets, conducting A/B tests, optimizing metrics, or integrating AI features. Includes Python (pandas, scikit-learn), data pipelines, and model deployment.
allowed-tools: Read, Bash(python:), Bash(pip:), Bash(jupyter:), Bash(npx:)
Data Science Expert
Core Data Science Principles
When working with data and AI, always follow these principles:
- Data Quality First: Garbage in, garbage out (GIGO)
- Reproducibility: Code, environment, and random seeds must be version-controlled
- Interpretability: Prefer explainable models over black boxes when possible
- Bias Awareness: Audit for demographic, sampling, and algorithmic bias
- Privacy & Security: Never log PII, use encryption, follow GDPR/compliance
- Validation: Test on held-out data, avoid data leakage
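The reproducibility principle above can be sketched as a small seeding helper. This is a minimal stdlib-only sketch; extend it with framework-specific seeds (numpy, PyTorch) as the stack grows:

```python
import random

SEED = 42  # fix the seed so runs are reproducible

def set_seeds(seed: int = SEED) -> None:
    """Seed every RNG the analysis touches; add framework seeds as needed."""
    random.seed(seed)
    # np.random.seed(seed)     # if using numpy
    # torch.manual_seed(seed)  # if using PyTorch

set_seeds()
first = [random.random() for _ in range(3)]
set_seeds()
second = [random.random() for _ in range(3)]
assert first == second  # identical seeds give identical draws
```

Pin library versions (requirements.txt or a lockfile) alongside the seed, since algorithm changes between releases can also change results.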
Data Analysis Stack
Python Ecosystem
# Core Libraries
import pandas as pd # Data manipulation
import numpy as np # Numerical computing
import matplotlib.pyplot as plt # Visualization
import seaborn as sns # Statistical visualization
# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Deep Learning (optional)
import torch # PyTorch
# OR
import tensorflow as tf # TensorFlow
# AI/LLM Integration
from openai import OpenAI # OpenAI API
# OR
import anthropic # Claude API
JavaScript/TypeScript (Web Analytics)
// Analytics Events
import { track } from '@/lib/analytics';
track('button_clicked', {
component: 'CTA',
location: 'homepage',
variant: 'primary'
});
// Visualization
import { Chart } from 'chart.js';
import * as d3 from 'd3'; // For custom visualizations
Data Processing with Pandas
Loading Data
import pandas as pd
# CSV
df = pd.read_csv('data.csv')
# JSON (e.g., analytics export)
df = pd.read_json('analytics.json')
# Excel
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM events', conn)
# API (e.g., Google Analytics)
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())
Data Cleaning
# Inspect data
df.info() # Column types, missing values
df.describe() # Summary statistics
df.head() # First 5 rows
# Handle missing values
df.isnull().sum() # Count nulls per column
df.dropna() # Drop rows with nulls (returns a copy; reassign to keep)
df.fillna(0) # Fill nulls with 0
df['column'].fillna(df['column'].mean()) # Fill with column mean
# Remove duplicates (also returns a copy)
df.drop_duplicates()
# Fix data types
df['date'] = pd.to_datetime(df['date'])
df['count'] = df['count'].astype(int)
# Rename columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
# Filter rows
df[df['value'] > 100] # Boolean indexing
df.query('value > 100') # SQL-like syntax
Data Transformation
# Create new columns
df['total'] = df['price'] * df['quantity']
# Apply functions
df['upper_name'] = df['name'].apply(str.upper) # Or vectorized: df['name'].str.upper()
# Group and aggregate
df.groupby('category')['sales'].sum()
df.groupby('category').agg({
'sales': 'sum',
'quantity': 'mean',
'price': ['min', 'max']
})
# Pivot tables
df.pivot_table(
values='sales',
index='month',
columns='category',
aggfunc='sum'
)
# Merge datasets
df_merged = pd.merge(df1, df2, on='id', how='left')
Exploratory Data Analysis (EDA)
Statistical Summaries
import pandas as pd
import numpy as np
# Central tendency
df['column'].mean() # Average
df['column'].median() # Middle value
df['column'].mode() # Most frequent
# Spread
df['column'].std() # Standard deviation
df['column'].var() # Variance
df['column'].quantile([0.25, 0.5, 0.75]) # Quartiles
# Distribution
df['column'].skew() # Asymmetry (0 = symmetric; not bounded to ±1)
df['column'].kurtosis() # Tail heaviness
# Correlation
df.corr(numeric_only=True) # Correlation matrix (numeric columns only)
df['col1'].corr(df['col2']) # Pairwise correlation
Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
# Distribution plot
sns.histplot(df['value'], kde=True)
plt.title('Value Distribution')
plt.show()
# Box plot (outlier detection)
sns.boxplot(x='category', y='value', data=df)
plt.title('Value by Category')
plt.show()
# Scatter plot (relationships)
sns.scatterplot(x='feature1', y='feature2', hue='category', data=df)
plt.title('Feature Relationship')
plt.show()
# Correlation heatmap
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()
# Time series
df.set_index('date')['value'].plot()
plt.title('Value Over Time')
plt.show()
A/B Testing & Statistical Significance
Chi-Square Test (Categorical)
Use Case: Test if conversion rates differ between variants
import scipy.stats as stats
# Data
# Converted Not Converted
# Variant A: 120 880
# Variant B: 150 850
observed = [[120, 880], [150, 850]]
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f'Chi-squared: {chi2:.4f}')
print(f'P-value: {p_value:.4f}')
if p_value < 0.05:
print('Statistically significant difference (reject null hypothesis)')
else:
print('No significant difference (fail to reject null hypothesis)')
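A significant p-value says nothing about the size of the effect, so it is worth reporting the observed rates and lift alongside it. A plain-Python sketch using the counts from the contingency table above:

```python
# Counts from the contingency table above
conv_a, total_a = 120, 120 + 880
conv_b, total_b = 150, 150 + 850

rate_a = conv_a / total_a          # baseline conversion rate
rate_b = conv_b / total_b          # variant conversion rate
abs_lift = rate_b - rate_a         # absolute difference in conversion rate
rel_lift = abs_lift / rate_a       # lift relative to the baseline

print(f'A: {rate_a:.1%}  B: {rate_b:.1%}')
print(f'Absolute lift: {abs_lift:+.1%}  Relative lift: {rel_lift:+.1%}')
```

Here the variant converts at 15% vs a 12% baseline: a 3-point absolute lift, 25% relative.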
T-Test (Continuous)
Use Case: Test if average session duration differs
from scipy.stats import ttest_ind
# Data
variant_a = [45, 52, 48, 60, 55, ...] # Session durations
variant_b = [50, 58, 62, 65, 70, ...]
t_stat, p_value = ttest_ind(variant_a, variant_b)
print(f'T-statistic: {t_stat:.4f}')
print(f'P-value: {p_value:.4f}')
if p_value < 0.05:
print('Statistically significant difference')
Sample Size Calculator
from statsmodels.stats.power import zt_ind_solve_power
# Calculate required sample size
effect_size = 0.1 # Standardized effect size (Cohen's d); 0.1 is a small effect, not a 10% lift
alpha = 0.05 # Significance level
power = 0.8 # Statistical power (80%)
sample_size = zt_ind_solve_power(
effect_size=effect_size,
alpha=alpha,
power=power,
alternative='two-sided'
)
print(f'Required sample size per variant: {int(sample_size)}')
Machine Learning Workflow
1. Data Preparation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Load data
df = pd.read_csv('data.csv')
# Separate features and target
X = df.drop('target', axis=1)
y = df['target']
# Encode categorical variables (LabelEncoder is intended for targets;
# prefer OrdinalEncoder or OneHotEncoder for features)
le = LabelEncoder()
X['category'] = le.fit_transform(X['category'])
# Handle missing values (note: the full-dataset mean leaks test statistics
# into training; in practice, fit imputation on the train split only)
X = X.fillna(X.mean())
# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Feature scaling (important for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
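Computing imputation statistics before the split lets test-set information leak into training. A leak-free sketch wraps preprocessing in a scikit-learn Pipeline so every statistic is fit on the training fold only (toy numeric data here, purely for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy numeric data with a missing value
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0],
              [4.0, 8.0], [5.0, 10.0], [6.0, 12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# fit() learns imputation means and scaling stats on X_train only;
# the pipeline reuses them on X_test, so nothing leaks.
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression(random_state=42)),
])
pipe.fit(X_train, y_train)
print(f'Test accuracy: {pipe.score(X_test, y_test):.2f}')
```

The same pipeline object can then be passed to cross-validation or saved as a single artifact, keeping preprocessing and model in sync.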
2. Model Training (Classification)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Random Forest (good baseline)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Logistic Regression (interpretable)
lr = LogisticRegression(random_state=42)
lr.fit(X_train_scaled, y_train)
# Support Vector Machine (complex boundaries)
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train_scaled, y_train)
3. Model Evaluation
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
classification_report
)
# Predictions
y_pred = rf.predict(X_test)
# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
# Confusion Matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
# Detailed report
print(classification_report(y_test, y_pred))
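A single train/test split can be noisy; cross-validation reports a mean and spread for the same metric. A sketch with synthetic data standing in for the section's X and y:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the real features/target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(rf, X_demo, y_demo, cv=5, scoring='f1_weighted')
print(f'F1 (weighted): {scores.mean():.3f} ± {scores.std():.3f}')
```

A large spread across folds is a warning sign that the held-out estimate from one split is unreliable.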
4. Feature Importance
import matplotlib.pyplot as plt
import numpy as np
# For tree-based models
importances = rf.feature_importances_
features = X.columns
# Sort by importance
indices = np.argsort(importances)[::-1]
# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), features[indices], rotation=45)
plt.title('Feature Importances')
plt.tight_layout()
plt.show()
# Top 5 features
for i in range(5):
print(f'{features[indices[i]]}: {importances[indices[i]]:.4f}')
5. Model Persistence
import joblib
# Save model
joblib.dump(rf, 'model.pkl')
# Load model
loaded_model = joblib.load('model.pkl')
# Make predictions
new_data = [[1, 2, 3, 4, 5]]
prediction = loaded_model.predict(new_data)
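Pickled models can break silently across library versions, so it helps to persist metadata alongside the artifact. A sketch, assuming the joblib workflow above; save_model and load_model are hypothetical helpers, not library functions:

```python
import json
from datetime import datetime, timezone

import joblib
import sklearn

def save_model(model, path_prefix, feature_names):
    """Persist the model plus the context needed to load it safely later."""
    joblib.dump(model, f'{path_prefix}.pkl')
    meta = {
        'sklearn_version': sklearn.__version__,  # pickles are version-sensitive
        'features': list(feature_names),         # expected input column order
        'saved_at': datetime.now(timezone.utc).isoformat(),
    }
    with open(f'{path_prefix}.json', 'w') as f:
        json.dump(meta, f, indent=2)

def load_model(path_prefix):
    """Load the model and warn if the sklearn version has drifted."""
    with open(f'{path_prefix}.json') as f:
        meta = json.load(f)
    if meta['sklearn_version'] != sklearn.__version__:
        print(f"Warning: model saved with sklearn {meta['sklearn_version']}")
    return joblib.load(f'{path_prefix}.pkl'), meta
```

Recording the feature list also catches the common bug of passing columns in the wrong order at inference time.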
AI/LLM Integration
OpenAI API (GPT-4)
from openai import OpenAI
import os
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
# Chat completion
response = client.chat.completions.create(
model='gpt-4',
messages=[
{'role': 'system', 'content': 'You are a helpful design system assistant.'},
{'role': 'user', 'content': 'Suggest 5 color names for a tech startup.'}
],
temperature=0.7,
max_tokens=200
)
print(response.choices[0].message.content)
Anthropic API (Claude)
import anthropic
import os
client = anthropic.Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))
# Message creation
message = client.messages.create(
model='claude-sonnet-4-20250514',
max_tokens=1024,
messages=[
{'role': 'user', 'content': 'Explain design tokens in simple terms.'}
]
)
print(message.content[0].text)
Embeddings for Semantic Search
from openai import OpenAI
client = OpenAI()
# Create embeddings
def get_embedding(text, model='text-embedding-3-small'):
response = client.embeddings.create(input=text, model=model)
return response.data[0].embedding
# Example: Search documentation
docs = [
'Design tokens are reusable design decisions.',
'Components are UI building blocks.',
'Accessibility ensures inclusive experiences.'
]
# Generate embeddings
embeddings = [get_embedding(doc) for doc in docs]
# Query
query = 'What are design tokens?'
query_embedding = get_embedding(query)
# Calculate similarity (cosine)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([query_embedding], embeddings)[0]
best_match_idx = np.argmax(similarities)
print(f'Best match: {docs[best_match_idx]}')
print(f'Similarity: {similarities[best_match_idx]:.4f}')
Web Analytics Integration
Google Analytics 4 (GA4) API
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
RunReportRequest,
Dimension,
Metric,
DateRange
)
import pandas as pd
PROPERTY_ID = '123456789' # Placeholder: your GA4 property ID
# Initialize client (auth via GOOGLE_APPLICATION_CREDENTIALS)
client = BetaAnalyticsDataClient()
# Run report
request = RunReportRequest(
property=f'properties/{PROPERTY_ID}',
dimensions=[
Dimension(name='pagePath'),
Dimension(name='deviceCategory')
],
metrics=[
Metric(name='sessions'),
Metric(name='screenPageViews'),
Metric(name='averageSessionDuration')
],
date_ranges=[DateRange(start_date='30daysAgo', end_date='today')]
)
response = client.run_report(request)
# Process results
data = []
for row in response.rows:
data.append({
'page': row.dimension_values[0].value,
'device': row.dimension_values[1].value,
'sessions': int(row.metric_values[0].value),
'pageviews': int(row.metric_values[1].value),
'avg_duration': float(row.metric_values[2].value)
})
df = pd.DataFrame(data)
print(df.head())
Vercel Analytics API
// Track custom events in Next.js
import { track } from '@vercel/analytics';
track('service_page_viewed', {
service: 'design-system-lift-off',
source: 'homepage-cta'
});
// Track conversions
track('consultation_booked', {
service: 'ai-ux-sprint',
value: 12000 // Revenue (optional)
});
Data Visualization Best Practices
Choosing Chart Types
| Data Type | Question | Chart Type |
|---|---|---|
| Single Value | What's the total? | Big Number, KPI Card |
| Comparison | How do categories compare? | Bar Chart (horizontal for long labels) |
| Distribution | What's the range? | Histogram, Box Plot |
| Trend | How does it change over time? | Line Chart, Area Chart |
| Part-to-Whole | What's the composition? | Pie Chart (3-5 slices), Stacked Bar |
| Relationship | How do two variables relate? | Scatter Plot |
| Ranking | What's the order? | Sorted Bar Chart |
| Geographic | Where is it happening? | Map, Choropleth |
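The table above can be encoded as a small lookup for quick defaults. A sketch; the keys and labels are illustrative, not an exhaustive taxonomy:

```python
# Default chart per data/question type, mirroring the table above
CHART_FOR = {
    'single_value': 'big number / KPI card',
    'comparison': 'bar chart (horizontal for long labels)',
    'distribution': 'histogram or box plot',
    'trend': 'line chart',
    'part_to_whole': 'stacked bar (pie only for 3-5 slices)',
    'relationship': 'scatter plot',
    'ranking': 'sorted bar chart',
    'geographic': 'choropleth map',
}

def suggest_chart(data_type: str) -> str:
    """Return a sensible default chart; fall back to a bar chart."""
    return CHART_FOR.get(data_type, 'unknown type: start with a bar chart')

print(suggest_chart('trend'))  # line chart
```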
Chart.js (Web)
import { Chart } from 'chart.js/auto';
const ctx = document.getElementById('myChart') as HTMLCanvasElement;
const chart = new Chart(ctx, {
type: 'line',
data: {
labels: ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
datasets: [{
label: 'Website Traffic',
data: [1200, 1900, 1500, 2200, 2800],
borderColor: 'rgb(0, 102, 204)',
backgroundColor: 'rgba(0, 102, 204, 0.1)',
tension: 0.3
}]
},
options: {
responsive: true,
plugins: {
title: {
display: true,
text: 'Monthly Website Traffic'
}
},
scales: {
y: {
beginAtZero: true
}
}
}
});
D3.js (Custom Visualizations)
import * as d3 from 'd3';
// Simple bar chart
const data = [30, 80, 45, 60, 20, 90, 50];
const width = 400;
const height = 200;
const svg = d3.select('#chart')
.append('svg')
.attr('width', width)
.attr('height', height);
const x = d3.scaleBand()
.domain(data.map((d, i) => i.toString()))
.range([0, width])
.padding(0.1);
const y = d3.scaleLinear()
.domain([0, d3.max(data) || 0])
.range([height, 0]);
svg.selectAll('rect')
.data(data)
.enter()
.append('rect')
.attr('x', (d, i) => x(i.toString()) || 0)
.attr('y', d => y(d))
.attr('width', x.bandwidth())
.attr('height', d => height - y(d))
.attr('fill', 'steelblue');
Data Pipeline Architecture
ETL Pattern (Extract, Transform, Load)
import pandas as pd
import requests
from datetime import datetime
def extract():
"""Extract data from source"""
# From API
response = requests.get('https://api.example.com/data')
data = response.json()
return pd.DataFrame(data)
def transform(df):
"""Clean and transform data"""
# Remove nulls
df = df.dropna()
# Fix types
df['date'] = pd.to_datetime(df['date'])
df['value'] = df['value'].astype(float)
# Add computed columns
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
# Filter
df = df[df['value'] > 0]
return df
def load(df, output_path):
"""Load data to destination"""
# To CSV
df.to_csv(output_path, index=False)
# OR to database
# df.to_sql('table_name', conn, if_exists='replace')
# Run pipeline
if __name__ == '__main__':
print('Starting ETL pipeline...')
# Extract
raw_data = extract()
print(f'Extracted {len(raw_data)} rows')
# Transform
clean_data = transform(raw_data)
print(f'Transformed to {len(clean_data)} rows')
# Load
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output = f'data/processed_{timestamp}.csv'
load(clean_data, output)
print(f'Loaded to {output}')
Scheduled Jobs (Python schedule library)
# schedule_analytics.py
import schedule
import time
def fetch_daily_analytics():
"""Fetch and process analytics data"""
print('Fetching analytics...')
# Extract, transform, load
# ...
# Schedule daily at 2 AM
schedule.every().day.at('02:00').do(fetch_daily_analytics)
# Run scheduler
while True:
schedule.run_pending()
time.sleep(60)
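In production, a plain cron entry is often more robust than a long-running Python loop, since cron survives reboots and crashed processes. An equivalent crontab sketch; the script and log paths are placeholders:

```shell
# crontab -e — run the ETL script daily at 02:00, appending output to a log
0 2 * * * /usr/bin/python3 /path/to/schedule_analytics.py >> /var/log/analytics.log 2>&1
```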
Privacy & Security
Data Anonymization
import hashlib
def anonymize_email(email):
"""Hash email for pseudonymization (note: unsalted hashes of enumerable
values like emails remain vulnerable to dictionary attacks)"""
return hashlib.sha256(email.encode()).hexdigest()
# Example
df['email_hash'] = df['email'].apply(anonymize_email)
df.drop('email', axis=1, inplace=True) # Remove PII
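Because email addresses are enumerable, a plain unsalted hash can be reversed by a dictionary attack. A keyed HMAC with a secret pepper is a stronger pseudonymization sketch; ANON_PEPPER is a hypothetical env var — in practice, load the key from a secrets manager:

```python
import hashlib
import hmac
import os

# Secret key: hypothetical env var; never hard-code secrets in production
PEPPER = os.environ.get('ANON_PEPPER', 'dev-only-pepper').encode()

def pseudonymize_email(email: str) -> str:
    """Stable keyed hash: the same input maps to the same token (usable for
    joins), but without the key an attacker cannot build a lookup table."""
    normalized = email.strip().lower()  # normalize so variants match
    return hmac.new(PEPPER, normalized.encode(), hashlib.sha256).hexdigest()

print(pseudonymize_email('User@Example.com'))
```

Under GDPR this is pseudonymization, not anonymization: whoever holds the key can still link tokens, so the key itself must be protected and rotated.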
GDPR Compliance Checklist
- Consent: Explicit opt-in for analytics tracking
- Right to Access: API for users to request their data
- Right to Deletion: Ability to purge user data
- Data Minimization: Only collect necessary data
- Encryption: Encrypt data at rest and in transit
- Audit Logs: Track who accessed what data and when
- Data Retention: Auto-delete after retention period
Rate Limiting (API Protection)
from functools import wraps
from time import time
# Simple rate limiter
rate_limit = {}
def rate_limit_decorator(max_calls=10, period=60):
"""Allow max_calls per period (seconds)"""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
now = time()
key = func.__name__
if key not in rate_limit:
rate_limit[key] = []
# Remove old calls outside period
rate_limit[key] = [t for t in rate_limit[key] if now - t < period]
if len(rate_limit[key]) >= max_calls:
raise Exception(f'Rate limit exceeded: {max_calls} calls per {period}s')
rate_limit[key].append(now)
return func(*args, **kwargs)
return wrapper
return decorator
@rate_limit_decorator(max_calls=5, period=60)
def call_ai_api(prompt):
# API call
pass
Model Deployment (Production)
Serverless Function (Vercel)
# api/predict.py
import joblib
import json
from http.server import BaseHTTPRequestHandler
# Load model once (cold start)
model = joblib.load('model.pkl')
class handler(BaseHTTPRequestHandler):
def do_POST(self):
# Read request body
content_length = int(self.headers['Content-Length'])
body = self.rfile.read(content_length)
data = json.loads(body)
# Make prediction
features = data['features']
prediction = model.predict([features])[0]
probability = model.predict_proba([features])[0].tolist()
# Return response
self.send_response(200)
self.send_header('Content-type', 'application/json')
self.end_headers()
response = {
'prediction': int(prediction),
'probability': probability
}
self.wfile.write(json.dumps(response).encode())
Docker Container
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model and code
COPY model.pkl .
COPY app.py .
# Expose port
EXPOSE 8000
# Run app
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
# app.py
from fastapi import FastAPI
import joblib
app = FastAPI()
model = joblib.load('model.pkl')
@app.post('/predict')
async def predict(features: list[float]):
prediction = model.predict([features])[0]
return {'prediction': int(prediction)}
Metrics for Design Systems
Component Usage Analytics
import pandas as pd
# Sample data: component usage logs
data = {
'component': ['Button', 'Card', 'Button', 'Input', 'Card', 'Button'],
'page': ['home', 'about', 'contact', 'contact', 'work', 'home'],
'timestamp': ['2025-01-01', '2025-01-01', '2025-01-02', '2025-01-02', '2025-01-03', '2025-01-03']
}
df = pd.DataFrame(data)
# Top components
component_usage = df['component'].value_counts()
print('Top Components:')
print(component_usage)
# Usage by page
page_usage = df.groupby(['page', 'component']).size().unstack(fill_value=0)
print('\nUsage by Page:')
print(page_usage)
# Growth over time
df['date'] = pd.to_datetime(df['timestamp'])
daily_usage = df.groupby('date').size()
print('\nDaily Usage:')
print(daily_usage)
Design Token Adoption
# Track which tokens are actually used in production
import re
def extract_tokens_from_css(css_content):
"""Extract CSS variable usage from stylesheets"""
pattern = r'var\((--[a-zA-Z0-9-]+)\)'
tokens = re.findall(pattern, css_content)
return tokens
# Example
css = """
.button {
background-color: var(--color-primary);
padding: var(--space-16);
border-radius: var(--radius-md);
}
"""
tokens_used = extract_tokens_from_css(css)
print(f'Tokens used: {set(tokens_used)}')
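Comparing extracted usage against the defined token set yields an adoption rate plus the two lists worth acting on. A sketch; the token names below are illustrative:

```python
# Illustrative token sets: what the design system defines vs. what CSS uses
defined = {'--color-primary', '--color-secondary', '--space-16', '--radius-md'}
used = {'--color-primary', '--space-16', '--radius-md'}

adoption = len(used & defined) / len(defined)
orphans = defined - used   # defined but never used: candidates for removal
rogue = used - defined     # used but never defined: missing from the system

print(f'Adoption: {adoption:.0%}')
print(f'Orphan tokens: {sorted(orphans)}')
print(f'Rogue tokens: {sorted(rogue)}')
```

Tracking adoption over releases turns this into a design system health metric.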
When to Use This Skill
Activate this skill when:
- Analyzing user behavior or analytics data
- Conducting A/B tests or experiments
- Building machine learning models
- Processing large datasets
- Creating data visualizations
- Integrating AI/LLM features
- Designing data pipelines
- Calculating statistical significance
- Extracting insights from logs
- Forecasting trends or metrics
- Optimizing conversion funnels
- Measuring design system adoption
- Deploying ML models to production
- Ensuring GDPR/privacy compliance
Remember: Always validate assumptions with data, test models rigorously, and prioritize user privacy and security.