# Data Science Tools

by mikkelkrogsholm

Documentation of available data science libraries (scipy, numpy, pandas, sklearn) and best practices for statistical analysis, regression modeling, and organizing analysis scripts. **CRITICAL:** All analysis scripts MUST be placed in `reports/{topic}/scripts/`, NOT in the root `scripts/` directory.
# Data Science Tools Skill

## Purpose

This skill documents the data science ecosystem available in this project, including:

- Which Python libraries are installed and available
- How to use them for statistical analysis and regression
- WHERE to place analysis scripts (`reports/{topic}/scripts/` - NOT root `scripts/`)
- Best practices for reproducible data science
## 🚨 CRITICAL: Script Organization Rule

ALL regression, modeling, and analysis scripts MUST go in:

```
reports/{topic}_{timestamp}/scripts/
```

NEVER in:

```
scripts/   ❌ (root scripts/ is only for reusable utilities)
```

See the Script Organization Best Practices section below.
## Available Libraries

### Installed in .venv Virtual Environment

The following data science libraries are installed and ready to use:

| Library | Version | Purpose |
|---|---|---|
| numpy | Latest | Numerical computing, arrays, linear algebra |
| scipy | 1.16.3+ | Scientific computing, optimization, statistics |
| pandas | 2.3.3+ | Data manipulation, DataFrames, time series |
| scikit-learn | 1.7.2+ | Machine learning, regression, clustering |
### Activating the Virtual Environment

All Python scripts must use the virtual environment:

```bash
source .venv/bin/activate && python scripts/your_script.py
```

Or add a shebang to scripts:

```python
#!/usr/bin/env python3
# Then run directly: ./scripts/your_script.py
```

In Bash tool calls:

```bash
source .venv/bin/activate && python scripts/analysis.py
```
## Common Use Cases

### 1. Regression Modeling (scipy.optimize.curve_fit)

**Purpose:** Fit non-linear models to data (S-curves, exponential, etc.)

**Example: Logistic Regression**

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

# Define model
def logistic(t, L, k, t0):
    """Logistic S-curve: L / (1 + exp(-k*(t - t0)))"""
    return L / (1 + np.exp(-k * (t - t0)))

# Prepare data
years = np.array([1993, 1994, ...])     # Time points
shares = np.array([0.004, 0.005, ...])  # Observed values
t = years - 1993  # Normalize time

# Fit model with bounds
p0 = [80, 0.5, 30]  # Initial guess: L=80%, k=0.5, t0=30
bounds = ([50, 0.1, 20], [100, 2.0, 50])  # Parameter bounds

params, covariance = curve_fit(
    logistic, t, shares,
    p0=p0,
    bounds=bounds,
    maxfev=10000
)
L, k, t0 = params

# Validate
predictions = logistic(t, L, k, t0)
r2 = r2_score(shares, predictions)
rmse = np.sqrt(np.mean((shares - predictions)**2))

print(f"Fitted parameters: L={L:.2f}, k={k:.4f}, t0={t0:.2f}")
print(f"R² = {r2:.6f}, RMSE = {rmse:.4f}")
```
⚠️ **Important:** Always use `curve_fit` with:

- An initial guess (`p0`)
- Bounds on parameters (prevents unrealistic values)
- `maxfev` high enough to allow sufficient iterations
### 2. Model Comparison

Compare multiple models to find the best fit:

```python
# logistic, t, shares as defined in the previous example
def gompertz(t, L, k, t0):
    """Gompertz S-curve: L * exp(-exp(-k*(t - t0)))"""
    return L * np.exp(-np.exp(-k * (t - t0)))

models = {
    'logistic': (logistic, [80, 0.5, 30], ([50, 0.1, 20], [100, 2.0, 50])),
    'gompertz': (gompertz, [80, 0.2, 30], ([50, 0.05, 20], [100, 1.0, 50])),
}

results = {}
for name, (func, p0, bounds) in models.items():
    params, _ = curve_fit(func, t, shares, p0=p0, bounds=bounds, maxfev=10000)
    pred = func(t, *params)
    r2 = r2_score(shares, pred)
    results[name] = {'params': params, 'r2': r2}

# Find best
best_model = max(results.items(), key=lambda x: x[1]['r2'])
print(f"Best model: {best_model[0]} (R² = {best_model[1]['r2']:.6f})")
```
### 3. Data Manipulation with Pandas

Read CSV, filter, aggregate:

```python
import pandas as pd

# Read data
df = pd.read_csv('data/ev_annual_bil10.csv')

# Filter
recent = df[df['year'] >= 2015]

# Aggregate
yearly_avg = df.groupby('year')['ev_share_pct'].mean()

# Export
df.to_csv('data/results.csv', index=False)
```
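The annual series above also lends itself to simple trend columns. A minimal sketch, assuming the same `year` and `ev_share_pct` columns as above (the derived column names are illustrative, not mandated by this skill):

```python
import pandas as pd

df = pd.read_csv('data/ev_annual_bil10.csv')  # assumed columns: year, ev_share_pct
df = df.sort_values('year')

# Absolute and relative year-over-year change in EV share
df['yoy_change'] = df['ev_share_pct'].diff()
df['yoy_growth_pct'] = df['ev_share_pct'].pct_change() * 100

print(df.tail())
```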
### 4. Statistical Analysis

```python
from scipy import stats

# Correlation
corr, p_value = stats.pearsonr(x, y)

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# T-test
t_stat, p_value = stats.ttest_ind(group1, group2)
```
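The snippet above assumes `x`, `y`, `group1`, and `group2` already exist. A self-contained usage sketch with synthetic arrays (all data here is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = np.arange(30, dtype=float)
y = 2.0 * x + rng.normal(0, 3, size=30)  # noisy linear trend

corr, p_value = stats.pearsonr(x, y)
print(f"Pearson r = {corr:.3f} (p = {p_value:.2e})")

res = stats.linregress(x, y)
print(f"slope = {res.slope:.3f}, intercept = {res.intercept:.3f}, R² = {res.rvalue**2:.3f}")

group1 = rng.normal(10, 2, size=50)
group2 = rng.normal(11, 2, size=50)
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f} (p = {p_value:.4f})")
```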
## Script Organization Best Practices

### Directory Structure

```
dst_skills/
├── scripts/                  # Reusable utilities ONLY
│   ├── fetch_and_store.py
│   ├── db/
│   │   └── helpers.py
│   └── utils.py
│
├── data/                     # Raw data and databases
│   ├── dst.db
│   └── *.csv
│
└── reports/                  # Generated reports
    └── {topic}_{timestamp}/
        ├── report.html
        ├── visualizations.html
        ├── data/             # Report-specific intermediate data
        │   └── *.csv
        └── scripts/          # ⚠️ ALL analysis scripts go HERE
            ├── README.md
            ├── fit_models.py
            ├── validate.py
            └── requirements.txt
```

**IMPORTANT:** Do NOT create analysis scripts in the root `scripts/` directory.
All regression, modeling, and analysis scripts must be in the report's `scripts/` folder.
### When to Place Scripts in reports/{topic}/scripts/ ✅

#### ALWAYS for Analysis

Use this for ALL report-specific analysis:

- Regression modeling (curve_fit, forecasting, etc.)
- Statistical analysis (hypothesis tests, correlations, etc.)
- Data transformation specific to this report
- Validation and model comparison

Why this matters:

- **Reproducibility** - reader can re-run your exact analysis
- **Documentation** - shows exactly what was done
- **Versioning** - freezes code with report at time of publication

✅ ALL of these belong in `reports/{topic}/scripts/`:

- `fit_ev_models.py` - Regression modeling
- `validate_models.py` - Model validation
- `verify_regression_models.py` - scipy verification
- `forecast_scenarios.py` - Forecasting
- `statistical_tests.py` - Hypothesis testing
Example structure:

```
reports/elbiler_danmark_20251031/
├── report.html
├── visualizations.html
├── data/                            # Intermediate data for THIS analysis
│   ├── model_fits.csv
│   ├── forecasts.csv
│   └── residuals.csv
└── scripts/                         # ✅ ALL analysis scripts here
    ├── README.md                    # Explains how to reproduce
    ├── fit_ev_models.py             # Main regression analysis
    ├── validate_models.py           # Cross-validation
    ├── verify_regression_models.py  # scipy verification
    └── requirements.txt             # Dependencies snapshot
```
### When to Use scripts/ (Root Level) ⚠️ ONLY for Reusable Utilities

Root `scripts/` is ONLY for infrastructure utilities that are shared across ALL reports:

- Database utilities (`db/helpers.py`, `db/validate.py`)
- Data fetching (`fetch_and_store.py`)
- Generic helpers (`utils.py`)
- NOT for analysis - no regression, modeling, or statistics

❌ NEVER put these in root `scripts/`:

- Regression models
- Statistical analysis
- Data transformations
- Forecasting
- Model validation
✅ Root `scripts/` should ONLY contain:

```python
# scripts/db/helpers.py - OK (reusable DB utility)
def safe_numeric_cast(column_name):
    """Helper for casting DST suppressed values."""
    return f"CASE WHEN {column_name} != '..' THEN CAST({column_name} AS NUMERIC) ELSE NULL END"


# scripts/utils.py - OK (generic utility)
from datetime import datetime

def format_timestamp():
    """Standard timestamp format for filenames."""
    return datetime.now().strftime('%Y%m%d_%H%M%S')


# scripts/fetch_and_store.py - OK (reusable infrastructure)
def fetch_dst_table(table_id, filters):
    """Fetch data from DST API and store in DuckDB."""
    # ... implementation
```

If you're doing curve_fit, forecasting, or statistics → `reports/{topic}/scripts/` ✅
### Template: Report Analysis Script

```python
#!/usr/bin/env python3
"""
EV Adoption Model Fitting and Validation
=========================================

Report: Danmarks Elbilsudvikling 2050
Date: 2025-10-31
Author: Claude Code

Purpose:
    Fit multiple regression models to EV adoption data and compare.

Usage:
    cd reports/{report_name}/scripts/
    source ../../../.venv/bin/activate
    python fit_ev_models.py

Outputs:
    - ../data/model_parameters.csv
    - ../data/forecasts.csv
    - stdout: Model comparison table
"""
import sys
import os
import csv

import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score


def main():
    # 1. Load data using relative path from scripts/ directory
    script_dir = os.path.dirname(os.path.abspath(__file__))
    project_root = os.path.join(script_dir, '../../..')

    # Path to project-level data
    data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')
    print(f"Loading data from {data_path}...")

    years = []
    shares = []
    with open(data_path, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            years.append(int(row['year']))
            shares.append(float(row['ev_share_pct']))

    years = np.array(years)
    shares = np.array(shares)

    # 2. Fit models
    print("\nFitting models...")
    # ... implementation

    # 3. Save results to report's data/ directory
    output_dir = os.path.join(script_dir, '../data')
    os.makedirs(output_dir, exist_ok=True)

    output_path = os.path.join(output_dir, 'model_parameters.csv')
    print(f"\nSaving results to {output_path}...")
    # ... save implementation


if __name__ == '__main__':
    main()
```
**Key points:**

- Use `os.path` for cross-platform compatibility
- Always use relative paths from the script's location
- Project data: `../../../data/`
- Report data: `../data/`
- Activate venv before running
### README.md Template for Report Scripts

````markdown
# Analysis Scripts for EV Adoption Report

## Report Details

- **Topic:** Danmarks Elbilsudvikling til 2050
- **Generated:** 2025-10-31
- **Data:** BIL10, BIL52, BIL51 (Danmarks Statistik)

## Reproducibility

### Prerequisites

```bash
# From project root
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
```

### Run Analysis

```bash
cd reports/elbiler_danmark_20251031/scripts/
python fit_ev_models.py
python validate_models.py
```

## Scripts

- `fit_ev_models.py` - Fits logistic, Gompertz, exponential models
- `validate_models.py` - Cross-validation and residual analysis
- `export_forecasts.py` - Generate 2026-2050 predictions

## Outputs

Results saved to `../data/`:

- `model_parameters.csv` - Fitted parameters (L, k, t0)
- `forecasts.csv` - Year-by-year predictions
- `validation_metrics.csv` - R², RMSE, etc.

## Model Details

See `../report.html` Section 3: Methodology
````
## Common Pitfalls and Solutions

### 1. ModuleNotFoundError

**Problem:**

```
ModuleNotFoundError: No module named 'scipy'
```

**Solution:**

```bash
# Always activate venv first
source .venv/bin/activate
python scripts/your_script.py
```
### 2. curve_fit Fails to Converge

**Problem:**

```
OptimizeWarning: Covariance of the parameters could not be estimated
```

**Solutions:**

- Improve the initial guess `p0`
- Tighten bounds (e.g., L: [60, 90] instead of [50, 100])
- Increase `maxfev` to 20000
- Normalize/scale your data first
- Try different optimization methods

```python
# Better bounds
bounds = ([65, 0.3, 25], [95, 0.8, 40])  # Tighter

# Or use a different method
from scipy.optimize import minimize, differential_evolution
```
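Since the snippet only imports `differential_evolution`, here is a minimal sketch of one way to use it: run a global search to find a starting point, then refine with `curve_fit`. The `logistic`/`t`/`shares` names and bounds are carried over from the earlier example; the plain sum-of-squared-errors loss is an assumption, not something this skill prescribes.

```python
import numpy as np
from scipy.optimize import curve_fit, differential_evolution

# t, shares, logistic as in the regression example above

def sse(params):
    """Sum of squared errors for a candidate (L, k, t0) — assumed loss."""
    return np.sum((shares - logistic(t, *params)) ** 2)

# Global search over the parameter box, then local refinement
result = differential_evolution(sse, bounds=[(50, 100), (0.1, 2.0), (20, 50)], seed=42)
params, _ = curve_fit(logistic, t, shares, p0=result.x, maxfev=20000)
print(f"Refined parameters: {params}")
```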
### 3. Grid Search vs Optimization

**Bad (inefficient):**

```python
best_r2 = 0
for L in [70, 75, 80, 85, 90, 95]:
    for k in np.arange(0.1, 2.0, 0.05):
        # ... fit and compare
```

**Good (use scipy):**

```python
params, _ = curve_fit(logistic, t, shares, p0=[80, 0.5, 30])
```

**When grid search is acceptable:**

- Quick prototyping to find a good `p0` (see the sketch below this list)
- Testing specific scenarios (e.g., compare L=70% vs L=90%)
- Educational purposes
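A minimal sketch of that first case: a coarse grid only seeds `p0`, and `curve_fit` does the actual optimization. The grid values here are illustrative, and `logistic`/`t`/`shares` are taken from the earlier example.

```python
import itertools
import numpy as np
from scipy.optimize import curve_fit

# t, shares, logistic as in the regression example above

# Coarse grid used only to pick a starting point, never the final fit
candidates = itertools.product([70, 80, 90], [0.2, 0.5, 1.0], [20, 30, 40])
p0 = min(candidates, key=lambda p: np.sum((shares - logistic(t, *p)) ** 2))

params, _ = curve_fit(logistic, t, shares, p0=list(p0), maxfev=10000)
print(f"Seed {p0} -> fitted {params}")
```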
### 4. Overfitting

**Warning signs:**

- R² > 0.999 on historical data
- Model fits noise, not signal
- Poor performance on a holdout set

**Solutions:**

```python
# Train-test split (shuffle=False preserves time order)
from sklearn.model_selection import train_test_split

train_t, test_t, train_y, test_y = train_test_split(t, shares, test_size=0.2, shuffle=False)

# Fit on train, validate on test
params, _ = curve_fit(model, train_t, train_y)
test_pred = model(test_t, *params)
test_r2 = r2_score(test_y, test_pred)

if test_r2 < 0.9:
    print("⚠️ Warning: Poor generalization")
```
## Installation and Verification

### Check Installed Packages

```bash
source .venv/bin/activate
pip list | grep -E "(numpy|scipy|pandas|scikit)"
```

Expected output:

```
numpy         1.x.x
pandas        2.3.3
scikit-learn  1.7.2
scipy         1.16.3
```

### Verify scipy.optimize Works

```bash
source .venv/bin/activate
python -c "from scipy.optimize import curve_fit; print('✅ scipy.optimize available')"
```
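Beyond the import check, a quick end-to-end smoke test confirms fitting actually works. This one-off snippet (not part of the original checklist) fits a line to exact data and should recover slope 2 and intercept 1:

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.arange(10, dtype=float)
y = 2 * x + 1  # exact line, so the fit should be essentially perfect

params, _ = curve_fit(lambda x, a, b: a * x + b, x, y)
assert np.allclose(params, [2.0, 1.0]), params
print("✅ curve_fit smoke test passed:", params)
```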
### Install Missing Packages

```bash
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
```
## Integration with DST Skills Workflow

### Typical Workflow

1. **Discovery:** `/dst-discover` → Find tables
2. **Fetch:** `/dst-fetch` → Download data to `data/`
3. **Analysis:** `/dst-analyze` → SQL queries, basic calculations
4. **Modeling:** Create script in `reports/{topic}/scripts/` for regression
5. **Visualize:** `/dst-visualize` → Create charts from results
6. **Report:** `/dst-report` → Generate HTML with all findings
### Where Each Step Happens

| Step | Location | Examples |
|---|---|---|
| Data fetching | `data/` | `dst.db`, `*.csv` |
| SQL queries | Agent (ephemeral) | Aggregations, joins |
| Regression/modeling | `reports/{topic}/scripts/` ✅ | `curve_fit`, forecasting |
| Results | `reports/{topic}/data/` | `model_parameters.csv` |
| Report | `reports/{topic}/` | `report.html` |
## Example: Complete Regression Analysis

### Step 1: Create analysis script in report folder

**File:** `reports/elbiler_danmark_20251031/scripts/fit_logistic_model.py`

```python
#!/usr/bin/env python3
"""
Fit logistic regression to EV adoption data.

Usage:
    cd reports/elbiler_danmark_20251031/scripts/
    source ../../../.venv/bin/activate
    python fit_logistic_model.py
"""
import csv
import os

import numpy as np
from scipy.optimize import curve_fit


def main():
    # Load data from project data/
    script_dir = os.path.dirname(os.path.abspath(__file__))
    project_root = os.path.join(script_dir, '../../..')
    data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')

    # 1. Load data
    years = []
    shares = []
    with open(data_path, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            years.append(int(row['year']))
            shares.append(float(row['ev_share_pct']))

    years = np.array(years)
    shares = np.array(shares)
    t = years - years.min()

    # 2. Define and fit model
    def logistic(t, L, k, t0):
        return L / (1 + np.exp(-k * (t - t0)))

    params, _ = curve_fit(logistic, t, shares,
                          p0=[80, 0.5, 30],
                          bounds=([50, 0.1, 20], [100, 2.0, 50]))
    L, k, t0 = params

    # 3. Forecast
    future_years = np.arange(years.max() + 1, 2051)
    future_t = future_years - years.min()
    forecast = logistic(future_t, L, k, t0)

    # 4. Export to report's data/ folder
    output_dir = os.path.join(script_dir, '../data')
    os.makedirs(output_dir, exist_ok=True)

    output_path = os.path.join(output_dir, 'forecast.csv')
    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['year', 'predicted_share'])
        for year, pred in zip(future_years, forecast):
            writer.writerow([year, pred])

    print(f"✅ Forecast exported: {output_path}")
    print(f"   Model: L={L:.1f}%, k={k:.3f}, t0={t0:.1f}")


if __name__ == '__main__':
    main()
```
### Step 2: Run from report's scripts/ directory

```bash
cd reports/elbiler_danmark_20251031/scripts/
source ../../../.venv/bin/activate
python fit_logistic_model.py
```

### Step 3: Use results in visualization and report

The `forecast.csv` is now in `reports/elbiler_danmark_20251031/data/` and can be used by `/dst-visualize` and `/dst-report`.

✅ **Benefits of this approach:**

- Script stays with report (reproducibility)
- Relative paths work from any machine
- Clear separation: data fetching vs analysis vs reporting
- Easy to version control and share
## References

### Documentation

- scipy.optimize.curve_fit: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
- sklearn metrics: https://scikit-learn.org/stable/modules/model_evaluation.html
- pandas: https://pandas.pydata.org/docs/

### Regression Theory

- Logistic growth: Bass diffusion model, technology adoption
- Gompertz curve: Asymmetric S-curve for market saturation
- Model selection: AIC, BIC, cross-validation
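Since R² alone never penalizes extra parameters, an information criterion is a useful complement when comparing the models above. A minimal sketch of AIC from least-squares residuals; the Gaussian-likelihood form n·ln(SSE/n) + 2k is a standard simplification, not something this skill specifies:

```python
import numpy as np

def aic_from_residuals(y, y_pred, n_params):
    """AIC under a Gaussian error model: n * ln(SSE/n) + 2k. Lower is better."""
    y, y_pred = np.asarray(y), np.asarray(y_pred)
    n = len(y)
    sse = np.sum((y - y_pred) ** 2)
    return n * np.log(sse / n) + 2 * n_params

# Compare e.g. logistic vs gompertz fits (both have 3 parameters):
# aic_logistic = aic_from_residuals(shares, logistic(t, *params), 3)
```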
### Best Practices

- **Script placement:** ALWAYS put analysis scripts in `reports/{topic}/scripts/`
- **Validation:** Use a train-test split for model validation
- **Reporting:** Always report R², RMSE, and residual plots
- **Documentation:** Document assumptions and limitations in script docstrings
- **Reproducibility:** Version-control analysis scripts WITH the report they generate
- **Data paths:** Use relative paths with `os.path` for cross-platform compatibility
- **Virtual env:** Always activate `.venv` before running scipy/numpy code
## Quick Reference: Where Does It Go?

| What | Where | Example |
|---|---|---|
| Regression scripts | `reports/{topic}/scripts/` | `fit_models.py` |
| Validation scripts | `reports/{topic}/scripts/` | `verify_regression_models.py` |
| Forecasting scripts | `reports/{topic}/scripts/` | `forecast_scenarios.py` |
| Statistical tests | `reports/{topic}/scripts/` | `hypothesis_tests.py` |
| Intermediate results | `reports/{topic}/data/` | `model_parameters.csv` |
| Raw data | `data/` (project root) | `dst.db`, `ev_annual_bil10.csv` |
| Reusable utilities | `scripts/` (project root) | `db/helpers.py`, `fetch_and_store.py` |

**Simple rule:** If it uses scipy/curve_fit/statistics → `reports/{topic}/scripts/` ✅