Summarize Experiment
by niznik-dev
---
name: summarize-experiment
description: Create a lightweight summary of experiment results from a completed (fine-tuned and evaluated) experiment. Use after run-experiment to capture key metrics from the experiment in textual form.
---
Summarize Experiment
Generate a summary.md file capturing key metrics from a completed experiment. Think of it as R's `summary()` for experiment results.
Your Task
Create a lightweight summary of experiment results:
- Parse run status from experiment_summary.yaml
- Extract final training loss from SLURM stdout
- Extract accuracy from inspect-ai .eval files
- Generate summary.md in experiment directory
- Log the process in summarize-experiment.log
Prerequisites
- experiment_summary.yaml exists
- At least some runs have completed (partial results acceptable)
- run-experiment has been executed (or manual SLURM jobs run)
- Conda environment activated - the `parse_eval_log.py` script requires inspect-ai. Activate the conda environment from `claude.local.md` before running extraction commands.
Workflow
1. Locate Experiment
Find the experiment directory:
- If in an experiment directory (contains experiment_summary.yaml): use current directory
- Otherwise: ask user for path
2. Parse Run Status
Read experiment_summary.yaml to identify runs:
From the `runs:` section:
- `name`: Run identifier
- `type`: "fine-tuned" or "control"
- `model`: Model name
- `parameters`: Dict of hyperparameters (empty for control runs)

From the `evaluation.matrix:` section:
- `run`: Run name
- `tasks`: List of evaluation task names
- `epochs`: List of epochs to evaluate (null for control runs)
Determine status by checking the filesystem:
- Fine-tuning: check for `{output_base}/ck-out-{run_name}/` and SLURM outputs
- Evaluation: check for `{run_dir}/eval/logs/*.eval` files
3. Extract Training Loss
For each COMPLETED fine-tuning run:
- Find SLURM stdout in the output directory:
  - Parse the experiment_summary.yaml "Output" section for `output_dir_base`
  - Look in: `{output_dir_base}/ck-out-{run_name}/slurm-*.out`
  - If multiple files match, use the most recent by modification time
- Extract the final loss using the regex: `(\d+)\|(\d+)\|Loss: ([0-9.]+)`
  - The pattern matches lines of the form `{epoch}|{step}|Loss: {value}`
  - Take the LAST match to get the final loss
- Record: run_name, final_loss, epoch, step
Note: Training SLURM outputs are in the output directory, NOT the run directory.
If SLURM stdout missing:
- Log warning
- Record "N/A" for loss
- Continue with other runs
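The loss-extraction step above can be sketched as follows. The regex is the one given in this section; `final_loss` is a hypothetical helper.

```python
import re

# Matches training log lines of the form {epoch}|{step}|Loss: {value}
LOSS_RE = re.compile(r"(\d+)\|(\d+)\|Loss: ([0-9.]+)")

def final_loss(stdout_text: str):
    """Return (epoch, step, loss) from the LAST match, or None if absent."""
    matches = LOSS_RE.findall(stdout_text)
    if not matches:
        return None  # caller logs a warning and records "N/A"
    epoch, step, loss = matches[-1]
    return int(epoch), int(step), float(loss)
```

Returning `None` corresponds to the "N/A" fallback described above for missing SLURM stdout.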
4. Extract Evaluation Accuracy
For each COMPLETED evaluation:
- Find .eval files: `{run_dir}/eval/logs/*.eval`
- For each .eval file, run: `python tools/inspect/parse_eval_log.py {path}`
- Parse the JSON output for accuracy
- Map each result to its epoch using SLURM job names (see below)
- For binary tasks, also run `summary_binary.py` to get balanced accuracy and F1
- Record: run_name, task, epoch, accuracy, balanced_accuracy, f1, samples
Script output format:
```json
{
  "status": "success",
  "task": "capitalization",
  "accuracy": 0.85,
  "samples": 100,
  "scorer": "exact_match",
  "model": "..."
}
```
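Decoding the script's JSON output might look like this. This is a sketch, not the actual implementation: `decode_eval_output` and `parse_eval` are hypothetical helpers wrapping the real `parse_eval_log.py` call.

```python
import json
import subprocess

def decode_eval_output(stdout: str) -> dict:
    """Validate and decode one parse_eval_log.py JSON result."""
    result = json.loads(stdout)
    if result.get("status") != "success":
        # surfaces the error branch described under "If extraction fails"
        raise RuntimeError(result.get("message", "unknown extraction error"))
    return result

def parse_eval(path: str) -> dict:
    """Run parse_eval_log.py on a single .eval file (inspect-ai env required)."""
    proc = subprocess.run(
        ["python", "tools/inspect/parse_eval_log.py", path],
        capture_output=True, text=True, check=True,
    )
    return decode_eval_output(proc.stdout)
```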
Mapping Epochs via SLURM Job Names
The .eval files don't currently store epoch information directly. To reliably map each evaluation to its epoch:
- Find SLURM output files in the eval directory: `{run_dir}/eval/slurm-*.out`
- Extract job IDs from filenames (e.g., `slurm-2773062.out` → job ID 2773062)
- Query job names via sacct: `sacct -j {job_ids} --format=JobID,JobName%50`
- Parse the epoch from the job name; scaffold-inspect names jobs like `eval-{task}-{run}-ep{N}`:
  - `eval-general_eval-lowlr-ep0` → epoch 0
  - `eval-general_eval-lowlr-ep9` → epoch 9
- Extract accuracy from the SLURM output: `grep -oP 'match/accuracy: \K[0-9.]+' slurm-{jobid}.out`
Example workflow:
```shell
# Get job names for all eval jobs
sacct -j 2773062,2773063,2773065 --format=JobID,JobName%50

# Output shows the epoch in the job name:
# 2773062  eval-general_eval-lowlr-ep0
# 2773063  eval-general_eval-lowlr-ep1
# 2773065  eval-general_eval-lowlr-ep2
```
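Parsing the epoch out of sacct output can be sketched as below. These are hypothetical helpers; only the trailing `-ep{N}` suffix is relied on, since run names themselves may contain hyphens.

```python
import re

def epoch_from_job_name(job_name: str):
    """Return N from a job name ending in -ep{N}, else None."""
    m = re.search(r"-ep(\d+)$", job_name)
    return int(m.group(1)) if m else None

def job_epochs(sacct_output: str) -> dict:
    """Map job ID -> epoch from `sacct --format=JobID,JobName%50` rows."""
    mapping = {}
    for line in sacct_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            epoch = epoch_from_job_name(parts[1])
            if epoch is not None:
                mapping[parts[0]] = epoch
    return mapping
```

Rows whose job names don't end in `-ep{N}` (e.g., fine-tuning jobs) are simply skipped.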
This approach is reliable because:
- Job names are set by scaffold-inspect and include epoch info
- Works regardless of submission order or timing
- Survives job failures and resubmissions
If extraction fails:
- The script returns `{"status": "error", "message": "..."}`
- Log the error
- Record "ERROR" for accuracy
- Continue with other evaluations
Computing Balanced Accuracy and F1 (Binary Classification)
For binary classification tasks (0/1 targets), use summary_binary.py to compute additional metrics:
```shell
python tools/inspect/summary_binary.py {path_to_eval_file} --json
```
JSON output format:
```json
{
  "status": "success",
  "path": "/path/to/file.eval",
  "samples": 100,
  "accuracy": 0.85,
  "balanced_accuracy": 0.83,
  "f1": 0.82,
  "precision_1": 0.80,
  "recall_1": 0.84,
  "recall_0": 0.82,
  "confusion_matrix": {"tp": 42, "tn": 43, "fp": 7, "fn": 8, "other": 0}
}
```
Why these metrics matter for imbalanced data:
- Balanced Accuracy = (Recall_0 + Recall_1) / 2 — not inflated by majority class
- F1 Score = harmonic mean of precision and recall — penalizes class imbalance
Note: For non-binary tasks, only accuracy is reported (Bal. Acc and F1 shown as "-").
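Both metrics are straightforward to derive from the confusion matrix the script reports. This is a sketch of the arithmetic, not the script itself; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, balanced accuracy, and F1 from a binary confusion matrix."""
    recall_1 = tp / (tp + fn) if (tp + fn) else 0.0
    recall_0 = tn / (tn + fp) if (tn + fp) else 0.0
    precision_1 = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision_1 * recall_1 / (precision_1 + recall_1)
          if (precision_1 + recall_1) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (recall_0 + recall_1) / 2,
        "f1": f1,
    }
```

A degenerate classifier that predicts all 1s on a 90/10 split scores 0.9 accuracy but only 0.5 balanced accuracy, which is why both are reported.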
5. Generate summary.md
Create {experiment_dir}/summary.md with the following structure:
```markdown
# Experiment Summary

**Experiment:** `{experiment_name}` | **Generated:** {timestamp} | **Status:** {X}/{Y} complete

## Run Status

| Run | Type | Fine-tuning | Evaluation |
|-----|------|-------------|------------|
| rank4_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| rank8_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| base_model | Control | N/A | COMPLETED |

## Training Results

| Run | Final Loss | Epochs | Duration |
|-----|------------|--------|----------|
| rank4_lr1e-5 | 0.234 | 2 | 8m 15s |
| rank8_lr1e-5 | 0.198 | 2 | 9m 02s |

**Notes:**
- Base model runs have no training loss (control)
- Duration from SLURM elapsed time (if available)

## Evaluation Results

| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |
|-----|------|-------|----------|----------|------|---------|
| rank4_lr1e-5 | capitalization | 0 | 0.85 | 0.83 | 0.82 | 100 |
| rank4_lr1e-5 | capitalization | 1 | 0.88 | 0.86 | 0.85 | 100 |
| rank8_lr1e-5 | capitalization | 0 | 0.82 | 0.80 | 0.78 | 100 |
| rank8_lr1e-5 | capitalization | 1 | 0.91 | 0.89 | 0.88 | 100 |
| base_model | capitalization | - | 0.45 | 0.50 | 0.31 | 100 |

**Best performing:** rank8_lr1e-5 (epoch 1) with 89% balanced accuracy

## Incomplete Runs

| Run | Stage | Status | Notes |
|-----|-------|--------|-------|
| rank16_lr1e-5 | Fine-tuning | FAILED | Check slurm-12345.out |

## Next Steps

1. View detailed evaluation results: `inspect view --port=$(get_free_port)`
2. Export raw data: `inspect log export {run_dir}/eval/logs/*.eval --format csv`
3. Full analysis: `analyze-experiment` (when available)

---
*Generated by summarize-experiment skill*
```
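Rendering the extracted records into the evaluation table can be sketched as follows. `eval_table` is a hypothetical helper; non-binary tasks pass `None` for Bal. Acc and F1, which renders as "-".

```python
def eval_table(records) -> str:
    """Render (run, task, epoch, acc, bal_acc, f1, samples) rows as markdown."""
    def fmt(value):
        if value is None:
            return "-"
        if isinstance(value, float):
            return f"{value:.2f}"
        return str(value)

    lines = [
        "| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |",
        "|-----|------|-------|----------|----------|------|---------|",
    ]
    for record in records:
        lines.append("| " + " | ".join(fmt(v) for v in record) + " |")
    return "\n".join(lines)
```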
6. Create Log
Document the process in {experiment_dir}/summarize-experiment.log.
See logging.md for action types and format.
Error Handling
If SLURM stdout missing
- Log warning with action type `EXTRACT_LOSS`
- Record "N/A" for loss in summary
- Continue with other runs
If .eval file cannot be parsed
- Log error with file path
- Record "ERROR" for accuracy in summary
- Continue with other evaluations
If all runs failed
- Generate summary noting all failures
- Include failure states in "Incomplete Runs" section
- Suggest troubleshooting steps
If partial results
- Generate summary with available data
- Clearly indicate which runs are missing in "Incomplete Runs" section
- Still identify best performing run from available data
Idempotency
Running summarize-experiment multiple times overwrites summary.md. This is intentional:
- Allows re-running after fixing failed runs
- Summary always reflects current state
Output Files
```
{experiment_dir}/
├── summary.md                  # Human-readable summary (new)
└── summarize-experiment.log    # Process log (new)
```
Relationship to Other Skills
- After: run-experiment (or manual execution)
- Before: analyze-experiment (when available)
- Optional hook: run-experiment can invoke this at completion
Future Compatibility
When analyze-experiment is built, summarize-experiment can either:
- Remain as a quick summary option (text only, no plots)
- Be deprecated in favor of richer output
- Become a first stage that analyze-experiment builds upon