# Evaluation Design

by mberto10
```yaml
name: evaluation-design
description: Use this skill when the user needs to define evaluation metrics, select datasets, or design grading/annotation strategies for agent optimization. Provides a structured, decision-driven workflow and reusable templates.
```
A systematic skill for designing metrics, datasets, and grading/annotation strategies before running an optimization loop. This skill ensures evaluations are representative, measurable, and stable across iterations.
## When to Use
Use this skill when the user asks:
- “Which metrics should we track?”
- “What dataset should we use?”
- “How do we grade or annotate outputs?”
- “How should we set up evaluators or LLM judges?”
- “Design the evaluation plan for this agent.”
## Outcomes
By the end, you will have:
- A metrics matrix (primary, constraints, secondary)
- A dataset strategy with sourcing, size, and coverage rules
- A grading and annotation plan (human, LLM-judge, or hybrid)
- A ready-to-run evaluation spec to insert into the optimization journal
## Workflow

### Step 1: Define the Target Task
Confirm the agent’s intended behavior:
- Primary user intent(s)
- Expected output format
- Tools or external data sources used
- Critical failure modes (safety, hallucinations, compliance, etc.)
### Step 2: Build the Metrics Matrix

Use the template:

```yaml
metrics:
  primary:
    - name: <metric>
      definition: <what counts as success?>
      scale: <e.g., binary | 1-5 | percentage>
  constraints:
    - name: <metric>
      limit: <threshold>
      reason: <why this must not regress>
  secondary:
    - name: <metric>
      definition: <supporting metric>
```
Guidelines:
- Primary: one metric that represents “overall success.”
- Constraints: latency, cost, safety, policy compliance.
- Secondary: helpful but not required (helpfulness, readability, etc.).
Reference: references/metric-framework.md
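Before the metrics matrix feeds an optimization run, it helps to sanity-check it programmatically. A minimal Python sketch, assuming the matrix has been loaded into a dict mirroring the template above (the field names are the template's, not a required schema; `validate_metrics` is a hypothetical helper):

```python
def validate_metrics(metrics: dict) -> list[str]:
    """Return a list of problems; an empty list means the matrix is usable."""
    problems = []
    primary = metrics.get("primary", [])
    # Guideline above: exactly one metric represents "overall success."
    if len(primary) != 1:
        problems.append("exactly one primary metric is expected")
    for m in primary:
        if not m.get("definition"):
            problems.append(f"primary metric {m.get('name')!r} lacks a definition")
    # Every constraint must carry an explicit threshold to be enforceable.
    for c in metrics.get("constraints", []):
        if "limit" not in c:
            problems.append(f"constraint {c.get('name')!r} has no threshold")
    return problems

# Illustrative matrix, loosely based on the support-triage example below.
example = {
    "primary": [{"name": "resolution_accuracy",
                 "definition": "correct routing and resolution", "scale": "binary"}],
    "constraints": [{"name": "latency_p95", "limit": "4s",
                     "reason": "user-facing SLA"}],
    "secondary": [{"name": "tone", "definition": "polite, on-brand"}],
}
assert validate_metrics(example) == []
```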
### Step 3: Select the Dataset Strategy

Choose between:
- Production traces (high realism)
- Curated failures (high signal for improvement)
- Synthetic cases (edge coverage)
- Public benchmarks (comparability)

Use the dataset template:

```yaml
dataset:
  sources:
    - type: production | curated | synthetic | benchmark
      description: <where it comes from>
      count: <n>
  coverage:
    - category: <failure pattern or intent>
      target_count: <n>
  size_target: <total items>
  refresh_policy: <when to add or retire items>
```
Reference: references/dataset-strategy.md
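The coverage targets in the template are easy to check mechanically. A sketch, assuming each dataset item carries a `category` field matching the template's coverage categories (`coverage_gaps` is a hypothetical helper, not part of any library):

```python
from collections import Counter

def coverage_gaps(items: list[dict], coverage_targets: list[dict]) -> dict:
    """Return {category: shortfall} for every category below its target_count."""
    counts = Counter(item["category"] for item in items)
    return {c["category"]: c["target_count"] - counts[c["category"]]
            for c in coverage_targets
            if counts[c["category"]] < c["target_count"]}

items = [{"category": "refund"}, {"category": "refund"}, {"category": "login"}]
targets = [{"category": "refund", "target_count": 2},
           {"category": "login", "target_count": 3}]
# "refund" meets its target; "login" is two items short.
assert coverage_gaps(items, targets) == {"login": 2}
```

A non-empty result tells you which failure patterns or intents still need sourcing before the refresh_policy kicks in.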
### Step 4: Design Grading & Annotation

Pick a grading strategy:
- Rule-based (deterministic checks)
- LLM-as-judge (rubric-driven)
- Hybrid (rules for structure + LLM for quality)

Use the grading template:

```yaml
grading:
  type: rule | llm | hybrid
  rubric:
    - criterion: <name>
      description: <what good looks like>
      scale: <binary | 1-5>
  judges:
    - model: <judge model>
      prompt: <prompt name or path>
  bias_mitigations:
    - randomize_order
    - pairwise_comparison
  calibration:
    human_review_rate: <percentage>
    agreement_target: <e.g., 0.8>
```
Reference: references/grading-annotation.md
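The three pieces of the hybrid strategy — deterministic rules, a bias-mitigated judge call, and calibration — can be sketched in a few lines of Python. All names here are illustrative assumptions: `judge` stands in for whatever LLM-judge call you wire up, and the queue names are invented:

```python
import random

def rule_check(output: dict) -> bool:
    """Deterministic structural check: routing label must be an allowed queue."""
    return output.get("queue") in {"billing", "technical", "account"}

def judge_pair(judge, output_a: str, output_b: str) -> str:
    """Pairwise comparison with randomized presentation order (position-bias mitigation)."""
    first, second = random.sample([("a", output_a), ("b", output_b)], k=2)
    winner = judge(first[1], second[1])  # judge returns "first" or "second"
    return first[0] if winner == "first" else second[0]

def agreement_rate(human_labels: list, judge_labels: list) -> float:
    """Raw human/judge agreement on the reviewed sample; compare to agreement_target."""
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

assert rule_check({"queue": "billing"}) is True
assert agreement_rate([1, 1, 0, 1], [1, 0, 0, 1]) == 0.75
```

If `agreement_rate` on the human-reviewed slice falls below the calibration target, revise the rubric or judge prompt before trusting judge scores in the loop.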
### Step 5: Produce the Evaluation Spec

Create a compact spec that can be inserted into the optimization journal:

```yaml
evaluation_spec:
  metrics: <from Step 2>
  dataset: <from Step 3>
  grading: <from Step 4>
  baseline_run: "baseline"
```
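If the earlier steps produced plain dicts, assembling the spec is a one-liner; a sketch with a hypothetical `build_evaluation_spec` helper (the journal format itself is not prescribed here):

```python
import json

def build_evaluation_spec(metrics: dict, dataset: dict, grading: dict,
                          baseline_run: str = "baseline") -> dict:
    """Bundle the outputs of Steps 2-4 into the compact journal spec."""
    return {"evaluation_spec": {"metrics": metrics, "dataset": dataset,
                                "grading": grading, "baseline_run": baseline_run}}

spec = build_evaluation_spec({"primary": "resolution_accuracy"},
                             {"size_target": 100},
                             {"type": "hybrid"})
# Serialize for the journal; swap json for YAML if your journal is YAML-based.
print(json.dumps(spec, indent=2))
```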
## Example: Support Triage Agent
Reference example: references/example-support-triage.md
Summary:
- Primary: resolution accuracy
- Constraints: latency p95 < 4s, cost avg < $0.03, safety violations = 0
- Dataset: 60 production cases, 20 curated failures, 20 synthetic edge cases
- Grading: hybrid (rules for routing correctness + LLM judge for tone)
## Integration with Optimization Loop
Suggested integration points:
- Initialize: run this skill before establishing the baseline
- Hypothesize: ensure new metrics align with current hypothesis
Once complete, write the evaluation spec into the journal under the meta and baseline sections.
## Codex Integrations

Use these Codex skills to implement the evaluation plan:
- langfuse-dataset-setup for dataset and judge configuration
- langfuse-dataset-management for populating and curating datasets
- langfuse-prompt-management for judge prompt creation and updates
- langfuse-annotation-manager for human review workflows
## Checklist
Use this checklist before starting the optimization loop:
- Primary metric clearly defined and measurable
- Constraint metrics set with explicit thresholds
- Dataset sources chosen with coverage goals
- Grading strategy defined with calibration plan
- Evaluation spec ready for baseline run
## References

- references/metric-framework.md
- references/dataset-strategy.md
- references/grading-annotation.md
- references/example-support-triage.md
