Evaluation Design

by mberto10

Use this skill when the user needs to define evaluation metrics, select datasets, or design grading/annotation strategies for agent optimization. Provides a structured, decision-driven workflow and reusable templates.

A systematic skill for designing metrics, datasets, and grading/annotation strategies before running an optimization loop. This skill ensures evaluations are representative, measurable, and stable across iterations.

When to Use

Use this skill when the user asks:

  • “Which metrics should we track?”
  • “What dataset should we use?”
  • “How do we grade or annotate outputs?”
  • “How should we set up evaluators or LLM judges?”
  • “Design the evaluation plan for this agent.”

Outcomes

By the end, you will have:

  • A metrics matrix (primary, constraints, secondary)
  • A dataset strategy with sourcing, size, and coverage rules
  • A grading and annotation plan (human, LLM-judge, or hybrid)
  • A ready-to-run evaluation spec to insert into the optimization journal

Workflow

Step 1: Define the Target Task

Confirm the agent’s intended behavior:

  • Primary user intent(s)
  • Expected output format
  • Tools or external data sources used
  • Critical failure modes (safety, hallucinations, compliance, etc.)

Step 2: Build the Metrics Matrix

Use the template:

metrics:
  primary:
    - name: <metric>
      definition: <what counts as success?>
      scale: <e.g., binary | 1-5 | percentage>
  constraints:
    - name: <metric>
      limit: <threshold>
      reason: <why this must not regress>
  secondary:
    - name: <metric>
      definition: <supporting metric>

Guidelines:

  • Primary: one metric that represents “overall success.”
  • Constraints: latency, cost, safety, policy compliance.
  • Secondary: informative but not gating (helpfulness, readability, etc.).

Reference: references/metric-framework.md
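
The guidelines above can be checked mechanically before the matrix goes into the journal. The following is a minimal sketch, assuming the YAML template has already been loaded into a plain Python dict; the function name `validate_metrics_matrix` and the specific checks are illustrative and not part of the skill itself.

```python
def validate_metrics_matrix(matrix: dict) -> list[str]:
    """Return a list of problems found in a metrics matrix (empty if none)."""
    problems = []

    # Guideline: exactly one primary metric that represents overall success.
    primary = matrix.get("primary", [])
    if len(primary) != 1:
        problems.append(f"expected exactly one primary metric, found {len(primary)}")
    for metric in primary:
        if not metric.get("definition"):
            problems.append(f"primary metric '{metric.get('name')}' has no definition")

    # Guideline: every constraint needs an explicit threshold.
    for constraint in matrix.get("constraints", []):
        if constraint.get("limit") in (None, ""):
            problems.append(f"constraint '{constraint.get('name')}' has no limit")

    # Metric names should be unique across all three groups.
    names = [
        m["name"]
        for group in ("primary", "constraints", "secondary")
        for m in matrix.get(group, [])
        if "name" in m
    ]
    for name in sorted({n for n in names if names.count(n) > 1}):
        problems.append(f"metric name '{name}' appears more than once")

    return problems


# Example matrix (hypothetical values) mirroring the template's shape.
example_matrix = {
    "primary": [
        {"name": "resolution_accuracy", "definition": "ticket resolved correctly", "scale": "binary"}
    ],
    "constraints": [{"name": "latency_p95", "limit": "< 4s", "reason": "user-facing"}],
    "secondary": [{"name": "readability", "definition": "judge score 1-5"}],
}
print(validate_metrics_matrix(example_matrix))
```

An empty list means the matrix satisfies these structural checks; it says nothing about whether the chosen metrics are the right ones.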

Step 3: Select the Dataset Strategy

Choose one or more sources:

  • Production traces (high realism)
  • Curated failures (high signal for improvement)
  • Synthetic cases (edge coverage)
  • Public benchmarks (comparability)

Use the dataset template:

dataset:
  sources:
    - type: production | curated | synthetic | benchmark
      description: <where it comes from>
      count: <n>
  coverage:
    - category: <failure pattern or intent>
      target_count: <n>
  size_target: <total items>
  refresh_policy: <when to add or retire items>

Reference: references/dataset-strategy.md
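
Coverage rules are easier to enforce when the gaps are computed rather than eyeballed. The sketch below assumes each dataset item is a dict carrying a `category` field that matches the `coverage` entries in the template; that field name and the `coverage_gaps` helper are assumptions made for illustration.

```python
from collections import Counter


def coverage_gaps(items: list[dict], coverage: list[dict], size_target: int) -> dict:
    """Return how many items are still missing per category and in total."""
    counts = Counter(item["category"] for item in items)
    gaps = {}
    for rule in coverage:
        missing = rule["target_count"] - counts.get(rule["category"], 0)
        if missing > 0:
            gaps[rule["category"]] = missing
    # "__total__" is a reserved key in this sketch for the overall size shortfall.
    if len(items) < size_target:
        gaps["__total__"] = size_target - len(items)
    return gaps


# Hypothetical dataset: three refund cases, one escalation case.
items = [
    {"category": "refund", "source": "production"},
    {"category": "refund", "source": "production"},
    {"category": "refund", "source": "curated"},
    {"category": "escalation", "source": "synthetic"},
]
coverage = [
    {"category": "refund", "target_count": 3},
    {"category": "escalation", "target_count": 2},
]
print(coverage_gaps(items, coverage, size_target=5))
```

Running a check like this whenever the refresh policy adds or retires items keeps the coverage targets from drifting silently.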

Step 4: Design Grading & Annotation

Pick a grading strategy:

  • Rule-based (deterministic checks)
  • LLM-as-judge (rubric-driven)
  • Hybrid (rules for structure + LLM for quality)
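
As an illustration of the hybrid option, the sketch below runs a deterministic structure check first and only calls the judge when that check passes. The JSON shape with a `route` key, the `hybrid_grade` function, and the `stub_judge` stand-in are all hypothetical; a real setup would replace the stub with a call to whichever judge model is configured.

```python
import json
from typing import Callable


def rule_check(output: str) -> bool:
    """Deterministic check: output must be a JSON object with a 'route' key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "route" in parsed


def hybrid_grade(output: str, judge: Callable[[str], int]) -> dict:
    """Apply the rule check, then ask the judge for a quality score if it passes."""
    if not rule_check(output):
        # Skip the judge entirely: a structurally invalid output cannot pass.
        return {"passed_rules": False, "quality": None}
    return {"passed_rules": True, "quality": judge(output)}


def stub_judge(output: str) -> int:
    """Placeholder for an LLM judge; always returns 4 on a 1-5 scale."""
    return 4


print(hybrid_grade('{"route": "billing"}', stub_judge))
print(hybrid_grade("not json", stub_judge))
```

Gating the judge behind the rule check keeps judge calls (and their cost) limited to outputs that are at least well-formed.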

Use the grading template:

grading:
  type: rule | llm | hybrid
  rubric:
    - criterion: <name>
      description: <what good looks like>
      scale: <binary | 1-5>
  judges:
    - model: <judge model>
      prompt: <prompt name or path>
      bias_mitigations:
        - randomize_order
        - pairwise_comparison
  calibration:
    human_review_rate: <percentage>
    agreement_target: <e.g., 0.8>

Reference: references/grading-annotation.md
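
The `agreement_target` in the calibration block needs a concrete agreement measure. The template does not say which one is intended, so the sketch below shows two common choices, raw agreement and Cohen's kappa, computed over the human-reviewed sample. The function names and the sample labels are illustrative.

```python
from collections import Counter


def agreement_rate(human: list, judge: list) -> float:
    """Fraction of items where the human label equals the judge label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list, judge: list) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical binary labels for ten human-reviewed items.
human_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
judge_labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(agreement_rate(human_labels, judge_labels))
print(cohens_kappa(human_labels, judge_labels))
```

Raw agreement is easier to explain, while kappa is stricter when one label dominates; whichever is used, the spec should state it next to the target value.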

Step 5: Produce the Evaluation Spec

Create a compact spec that can be inserted into the optimization journal:

evaluation_spec:
  metrics: <from Step 2>
  dataset: <from Step 3>
  grading: <from Step 4>
  baseline_run: "baseline"
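
If the three sections are held as Python dicts, the spec can be assembled with a small helper that refuses to produce an incomplete result. This is a sketch only; `build_evaluation_spec` is a hypothetical name and the journal format itself is not defined here.

```python
import json


def build_evaluation_spec(metrics: dict, dataset: dict, grading: dict,
                          baseline_run: str = "baseline") -> dict:
    """Combine the outputs of Steps 2-4 into a single evaluation spec."""
    sections = {"metrics": metrics, "dataset": dataset, "grading": grading}
    missing = [name for name, value in sections.items() if not value]
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")
    return {"evaluation_spec": {**sections, "baseline_run": baseline_run}}


# Hypothetical minimal inputs; real ones come from the templates above.
spec = build_evaluation_spec(
    metrics={"primary": [{"name": "resolution_accuracy"}]},
    dataset={"size_target": 100},
    grading={"type": "hybrid"},
)
print(json.dumps(spec, indent=2))
```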

Example: Support Triage Agent

Reference example: references/example-support-triage.md

Summary:

  • Primary: resolution accuracy
  • Constraints: latency p95 < 4s, cost avg < $0.03, safety violations = 0
  • Dataset: 60 production cases, 20 curated failures, 20 synthetic edge cases
  • Grading: hybrid (rules for routing correctness + LLM judge for tone)
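
The three constraints in this summary can be evaluated directly from per-run records. The sketch below assumes each run is a dict with `latency_s`, `cost_usd`, and `safety_violation` fields (hypothetical names) and uses the nearest-rank method for p95, which is one of several percentile conventions.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_constraints(runs: list[dict]) -> dict:
    """Evaluate the example's constraints; True means the constraint holds."""
    latency_p95 = percentile_nearest_rank([r["latency_s"] for r in runs], 95)
    cost_avg = sum(r["cost_usd"] for r in runs) / len(runs)
    safety_violations = sum(1 for r in runs if r["safety_violation"])
    return {
        "latency_p95": latency_p95 < 4.0,
        "cost_avg": cost_avg < 0.03,
        "safety_violations": safety_violations == 0,
    }


# Hypothetical runs: 19 fast runs and one slow outlier, no safety violations.
runs = [{"latency_s": 1.0, "cost_usd": 0.02, "safety_violation": False} for _ in range(19)]
runs.append({"latency_s": 5.0, "cost_usd": 0.02, "safety_violation": False})
print(check_constraints(runs))
```

With 20 runs, a single slow outlier does not break the p95 limit under nearest-rank, but a second one does, which is worth keeping in mind when the dataset is small.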

Integration with Optimization Loop

Suggested integration points:

  • Initialize: run this skill before establishing the baseline
  • Hypothesize: ensure any new metrics align with the current hypothesis

Once complete, write the evaluation spec into the journal under the meta and baseline sections.


Codex Integrations

Use these Codex skills to implement the evaluation plan:

  • langfuse-dataset-setup for dataset and judge configuration
  • langfuse-dataset-management for populating and curating datasets
  • langfuse-prompt-management for judge prompt creation and updates
  • langfuse-annotation-manager for human review workflows

Checklist

Use this checklist before starting the optimization loop:

  • Primary metric clearly defined and measurable
  • Constraint metrics set with explicit thresholds
  • Dataset sources chosen with coverage goals
  • Grading strategy defined with calibration plan
  • Evaluation spec ready for baseline run

References

  • references/metric-framework.md
  • references/dataset-strategy.md
  • references/grading-annotation.md
  • references/example-support-triage.md

Skill Information

Category: Creative
Last Updated: 1/11/2026