Prompt Evaluation

by pluginagentmarketplace

Prompt testing, metrics, and A/B testing frameworks

---
name: prompt-evaluation
description: Prompt testing, metrics, and A/B testing frameworks
sasmp_version: "1.3.0"
bonded_agent: 06-evaluation-testing-agent
bond_type: PRIMARY_BOND
---

Prompt Evaluation Skill

Bonded to: evaluation-testing-agent


Quick Start

Skill("custom-plugin-prompt-engineering:prompt-evaluation")

Parameter Schema

parameters:
  evaluation_type:
    type: enum
    values: [accuracy, consistency, robustness, efficiency, ab_test]
    required: true

  test_cases:
    type: array
    items:
      input: string
      expected: string|object
      category: string
    min_items: 5

  metrics:
    type: array
    items: string
    default: [accuracy, consistency]
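A minimal sketch of validating a parameters dict against the schema above. The function name `validate_params` and the error-list return style are assumptions, not part of the skill's API:

```python
# Hypothetical validator for the parameter schema above.
EVALUATION_TYPES = {"accuracy", "consistency", "robustness", "efficiency", "ab_test"}

def validate_params(params: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the params are valid."""
    errors = []
    if params.get("evaluation_type") not in EVALUATION_TYPES:
        errors.append("evaluation_type must be one of: " + ", ".join(sorted(EVALUATION_TYPES)))
    test_cases = params.get("test_cases", [])
    if len(test_cases) < 5:
        errors.append("test_cases requires at least 5 items (min_items: 5)")
    for i, tc in enumerate(test_cases):
        if "input" not in tc or "expected" not in tc:
            errors.append(f"test_cases[{i}] is missing input or expected")
    return errors
```

Collecting all errors rather than failing on the first makes test-suite setup problems easier to fix in one pass.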

Evaluation Metrics

| Metric | Formula | Target |
| --- | --- | --- |
| Accuracy | correct / total | > 0.90 |
| Consistency | identical_runs / total_runs | > 0.85 |
| Robustness | edge_cases_passed / edge_cases_total | > 0.80 |
| Format Compliance | valid_format / total | > 0.95 |
| Token Efficiency | quality_score / tokens_used | Maximize |
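The ratio metrics above can be computed directly from per-test results. The record fields (`correct`, `format_valid`, `is_edge_case`, `passed`) are assumed names for illustration:

```python
def compute_metrics(results: list[dict]) -> dict:
    """Compute ratio metrics from the table above.

    Each result record is assumed to carry boolean flags:
    'correct', 'format_valid', and for edge cases 'is_edge_case' / 'passed'.
    """
    total = len(results)
    edge = [r for r in results if r.get("is_edge_case")]
    return {
        "accuracy": sum(r["correct"] for r in results) / total,
        "format_compliance": sum(r["format_valid"] for r in results) / total,
        # Robustness is undefined when the suite contains no edge cases.
        "robustness": (sum(r["passed"] for r in edge) / len(edge)) if edge else None,
    }
```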

Test Case Framework

Test Case Schema

test_case:
  id: "TC001"
  category: "happy_path|edge_case|adversarial|regression"
  description: "What this test verifies"
  priority: "critical|high|medium|low"

  input:
    user_message: "Test input text"
    context: {}  # Optional additional context

  expected:
    contains: ["required", "phrases"]
    not_contains: ["forbidden", "content"]
    format: "json|text|markdown"
    schema: {}  # Optional JSON schema

  evaluation:
    metrics: ["accuracy", "format_compliance"]
    threshold: 0.9
    timeout: 30
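A sketch of checking one model output against the `expected` block of the schema above, assuming string containment for `contains`/`not_contains` and a parse check for JSON format (semantic-similarity matching would replace the containment checks):

```python
import json

def check_expected(output: str, expected: dict) -> bool:
    """Check one model output against a test case's 'expected' block."""
    # Every required phrase must appear in the output.
    if any(p not in output for p in expected.get("contains", [])):
        return False
    # No forbidden phrase may appear.
    if any(p in output for p in expected.get("not_contains", [])):
        return False
    # Format check: JSON must parse (text/markdown are accepted as-is here).
    if expected.get("format") == "json":
        try:
            json.loads(output)
        except ValueError:
            return False
    return True
```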

Test Categories

categories:
  happy_path:
    description: "Standard expected inputs"
    coverage_target: 40%
    examples:
      - typical_user_query
      - common_use_case
      - expected_format

  edge_case:
    description: "Boundary conditions"
    coverage_target: 25%
    examples:
      - empty_input
      - very_long_input
      - special_characters
      - unicode_text
      - minimal_input

  adversarial:
    description: "Attempts to break prompt"
    coverage_target: 20%
    examples:
      - injection_attempts
      - conflicting_instructions
      - ambiguous_requests
      - malformed_input

  regression:
    description: "Previously failed cases"
    coverage_target: 15%
    examples:
      - fixed_bugs
      - known_edge_cases
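The coverage targets above can be enforced mechanically. This sketch flags under-covered categories; the category keys follow the singular names from the test-case schema, and the 5% tolerance is an assumed default:

```python
from collections import Counter

# Targets mirror the category block above (assumed keys; adjust to your taxonomy).
COVERAGE_TARGETS = {"happy_path": 0.40, "edge_case": 0.25, "adversarial": 0.20, "regression": 0.15}

def coverage_gaps(test_cases: list[dict], tolerance: float = 0.05) -> dict:
    """Return {category: (actual_share, target)} for each under-covered category."""
    counts = Counter(tc["category"] for tc in test_cases)
    total = len(test_cases)
    return {
        cat: (counts.get(cat, 0) / total, target)
        for cat, target in COVERAGE_TARGETS.items()
        if counts.get(cat, 0) / total < target - tolerance
    }
```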

Scoring Rubric

accuracy_rubric:
  1.0: "Exact match to expected output"
  0.8: "Semantically equivalent, minor differences"
  0.5: "Partially correct, major elements present"
  0.2: "Related but incorrect"
  0.0: "Completely wrong or off-topic"

consistency_rubric:
  1.0: "Identical across all runs"
  0.8: "Minor variations (punctuation, formatting)"
  0.5: "Significant variations, same meaning"
  0.2: "Different approaches, inconsistent"
  0.0: "Completely different each time"

robustness_rubric:
  1.0: "Handles all edge cases correctly"
  0.8: "Fails gracefully with clear error messages"
  0.5: "Some edge cases cause issues"
  0.2: "Many edge cases fail"
  0.0: "Breaks on most edge cases"
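The top band of the consistency rubric (identical across runs) corresponds to the `identical_runs / total_runs` formula, which can be sketched as the fraction of runs matching the most common output; scoring the lower bands (semantic equivalence) would require a similarity model on top of this:

```python
from collections import Counter

def consistency_score(outputs: list[str]) -> float:
    """Fraction of runs identical to the most common output."""
    if not outputs:
        return 0.0
    _, count = Counter(outputs).most_common(1)[0]
    return count / len(outputs)
```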

A/B Testing Framework

ab_test_config:
  name: "Prompt Variant Comparison"
  hypothesis: "New prompt improves accuracy by 10%"

  variants:
    control:
      prompt: "{original_prompt}"
      allocation: 50%
    treatment:
      prompt: "{new_prompt}"
      allocation: 50%

  metrics:
    primary:
      - name: accuracy
        min_improvement: 0.05
        significance: 0.05
    secondary:
      - token_count
      - response_time
      - user_satisfaction

  sample:
    minimum: 100
    power: 0.8

  stopping_rules:
    - condition: significant_regression
      action: stop_treatment
    - condition: clear_winner
      threshold: 0.99
      action: early_stop
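The `significance: 0.05` check on a pass/fail metric like accuracy can be sketched as a two-proportion z-test using only the standard library (a real deployment might use `statsmodels` or sequential testing to honor the stopping rules):

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value comparing control (a) and treatment (b) success rates."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (success_b / n_b - success_a / n_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))
```

Reject the null (declare a winner) when the p-value falls below the configured significance level, provided the minimum sample size has been reached.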

Evaluation Report Template

report:
  metadata:
    prompt_name: "string"
    version: "string"
    date: "ISO8601"
    evaluator: "string"

  summary:
    overall_score: 0.87
    status: "PASS|FAIL|REVIEW"
    recommendation: "string"

  metrics:
    accuracy: 0.92
    consistency: 0.88
    robustness: 0.79
    format_compliance: 0.95

  test_results:
    total: 50
    passed: 43
    failed: 7
    pass_rate: 0.86

  failures:
    - id: "TC023"
      category: "edge_case"
      issue: "Description of failure"
      severity: "high|medium|low"
      input: "..."
      expected: "..."
      actual: "..."

  recommendations:
    - priority: high
      action: "Specific improvement"
    - priority: medium
      action: "Another improvement"
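The summary block of the report can be derived from per-test results. The status thresholds here (PASS at 0.9, REVIEW within 0.1 below it) are illustrative assumptions, not values mandated by the template:

```python
def build_summary(results: list[dict], threshold: float = 0.9) -> dict:
    """Aggregate per-test results into the report's summary/test_results fields."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / total if total else 0.0
    if pass_rate >= threshold:
        status = "PASS"
    elif pass_rate >= threshold - 0.1:
        status = "REVIEW"
    else:
        status = "FAIL"
    return {"total": total, "passed": passed, "failed": total - passed,
            "pass_rate": pass_rate, "status": status}
```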

Evaluation Workflow

workflow:
  1_prepare:
    - Load prompt under test
    - Load test cases
    - Configure metrics

  2_execute:
    - Run all test cases
    - Record outputs
    - Measure latency

  3_score:
    - Compare against expected
    - Calculate metrics
    - Identify patterns

  4_analyze:
    - Categorize failures
    - Find root causes
    - Prioritize issues

  5_report:
    - Generate report
    - Make recommendations
    - Archive results
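The five steps above can be sketched as a single driver. The `run_fn` and `score_fn` hooks are assumed injection points for the model call and the rubric scoring; the real skill presumably wires these internally:

```python
def run_evaluation(prompt: str, test_cases: list[dict], run_fn, score_fn) -> dict:
    """Sketch of the workflow: prepare, execute, score, analyze, report."""
    results = []
    for case in test_cases:                                 # 2_execute
        output = run_fn(prompt, case)
        score = score_fn(output, case)                      # 3_score
        results.append({"case": case, "output": output, "score": score})
    failures = [r for r in results if r["score"] < 1.0]     # 4_analyze
    return {                                                # 5_report
        "total": len(results),
        "passed": len(results) - len(failures),
        "failures": failures,
    }
```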

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| All tests fail | Wrong expected outputs | Review test design |
| Flaky tests | Non-determinism | Lower temperature |
| False positives | Lenient matching | Tighten criteria |
| False negatives | Strict matching | Use semantic similarity |
| Slow evaluation | Too many tests | Sample strategically |

Integration

integrates_with:
  - prompt-optimization: Improvement feedback loop
  - all skills: Quality gate before deployment

automation:
  ci_cd: "Run on prompt changes"
  scheduled: "Weekly regression"
  triggered: "On demand"

References

See references/GUIDE.md for evaluation methodology. See scripts/helper.py for automation utilities.

Skill Information

Category: Technical
Last Updated: 12/31/2025