# Prompt Evaluation

by pluginagentmarketplace

Prompt testing, metrics, and A/B testing frameworks.
---
name: prompt-evaluation
description: Prompt testing, metrics, and A/B testing frameworks
sasmp_version: "1.3.0"
bonded_agent: 06-evaluation-testing-agent
bond_type: PRIMARY_BOND
---
# Prompt Evaluation Skill

**Bonded to:** evaluation-testing-agent

## Quick Start

```
Skill("custom-plugin-prompt-engineering:prompt-evaluation")
```
Parameter Schema
parameters:
evaluation_type:
type: enum
values: [accuracy, consistency, robustness, efficiency, ab_test]
required: true
test_cases:
type: array
items:
input: string
expected: string|object
category: string
min_items: 5
metrics:
type: array
items: string
default: [accuracy, consistency]
## Evaluation Metrics
| Metric | Formula | Target |
|---|---|---|
| Accuracy | correct / total | > 0.90 |
| Consistency | identical_runs / total_runs | > 0.85 |
| Robustness | edge_cases_passed / edge_cases_total | > 0.80 |
| Format Compliance | valid_format / total | > 0.95 |
| Token Efficiency | quality_score / tokens_used | Maximize |
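The formulas above can be sketched directly in Python. This is a minimal illustration of the arithmetic only; the record shapes and function names are assumptions, not part of this skill's API.

```python
from collections import Counter

def accuracy(results):
    """correct / total over records like {"correct": bool}."""
    return sum(r["correct"] for r in results) / len(results)

def consistency(runs):
    """identical_runs / total_runs: share of repeated runs that
    exactly match the most common output."""
    most_common_count = Counter(runs).most_common(1)[0][1]
    return most_common_count / len(runs)

def token_efficiency(quality_score, tokens_used):
    """quality_score / tokens_used; higher is better."""
    return quality_score / tokens_used

results = [{"correct": True}] * 9 + [{"correct": False}]
print(accuracy(results))                  # 0.9
print(consistency(["a", "a", "a", "b"]))  # 0.75
```

Note that this `consistency` treats any character difference as a distinct run; the rubric below relaxes that for minor formatting variations.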
## Test Case Framework

### Test Case Schema

```yaml
test_case:
  id: "TC001"
  category: "happy_path|edge_case|adversarial|regression"
  description: "What this test verifies"
  priority: "critical|high|medium|low"
  input:
    user_message: "Test input text"
    context: {}  # Optional additional context
  expected:
    contains: ["required", "phrases"]
    not_contains: ["forbidden", "content"]
    format: "json|text|markdown"
    schema: {}  # Optional JSON schema
  evaluation:
    metrics: ["accuracy", "format_compliance"]
    threshold: 0.9
    timeout: 30
```
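A checker for the `expected` block of the schema might look like the sketch below. It covers only the `contains`, `not_contains`, and JSON `format` checks; the function name is illustrative, and real runners would also enforce `schema`, `threshold`, and `timeout`.

```python
import json

def check_output(output: str, expected: dict) -> bool:
    """Apply the contains / not_contains / format checks from a test case."""
    if any(p not in output for p in expected.get("contains", [])):
        return False
    if any(p in output for p in expected.get("not_contains", [])):
        return False
    if expected.get("format") == "json":
        try:
            json.loads(output)
        except ValueError:
            return False
    return True

expected = {"contains": ["status"], "not_contains": ["error"], "format": "json"}
print(check_output('{"status": "ok"}', expected))  # True
print(check_output("status: error", expected))     # False
```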
### Test Categories

```yaml
categories:
  happy_path:
    description: "Standard expected inputs"
    coverage_target: 40%
    examples:
      - typical_user_query
      - common_use_case
      - expected_format
  edge_cases:
    description: "Boundary conditions"
    coverage_target: 25%
    examples:
      - empty_input
      - very_long_input
      - special_characters
      - unicode_text
      - minimal_input
  adversarial:
    description: "Attempts to break the prompt"
    coverage_target: 20%
    examples:
      - injection_attempts
      - conflicting_instructions
      - ambiguous_requests
      - malformed_input
  regression:
    description: "Previously failed cases"
    coverage_target: 15%
    examples:
      - fixed_bugs
      - known_edge_cases
```
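One way to enforce these coverage targets is a linter over the suite, sketched below. The function name and the 5% tolerance are assumptions; the target percentages come from the category definitions above.

```python
from collections import Counter

# coverage_target values from the category definitions above
TARGETS = {"happy_path": 0.40, "edge_cases": 0.25,
           "adversarial": 0.20, "regression": 0.15}

def coverage_gaps(suite, tolerance=0.05):
    """Return categories whose share of the suite falls more than
    `tolerance` below their coverage target."""
    counts = Counter(tc["category"] for tc in suite)
    total = len(suite)
    return {cat: target for cat, target in TARGETS.items()
            if counts.get(cat, 0) / total < target - tolerance}

suite = ([{"category": "happy_path"}] * 8
         + [{"category": "edge_cases"}] * 5
         + [{"category": "adversarial"}] * 4
         + [{"category": "regression"}] * 3)
print(coverage_gaps(suite))  # {} -- this mix meets every target
```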
## Scoring Rubric

```yaml
accuracy_rubric:
  1.0: "Exact match to expected output"
  0.8: "Semantically equivalent, minor differences"
  0.5: "Partially correct, major elements present"
  0.2: "Related but incorrect"
  0.0: "Completely wrong or off-topic"

consistency_rubric:
  1.0: "Identical across all runs"
  0.8: "Minor variations (punctuation, formatting)"
  0.5: "Significant variations, same meaning"
  0.2: "Different approaches, inconsistent"
  0.0: "Completely different each time"

robustness_rubric:
  1.0: "Handles all edge cases correctly"
  0.8: "Fails gracefully with clear error messages"
  0.5: "Some edge cases cause issues"
  0.2: "Many edge cases fail"
  0.0: "Breaks on most edge cases"
```
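One way to mechanize the accuracy rubric is to band a similarity measure into the discrete rubric scores. The sketch below uses `difflib` string similarity only because it is in the standard library; the cutoffs are assumptions, and a real grader would use semantic similarity or an LLM judge for the 0.8 "semantically equivalent" band.

```python
from difflib import SequenceMatcher

# (similarity cutoff, rubric score) pairs -- cutoffs are assumptions
BANDS = [(1.00, 1.0), (0.90, 0.8), (0.60, 0.5), (0.30, 0.2)]

def rubric_score(actual: str, expected: str) -> float:
    """Map a string-similarity ratio onto the accuracy rubric bands."""
    sim = SequenceMatcher(None, actual, expected).ratio()
    for cutoff, score in BANDS:
        if sim >= cutoff:
            return score
    return 0.0

print(rubric_score("hello world", "hello world"))  # 1.0
print(rubric_score("xyz", "abc"))                  # 0.0
```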
## A/B Testing Framework

```yaml
ab_test_config:
  name: "Prompt Variant Comparison"
  hypothesis: "New prompt improves accuracy by 10%"
  variants:
    control:
      prompt: "{original_prompt}"
      allocation: 50%
    treatment:
      prompt: "{new_prompt}"
      allocation: 50%
  metrics:
    primary:
      - name: accuracy
        min_improvement: 0.05
        significance: 0.05
    secondary:
      - token_count
      - response_time
      - user_satisfaction
  sample:
    minimum: 100
    power: 0.8
  stopping_rules:
    - condition: significant_regression
      action: stop_treatment
    - condition: clear_winner
      threshold: 0.99
      action: early_stop
```
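The significance check behind this config can be sketched as a pooled two-proportion z-test on per-variant accuracy, using only the standard library. This is a minimal illustration of the statistics, not this framework's implementation; production tests should also account for the configured power and stopping rules (e.g., via sequential-testing corrections).

```python
from math import erf, sqrt

def two_proportion_p(success_a, n_a, success_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (success_b / n_b - success_a / n_a) / se
    # two-sided p-value from the normal CDF, via erf
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# control: 85/100 correct, treatment: 94/100 correct
p = two_proportion_p(85, 100, 94, 100)
print(p < 0.05)  # True: the lift clears the 0.05 significance bar
```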
## Evaluation Report Template

```yaml
report:
  metadata:
    prompt_name: "string"
    version: "string"
    date: "ISO8601"
    evaluator: "string"
  summary:
    overall_score: 0.87
    status: "PASS|FAIL|REVIEW"
    recommendation: "string"
  metrics:
    accuracy: 0.92
    consistency: 0.88
    robustness: 0.79
    format_compliance: 0.95
  test_results:
    total: 50
    passed: 43
    failed: 7
    pass_rate: 0.86
  failures:
    - id: "TC023"
      category: "edge_case"
      issue: "Description of failure"
      severity: "high|medium|low"
      input: "..."
      expected: "..."
      actual: "..."
  recommendations:
    - priority: high
      action: "Specific improvement"
    - priority: medium
      action: "Another improvement"
```
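The `summary` fields can be derived from the raw test counts, as in the sketch below. The PASS/REVIEW/FAIL band thresholds are assumptions chosen for illustration, not values defined by this skill.

```python
def summarize(total, passed, review_band=(0.80, 0.90)):
    """Derive pass_rate and status; the band thresholds are assumptions."""
    rate = passed / total
    low, high = review_band
    status = "PASS" if rate >= high else "REVIEW" if rate >= low else "FAIL"
    return {"pass_rate": round(rate, 2), "status": status}

print(summarize(50, 43))  # {'pass_rate': 0.86, 'status': 'REVIEW'}
```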
## Evaluation Workflow

```yaml
workflow:
  1_prepare:
    - Load the prompt under test
    - Load test cases
    - Configure metrics
  2_execute:
    - Run all test cases
    - Record outputs
    - Measure latency
  3_score:
    - Compare against expected outputs
    - Calculate metrics
    - Identify patterns
  4_analyze:
    - Categorize failures
    - Find root causes
    - Prioritize issues
  5_report:
    - Generate the report
    - Make recommendations
    - Archive results
```
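The five steps above can be sketched as one pipeline function. `run_prompt` and `score_case` are stand-ins for whatever executes the prompt and grades an output; everything else here is illustrative scaffolding, not this skill's API.

```python
def evaluate(prompt, test_cases, run_prompt, score_case, threshold=1.0):
    """prepare -> execute -> score -> analyze -> report, minimally."""
    # 2_execute: run every case and record outputs
    outputs = [run_prompt(prompt, tc["input"]) for tc in test_cases]
    # 3_score: compare each output to its expectation
    scores = [score_case(out, tc["expected"])
              for out, tc in zip(outputs, test_cases)]
    # 4_analyze: collect failing cases for categorization
    failures = [tc for tc, s in zip(test_cases, scores) if s < threshold]
    # 5_report: summarize pass rate and surface the failures
    return {"pass_rate": 1 - len(failures) / len(test_cases),
            "failures": failures}

report = evaluate(
    "prompt text",
    [{"input": "hi", "expected": "hi", "category": "happy_path"},
     {"input": "bye", "expected": "farewell", "category": "edge_case"}],
    run_prompt=lambda prompt, text: text,        # stand-in model call
    score_case=lambda out, exp: float(out == exp),
)
print(report["pass_rate"])  # 0.5
```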
## Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| All tests fail | Wrong expected outputs | Review test design |
| Flaky tests | Non-determinism | Lower temperature |
| False positives | Lenient matching | Tighten criteria |
| False negatives | Strict matching | Use semantic similarity |
| Slow evaluation | Too many tests | Sample strategically |
## Integration

```yaml
integrates_with:
  - prompt-optimization: Improvement feedback loop
  - "all skills": Quality gate before deployment

automation:
  ci_cd: "Run on prompt changes"
  scheduled: "Weekly regression suite"
  triggered: "On demand"
```
## References

- See `references/GUIDE.md` for evaluation methodology.
- See `scripts/helper.py` for automation utilities.
## Related Skills

### Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

### Senior Data Scientist
World-class data science skill for statistical modeling, experimentation, causal inference, and advanced analytics. Expertise in Python (NumPy, Pandas, Scikit-learn), R, SQL, statistical methods, A/B testing, time series, and business intelligence. Includes experiment design, feature engineering, model evaluation, and stakeholder communication. Use when designing experiments, building predictive models, performing causal analysis, or driving data-driven decisions.

### Hypogenic
Automated hypothesis generation and testing using large language models. Use this skill when generating scientific hypotheses from datasets, combining literature insights with empirical data, testing hypotheses against observational data, or conducting systematic hypothesis exploration for research discovery in domains like deception detection, AI content detection, mental health analysis, or other empirical research tasks.

### Ux Researcher Designer
UX research and design toolkit for Senior UX Designer/Researcher including data-driven persona generation, journey mapping, usability testing frameworks, and research synthesis. Use for user research, persona creation, journey mapping, and design validation.

### Hypogenic
Automated LLM-driven hypothesis generation and testing on tabular datasets. Use when you want to systematically explore hypotheses about patterns in empirical data (e.g., deception detection, content analysis). Combines literature insights with data-driven hypothesis testing. For manual hypothesis formulation use hypothesis-generation; for creative ideation use scientific-brainstorming.

### Data Engineering Data Driven Feature
Build features guided by data insights, A/B testing, and continuous measurement using specialized agents for analysis, implementation, and experimentation.
### Dashboard Design
USE THIS SKILL FIRST when user wants to create and design a dashboard, ESPECIALLY Vizro dashboards. This skill enforces a 3-step workflow (requirements, layout, visualization) that must be followed before implementation. For implementation and testing, use the dashboard-build skill after completing Steps 1-3.
### Performance Testing
Benchmark indicator performance with BenchmarkDotNet. Use for Series/Buffer/Stream benchmarks, regression detection, and optimization patterns. Target 1.5x Series for StreamHub, 1.2x for BufferList.
