Quality Audit

by grahama1970

name: quality-audit
description: >
  Stratified quality sampling and statistical validation for LLM outputs.
  Works with ANY project's extraction results. Calculates confidence intervals,
  chi-square tests, and stratified sampling by configurable dimensions.
allowed-tools: ["Bash", "Read", "Write"]
triggers:
  - quality-audit
  - quality audit
  - audit quality
  - stratified sample
  - statistical validation
  - llm quality check
  - extraction quality
metadata:
  short-description: Statistical quality validation for LLM extraction results

Quality Audit Skill

Stratified quality sampling and statistical validation for LLM extraction results. Works with any project - not tied to specific extraction tasks.

Key Features

  1. Stratified Sampling: Sample by any dimension (framework, source, confidence)
  2. Statistical Rigor: Chi-square tests, 95% confidence intervals
  3. Generic Interface: Works with any JSON/JSONL extraction results
  4. UltraThink Mode: High-token reasoning for difficult edge cases
  5. Quality Gates: Configurable thresholds for pass/fail

Quick Start

cd /home/graham/workspace/experiments/pi-mono/.pi/skills/quality-audit

# Sample from extraction results
./run.sh sample --input results.jsonl --stratify framework --samples-per-stratum 5

# Audit samples with LLM verification
./run.sh audit --samples samples.json --threshold 0.85

# Generate full quality report
./run.sh report --input results.jsonl --output quality_report.md

# UltraThink mode for difficult cases (uses more tokens)
./run.sh audit --samples samples.json --ultrathink

Commands

sample - Stratified Sampling

./run.sh sample --input results.jsonl --stratify framework --samples-per-stratum 5

# Options:
#   --input FILE            Input JSON/JSONL with extraction results
#   --stratify DIMENSION    Dimension to stratify by (framework, source, confidence)
#   --samples-per-stratum N Number of samples per stratum (default: 5)
#   --seed INT              Random seed for reproducibility (default: 42)
#   --output FILE           Output file for samples (default: samples.json)
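
As an illustration of what seeded, per-stratum sampling involves, here is a minimal Python sketch. It is not the skill's actual implementation: the function name `sample_by_stratum`, its signature, and the toy records are assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_by_stratum(records, key, samples_per_stratum=5, seed=42):
    """Group records by key(record) and draw up to N per group without replacement."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    samples = {}
    for name in sorted(strata):  # fixed iteration order keeps the draw reproducible
        group = strata[name]
        k = min(samples_per_stratum, len(group))  # small strata are taken whole
        samples[name] = rng.sample(group, k)
    return samples


# Hypothetical toy data: 8 D3FEND records and 3 NIST records.
records = [
    {"id": i, "metadata": {"framework": fw}}
    for i, fw in enumerate(["D3FEND"] * 8 + ["NIST"] * 3)
]
picked = sample_by_stratum(records, key=lambda r: r["metadata"]["framework"])
print({name: len(rows) for name, rows in picked.items()})
# {'D3FEND': 5, 'NIST': 3}
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions means the same seed and input always yield the same sample, which is what makes an audit repeatable.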

audit - Quality Verification

./run.sh audit --samples samples.json --threshold 0.85

# Options:
#   --samples FILE          Sampled cases to audit
#   --threshold FLOAT       Minimum accuracy to pass (default: 0.85)
#   --model NAME            LLM model for verification (default: from env)
#   --ultrathink            Enable UltraThink mode (more tokens, deeper reasoning)
#   --human                 Generate review document for human verification
#   --output FILE           Output audit results

report - Full Quality Report

./run.sh report --input results.jsonl --output quality_report.md

# Options:
#   --input FILE            Full extraction results
#   --output FILE           Report output path
#   --samples-per-stratum N Samples for statistical estimation (default: 10)
#   --include-chi-square    Include chi-square agreement test

Stratification Dimensions

The skill supports stratifying by any field in your data. Common dimensions:

| Dimension | Example Values | Use Case |
|-----------|----------------|----------|
| framework | D3FEND, ATT&CK, NIST, CWE | Validate coverage across frameworks |
| source | deterministic, llm, keyword_fallback | Compare extraction methods |
| confidence | low (0-0.5), med (0.5-0.8), high | Focus on uncertain cases |
| collection | controls, techniques, mitigations | Multi-collection validation |

Custom Stratification

For custom fields:

./run.sh sample --input results.jsonl --stratify "metadata.worksheet_type" --samples-per-stratum 3
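
A dotted name such as `metadata.worksheet_type` suggests a nested lookup. The sketch below shows one way such a path could be resolved. The helper `get_field` and the sample value `"controls"` are hypothetical, and the skill may resolve paths differently.

```python
def get_field(record, dotted_path, default=None):
    """Walk a nested dict along a dot-separated path, returning default if any step is missing."""
    current = record
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# Hypothetical record with a custom metadata field.
record = {"id": "D3-FEV", "metadata": {"worksheet_type": "controls"}}
print(get_field(record, "metadata.worksheet_type"))              # controls
print(get_field(record, "metadata.missing", default="unknown"))  # unknown
```

Returning a default instead of raising lets records without the field fall into a single catch-all stratum rather than aborting the sample.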

Statistical Tests

95% Confidence Intervals

For each stratum and overall:

Overall Accuracy: 87.5% +/- 4.2% (95% CI)

Calculated as: p +/- 1.96 * sqrt(p * (1-p) / n), where p is the observed accuracy and n is the number of audited samples (normal approximation).
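
The formula above is the normal-approximation (Wald) interval. A small Python sketch of that calculation follows. The function name `wald_interval` is an assumption for this example, not part of the skill.

```python
import math


def wald_interval(correct, n, z=1.96):
    """Return (p, margin, low, high) for a normal-approximation 95% interval."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    margin = z * math.sqrt(p * (1 - p) / n)
    # Clip to [0, 1] because an accuracy cannot fall outside that range.
    return p, margin, max(0.0, p - margin), min(1.0, p + margin)


p, margin, low, high = wald_interval(correct=8, n=10)
print(f"{p:.1%} +/- {margin:.1%}")
# 80.0% +/- 24.8%
```

With small strata, or accuracy near 0% or 100%, this approximation is unreliable: a stratum with 10 of 10 correct gets a margin of exactly zero. A Wilson score interval is a common alternative in those cases.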

Chi-Square Agreement Test

Compare LLM results vs deterministic mappings:

./run.sh report --input results.jsonl --include-chi-square

Output:

Chi-Square Agreement Test:
  - Null hypothesis: LLM and deterministic labels are independent (any agreement is due to chance)
  - Chi-square statistic: 45.2
  - p-value: < 0.001
  - Interpretation: Strong agreement (not random)
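
To make the test concrete, here is a self-contained sketch of a Pearson chi-square test on a 2x2 table, where rows are the LLM's yes/no label and columns are the deterministic mapping's yes/no label. The counts are invented for illustration. How the skill builds its contingency table is not specified here, so treat this layout as an assumption.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: both methods agree on 80 of 100 records.
table = [[40, 10], [10, 40]]
stat, p_value = chi_square_2x2(table)
print(f"chi-square = {stat:.1f}, p < 0.001: {p_value < 0.001}")
# chi-square = 36.0, p < 0.001: True
```

A small p-value rejects independence, meaning the two labelings are associated. It does not by itself measure how often they agree, so it is worth reading alongside the raw agreement rate. The sketch also assumes every row and column total is non-zero.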

Sample Size Calculation

Calculate required samples for target precision:

./run.sh sample-size --target-precision 0.05 --expected-accuracy 0.85
# Output: Need 196 samples for +/- 5% precision at 85% expected accuracy
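
The figure of 196 follows from solving the normal-approximation margin e = 1.96 * sqrt(p * (1-p) / n) for n. A short sketch of that arithmetic is below. The function name `required_sample_size` is an assumption for illustration.

```python
import math


def required_sample_size(target_precision, expected_accuracy, z=1.96):
    """Smallest n whose normal-approximation margin is at most target_precision."""
    p = expected_accuracy
    return math.ceil(z**2 * p * (1 - p) / target_precision**2)


print(required_sample_size(0.05, 0.85))
# 196
```

Because the required n scales with 1 / precision squared, halving the margin roughly quadruples the number of samples to review.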

UltraThink Mode

For difficult edge cases, enable UltraThink mode, which:

  1. Uses an extended thinking budget (more tokens)
  2. Requires an explicit reasoning chain before the verdict
  3. Requests multiple verification passes
  4. Aims for higher confidence in the final verdict

./run.sh audit --samples samples.json --ultrathink --model deepseek

# UltraThink prompt includes:
# "Take your time and think through this carefully. Consider:
#  1. What is the primary function of this control?
#  2. Which taxonomy tier best captures its essence?
#  3. Are there edge cases or ambiguities?
#  4. Confidence in your assessment?
#  Reason through each step before giving your final answer."

Input Format

The skill accepts JSON or JSONL with extraction results. Required fields:

{
  "id": "D3-FEV",
  "input": {
    "name": "File Eviction",
    "description": "Removes files from system"
  },
  "output": {
    "conceptual": ["Precision", "Resilience"],
    "tactical": ["Evict", "Restore"]
  },
  "metadata": {
    "source": "llm",
    "confidence": 0.9,
    "framework": "D3FEND"
  }
}

Field names can be remapped with --id-field, --output-field, and --metadata-field.
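
For readers preparing their own input, the following sketch shows one way to read either format and check for the fields shown above. The helper names `parse_records` and `missing_fields` are hypothetical, and the skill's own loader may behave differently.

```python
import json

REQUIRED_FIELDS = ("id", "input", "output", "metadata")


def parse_records(text):
    """Parse a JSON document (array or single object) or JSONL (one object per line)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document, so treat it as JSONL.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def missing_fields(record):
    """List the required top-level fields that a record lacks."""
    return [name for name in REQUIRED_FIELDS if name not in record]


jsonl = (
    '{"id": "D3-FEV", "input": {}, "output": {}, "metadata": {"framework": "D3FEND"}}\n'
    '{"id": "X-1", "input": {}}\n'
)
for record in parse_records(jsonl):
    print(record["id"], missing_fields(record))
# D3-FEV []
# X-1 ['output', 'metadata']
```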

Output Format

Sample Output (samples.json)

{
  "created": "2026-01-29T10:00:00Z",
  "seed": 42,
  "stratify_by": "framework",
  "samples_per_stratum": 5,
  "strata": {
    "D3FEND": [...],
    "ATT&CK": [...],
    "NIST": [...],
    "CWE": [...]
  }
}

Audit Output (audit_results.json)

{
  "timestamp": "2026-01-29T10:00:00Z",
  "model": "deepseek",
  "ultrathink": false,
  "results": {
    "overall_accuracy": 0.875,
    "confidence_interval_95": "87.5% +/- 4.2%",
    "per_stratum": {
      "D3FEND": {"sampled": 5, "correct": 5, "accuracy": 1.0},
      "ATT&CK": {"sampled": 5, "correct": 4, "accuracy": 0.8}
    },
    "chi_square": {
      "statistic": 45.2,
      "p_value": 0.0001,
      "conclusion": "Strong agreement"
    }
  },
  "quality_gate": {
    "threshold": 0.85,
    "passed": true
  }
}

Report Output (quality_report.md)

# Quality Audit Report

## Summary
- **Total Records**: 4011
- **Sampled**: 40 (10 per stratum)
- **Overall Accuracy**: 87.5% +/- 10.2%
- **Quality Gate**: PASS (threshold: 85%)

## Per-Stratum Results
| Stratum | Sampled | Correct | Accuracy | 95% CI |
|---------|---------|---------|----------|--------|
| D3FEND  | 10      | 10      | 100%     | +/- 0% |
| ATT&CK  | 10      | 8       | 80%      | +/- 24.8% |
...

## Chi-Square Agreement Test
...

## Recommendations
...

Integration with Projects

SPARTA Integration

from quality_audit import stratified_sample, audit_samples, quality_report

# Sample from DuckDB results
samples = stratified_sample(
    input_source=conn,  # DuckDB connection
    query="SELECT * FROM bridge_tag_results",
    stratify_by="framework",
    samples_per_stratum=10,
)

# Audit with LLM
results = audit_samples(samples, model="deepseek", ultrathink=True)

# Generate report
report = quality_report(results, threshold=0.85)

Generic JSON/JSONL

# From any JSON extraction results
./run.sh sample --input extractions.jsonl --stratify source --samples-per-stratum 5
./run.sh audit --samples samples.json --threshold 0.85

Configuration

Environment variables:

# LLM for verification (optional - uses configured model)
QUALITY_AUDIT_MODEL=deepseek

# Default threshold
QUALITY_AUDIT_THRESHOLD=0.85

# UltraThink token budget
QUALITY_AUDIT_ULTRATHINK_TOKENS=4096
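
A script wrapping the skill might read these variables as sketched below. The function `load_config` is hypothetical. The 0.85 fallback mirrors the documented --threshold default, while falling back to 4096 tokens is an assumption based only on the example value above.

```python
import os


def load_config(env=os.environ):
    """Read quality-audit settings from an environment mapping, with fallbacks."""
    return {
        # None means "use whatever model is configured elsewhere".
        "model": env.get("QUALITY_AUDIT_MODEL"),
        "threshold": float(env.get("QUALITY_AUDIT_THRESHOLD", "0.85")),
        # Assumed fallback; the source only shows 4096 as an example value.
        "ultrathink_tokens": int(env.get("QUALITY_AUDIT_ULTRATHINK_TOKENS", "4096")),
    }


print(load_config({"QUALITY_AUDIT_MODEL": "deepseek", "QUALITY_AUDIT_THRESHOLD": "0.9"}))
# {'model': 'deepseek', 'threshold': 0.9, 'ultrathink_tokens': 4096}
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.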

Quality Gates for CI/CD

# Exit code 0 = pass, 1 = fail
./run.sh audit --samples samples.json --threshold 0.85

# Use in pipelines
if ./run.sh audit --samples samples.json --threshold 0.85; then
    echo "Quality gate passed"
else
    echo "Quality gate failed - accuracy below threshold"
    exit 1
fi
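
The same gate can be evaluated from a saved audit file, for example in a pipeline step that runs after the audit. This sketch assumes the audit_results.json layout shown under Output Format and assumes the gate passes when accuracy is greater than or equal to the threshold. The function `gate_exit_code` is hypothetical.

```python
import json


def gate_exit_code(audit, threshold=None):
    """Return 0 if overall accuracy meets the threshold, otherwise 1."""
    limit = audit["quality_gate"]["threshold"] if threshold is None else threshold
    return 0 if audit["results"]["overall_accuracy"] >= limit else 1


# Trimmed-down stand-in for the contents of audit_results.json.
audit = json.loads(
    '{"results": {"overall_accuracy": 0.875},'
    ' "quality_gate": {"threshold": 0.85, "passed": true}}'
)
print(gate_exit_code(audit))                 # 0
print(gate_exit_code(audit, threshold=0.9))  # 1
```

In a real script, the return value would be passed to `sys.exit()` so the pipeline sees the same 0/1 convention as `./run.sh audit`.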

Use Cases

  1. SPARTA Bridge Tags: Validate taxonomy extraction quality
  2. QRA Generation: Verify question-answer pairs are accurate
  3. Document Extraction: Check text extraction quality
  4. Classification Tasks: Any LLM classification with ground truth
  5. A/B Testing: Compare extraction approaches statistically

Skill Information

Category: Skill
Allowed Tools: ["Bash", "Read", "Write"]
Last Updated: 1/29/2026