Langfuse Dataset Management

by mberto10


This skill should be used when the user asks to "create dataset", "add trace to dataset", "curate regression tests", "build test set from traces", "list datasets", "show dataset items", or needs to manage Langfuse datasets for experiment validation and regression testing.



Create and manage regression test datasets from production traces for validation and testing.

When to Use

  • Curating failing traces into regression datasets
  • Building golden test sets from high-quality examples
  • Adding specific traces to existing datasets
  • Listing available datasets and their items
  • Preparing data for validation testing

Naming Convention

Recommended format: {project}_{purpose} or {workflow}_{purpose}

Examples:

  • checkout_regressions - Failing traces for checkout flow
  • api_v2_golden_set - High-quality verified outputs
  • auth_edge_cases - Edge cases for authentication workflow
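A simple check for the recommended convention can be sketched as follows. This validator (a hypothetical helper, not part of the skill) accepts lowercase, underscore-separated names with at least two segments:

```python
import re

# Matches the recommended {project}_{purpose} style: lowercase
# alphanumeric segments joined by underscores, two or more segments.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")

def is_valid_dataset_name(name: str) -> bool:
    return bool(NAME_PATTERN.match(name))

print(is_valid_dataset_name("checkout_regressions"))  # True
print(is_valid_dataset_name("api_v2_golden_set"))     # True
print(is_valid_dataset_name("CheckoutRegressions"))   # False
```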

Operations

Create Dataset

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  create \
  --name "checkout_regressions" \
  --description "Failing traces for checkout flow issues" \
  --metadata '{"project": "checkout", "purpose": "regression"}'

Add Single Trace

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  add-trace \
  --dataset "checkout_regressions" \
  --trace-id abc123def456 \
  --expected-score 9.0

Add with Custom Expected Output

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  add-trace \
  --dataset "checkout_regressions" \
  --trace-id abc123def456 \
  --expected-output '{"min_score": 9.0, "required_fields": ["summary", "recommendations"]}'
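An expected-output spec like the one above could later be evaluated against an actual run roughly like this. The `meets_expectations` helper is a sketch of the idea, not part of the CLI:

```python
def meets_expectations(output: dict, score: float, expected: dict) -> bool:
    """Check an actual output and score against an expected-output spec
    such as {"min_score": 9.0, "required_fields": ["summary"]}."""
    if score < expected.get("min_score", float("-inf")):
        return False
    missing = [f for f in expected.get("required_fields", []) if f not in output]
    return not missing

expected = {"min_score": 9.0, "required_fields": ["summary", "recommendations"]}
output = {"summary": "Checkout flow report", "recommendations": ["retry logic"]}
print(meets_expectations(output, 9.2, expected))  # True
print(meets_expectations(output, 8.5, expected))  # False (below min_score)
```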

Add Multiple Traces (Batch)

# Create file with trace IDs (one per line)
echo "trace_id_1
trace_id_2
trace_id_3" > failing_traces.txt

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  add-batch \
  --dataset "checkout_regressions" \
  --trace-file failing_traces.txt \
  --expected-score 9.0
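The batch command reads one trace ID per line. Its parsing presumably looks something like the sketch below (stripping whitespace and skipping blank lines); the helper's actual behavior may differ:

```python
def read_trace_ids(text: str) -> list[str]:
    """Parse trace-file contents: one trace ID per line,
    skipping blank lines and stripping surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]

content = "trace_id_1\ntrace_id_2\n\n  trace_id_3\n"
print(read_trace_ids(content))  # ['trace_id_1', 'trace_id_2', 'trace_id_3']
```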

List All Datasets

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py list

Get Dataset Items

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  get \
  --name "checkout_regressions"

Python SDK Note

When using the Langfuse Python SDK directly (not via CLI), use the correct method for adding items:

from langfuse import Langfuse
lf = Langfuse()

# Correct: use lf.create_dataset_item()
lf.create_dataset_item(
    dataset_name="checkout_regressions",
    input={"query": "example input"},
    expected_output={"min_score": 9.0},
    metadata={"source_trace_id": "abc123"}
)

# Incorrect: dataset.create_item() does not exist in the SDK
# dataset = lf.get_dataset("checkout_regressions")
# dataset.create_item(...)  # ← This will fail!

Key difference: The SDK method is lf.create_dataset_item() with dataset_name as a parameter, not dataset.create_item() on a dataset object.

Dataset Item Structure

When adding a trace to a dataset, the tool extracts:

Input (from trace): The trace's input data merged with its metadata. All fields from the original trace are preserved.

Expected Output (from arguments):

{
  "min_score": 9.0
}

Or custom expectations:

{
  "min_score": 8.5,
  "required_fields": ["summary", "recommendations"]
}

Metadata (automatic):

{
  "source_trace_id": "abc123",
  "added_date": "2025-12-19",
  "original_score": 6.2
}
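Putting the three parts together, an item could be assembled roughly as follows. This is a sketch using the field names from the examples above; `build_item` is hypothetical, not the tool's actual code:

```python
from datetime import date

def build_item(trace_input: dict, trace_metadata: dict,
               trace_id: str, original_score: float,
               min_score: float) -> dict:
    """Assemble a dataset item: trace input merged with its metadata,
    an expected-output spec, and provenance metadata."""
    return {
        "input": {**trace_input, **trace_metadata},
        "expected_output": {"min_score": min_score},
        "metadata": {
            "source_trace_id": trace_id,
            "added_date": date.today().isoformat(),
            "original_score": original_score,
        },
    }

item = build_item({"query": "checkout fails"}, {"env": "prod"},
                  "abc123", 6.2, 9.0)
print(item["expected_output"])  # {'min_score': 9.0}
```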

Common Workflows

Workflow 1: Create Regression Dataset from Failing Traces

  1. Find failing traces (using the data-retrieval skill):

python3 ${CLAUDE_PLUGIN_ROOT}/skills/data-retrieval/helpers/trace_retriever.py \
  --last 20 --max-score 7.0 --mode minimal

  2. Create the dataset:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  create \
  --name "checkout_regressions" \
  --description "Failing traces for checkout fixes"

  3. Extract trace IDs from the step 1 output and save them to a file.

  4. Add the traces to the dataset:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  add-batch \
  --dataset "checkout_regressions" \
  --trace-file failing_ids.txt \
  --expected-score 9.0

Workflow 2: Build Golden Test Set

  1. Find high-quality traces:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/data-retrieval/helpers/trace_retriever.py \
  --last 10 --min-score 9.0 --mode minimal

  2. Create the golden set dataset:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  create \
  --name "api_golden_set" \
  --description "Verified high-quality outputs for baseline"

  3. Add the traces:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  add-batch \
  --dataset "api_golden_set" \
  --trace-file golden_ids.txt \
  --expected-score 9.0

Workflow 3: Add Specific Failing Trace

When you identify a specific failure during investigation:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/dataset-management/helpers/dataset_manager.py \
  add-trace \
  --dataset "checkout_regressions" \
  --trace-id problematic_trace_id_here \
  --expected-score 9.0 \
  --failure-reason "Payment processing timeout"

Required Environment Variables

Same as the data-retrieval skill:

LANGFUSE_PUBLIC_KEY=pk-...    # Required
LANGFUSE_SECRET_KEY=sk-...    # Required
LANGFUSE_HOST=https://cloud.langfuse.com  # Optional
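A quick sanity check before running any command can be sketched as follows; the variable names are taken from the list above, and the helper itself is hypothetical:

```python
import os

REQUIRED = ["LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"]

def missing_langfuse_vars(env=os.environ) -> list[str]:
    """Return the required Langfuse variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

missing = missing_langfuse_vars()
if missing:
    print(f"Missing required variables: {', '.join(missing)}")
```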

Troubleshooting

Dataset already exists:

  • Use a different name or delete the existing dataset from Langfuse UI

Trace not found:

  • Verify the trace ID is correct
  • Check that the trace is within the retention period

Rate limiting:

  • When adding many traces, the tool may hit API rate limits
  • Consider adding traces in smaller batches
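One way to stay under rate limits when adding many items yourself (e.g. via the Python SDK) is to chunk the IDs and pause between chunks. A minimal sketch, where the batch size and delay are guesses and `add_one` stands in for the real per-item call:

```python
import time
from typing import Callable, Iterable

def add_in_batches(trace_ids: Iterable[str], add_one: Callable[[str], None],
                   batch_size: int = 10, delay_s: float = 1.0) -> int:
    """Call add_one for each trace ID, sleeping delay_s seconds
    between batches of batch_size to ease API rate-limit pressure."""
    added = 0
    batch: list[str] = []
    for tid in trace_ids:
        batch.append(tid)
        if len(batch) == batch_size:
            for t in batch:
                add_one(t)
                added += 1
            batch = []
            time.sleep(delay_s)
    for t in batch:  # final partial batch
        add_one(t)
        added += 1
    return added

count = add_in_batches([f"trace_{i}" for i in range(5)], lambda t: None,
                       batch_size=2, delay_s=0.0)
print(count)  # 5
```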


Skill Information

Category: Technical
Last Updated: 1/13/2026