Integrate

by synaptiai

Combine heterogeneous data sources into a unified model with conflict resolution, schema alignment, and provenance tracking. Use when merging data from multiple systems, consolidating information, or building comprehensive views.

Skill Details

name: integrate
description: Combine heterogeneous data sources into a unified model with conflict resolution, schema alignment, and provenance tracking. Use when merging data from multiple systems, consolidating information, or building comprehensive views.
argument-hint: "[sources] [target_schema] [conflict_strategy]"
disable-model-invocation: false
user-invocable: true
allowed-tools: Read, Grep
context: fork
agent: explore

Intent

Perform data integration to combine information from multiple heterogeneous sources into a unified, coherent model. This handles schema differences, resolves conflicts, and preserves provenance.

Success criteria:

  • All sources are incorporated or explicitly excluded
  • Schema mismatches are resolved consistently
  • Conflicts are documented with resolution rationale
  • Provenance traces every value to its source
  • Information loss is minimized and documented

World Modeling Context: Integrate is essential for building comprehensive world-state from multiple observation sources. It combines outputs from retrieve, search, and inspect into unified models, and depends on identity-resolution for entity deduplication.

Hard dependencies:

  • Requires identity-resolution when sources contain overlapping entities

Compatible schemas:

  • schemas/output_schema.yaml

Inputs

| Parameter | Required | Type | Description |
|-----------|----------|------|-------------|
| sources | Yes | array | Data sources to integrate (files, API outputs, retrieved data) |
| target_schema | No | object | Schema the integrated data should conform to |
| conflict_strategy | No | string | prefer_recent, prefer_authoritative, merge, or manual (default: prefer_authoritative) |
| constraints | No | object | Required fields, validation rules |

Procedure

  1. Analyze sources: Understand what each source contains

    • Inventory entities and attributes per source
    • Identify source schemas (explicit or inferred)
    • Note data freshness and authority
    • Detect overlapping vs. unique content
  2. Align schemas: Map source schemas to target

    • Identify equivalent fields (same meaning, different names)
    • Map types (handle type conversions)
    • Handle missing fields (default values, nulls)
    • Document unmappable fields (information loss)
  3. Resolve entity identity: Determine which records refer to the same entity

    • Apply identity-resolution for overlapping entities
    • Assign canonical IDs
    • Track all source IDs as aliases
  4. Merge attributes: For each entity, combine attributes from sources

    • Non-overlapping: Simply include from source
    • Overlapping, matching: Confirm consistency
    • Overlapping, conflicting: Apply conflict strategy
  5. Apply conflict strategy (see the sketch after this procedure):

    For prefer_recent:

    • Use the most recently updated value
    • Requires reliable timestamps

    For prefer_authoritative:

    • Rank sources by authority
    • Use value from most authoritative source

    For merge:

    • Combine values where possible (lists, sets)
    • For scalars, may require manual resolution

    For manual:

    • Flag all conflicts for human review
    • Do not auto-resolve
  6. Validate integrated data: Check against target schema

    • Required fields present
    • Type constraints satisfied
    • Relationship integrity maintained
    • Custom invariants hold
  7. Document integration: Record what happened

    • Per-field provenance (which source)
    • Conflict resolutions with rationale
    • Information loss (unmapped fields)
    • Transformations applied
  8. Produce output: Generate integrated model

    • Conforms to target schema
    • Includes provenance metadata
    • Documents confidence levels
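
A minimal Python sketch of steps 4-5, the merge and conflict-resolution core. This is illustrative only, not the skill's implementation: the (value, timestamp) record shape, the AUTHORITY ranking, and the needs_review flag are assumptions chosen for the example.

from datetime import datetime

# Hypothetical authority ranking per source (higher wins under prefer_authoritative).
AUTHORITY = {"kubernetes": 2, "datadog": 1, "package.json": 2}

def resolve_conflict(candidates, strategy="prefer_authoritative"):
    """Pick one value from [(source, value, timestamp), ...] per the chosen strategy."""
    if strategy == "prefer_recent":
        # Requires reliable timestamps (ISO 8601 assumed here).
        source, value, _ = max(candidates, key=lambda c: datetime.fromisoformat(c[2].replace("Z", "+00:00")))
    elif strategy == "prefer_authoritative":
        source, value, _ = max(candidates, key=lambda c: AUTHORITY.get(c[0], 0))
    elif strategy == "merge" and all(isinstance(c[1], list) for c in candidates):
        merged = []
        for _, value, _ in candidates:
            merged.extend(v for v in value if v not in merged)
        return {"source": "merged", "value": merged, "needs_review": False}
    else:  # "manual", or a merge of scalars: flag instead of auto-resolving
        return {"source": None, "value": None, "needs_review": True}
    return {"source": source, "value": value, "needs_review": False}

def merge_entity(per_source_attrs, strategy):
    """per_source_attrs: {source: {field: (value, timestamp)}} -> attributes, source_map, conflicts."""
    attributes, source_map, conflicts = {}, {}, []
    fields = {f for attrs in per_source_attrs.values() for f in attrs}
    for field in sorted(fields):
        candidates = [(s, attrs[field][0], attrs[field][1])
                      for s, attrs in per_source_attrs.items() if field in attrs]
        if len({repr(c[1]) for c in candidates}) == 1:
            # Non-overlapping or matching: include directly, record provenance.
            attributes[field] = candidates[0][1]
            source_map[field] = {"source": candidates[0][0]}
        else:
            # Conflicting: apply the strategy and document the resolution.
            resolution = resolve_conflict(candidates, strategy)
            conflicts.append({"field": field, "values": candidates, "resolution": resolution})
            if not resolution["needs_review"]:
                attributes[field] = resolution["value"]
                source_map[field] = {"source": resolution["source"]}
    return attributes, source_map, conflicts

Anything the sketch leaves in needs_review should be surfaced for human review rather than silently resolved, in line with the safety constraints below.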

Output Contract

Return a structured object:

integrated_data:
  schema: string | object  # Target schema reference or inline
  entities:
    - id: string  # Canonical ID
      type: string  # Entity type
      attributes: object  # Integrated attributes
      source_map:  # Where each attribute came from
        attribute_name:
          source: string
          original_value: any | null  # If transformed
          confidence: number
  relationships:
    - source: string
      target: string
      type: string
      provenance: string
integration_report:
  sources_used: array[string]
  sources_excluded: array[string]
  entity_count:
    total: integer
    from_single_source: integer
    merged: integer
  field_mapping:
    - source_field: string
      target_field: string
      transformation: string | null
  conflicts:
    - entity_id: string
      field: string
      values:
        - source: string
          value: any
      resolution: string
      resolved_value: any
      confidence: number
  information_loss:
    - source: string
      field: string
      reason: string  # unmappable, filtered, etc.
      impact: low | medium | high
  validation_results:
    passed: boolean
    errors: array[string]
    warnings: array[string]
provenance:
  sources: array[object]  # Source metadata
  integration_timestamp: string
  confidence: number
confidence: number  # 0.0-1.0 overall
evidence_anchors: array[string]
assumptions: array[string]

Field Definitions

| Field | Type | Description |
|-------|------|-------------|
| integrated_data | object | The unified data model |
| source_map | object | Per-attribute source tracking |
| integration_report.conflicts | array | All conflicts and resolutions |
| information_loss | array | Data that couldn't be integrated |
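
Given output conforming to this contract, per-value provenance is a direct lookup through source_map. A minimal sketch, assuming the output has been parsed into a Python dict; the helper name trace_value is hypothetical:

def trace_value(integrated_data, entity_id, field):
    """Return the source_map record behind one attribute of one entity, or None."""
    for entity in integrated_data["entities"]:
        if entity["id"] == entity_id:
            return entity["source_map"].get(field)
    return None

# Using Example 1 below: trace_value(output["integrated_data"], "api-gateway", "status")
# -> {"source": "datadog", "original_value": "healthy", "confidence": 0.9}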

Examples

Example 1: Integrating Service Data from Multiple Sources

Input:

sources:
  - name: "kubernetes"
    type: "api"
    data:
      - id: "api-gateway-deployment"
        status: "Running"
        replicas: 3
        last_updated: "2025-01-24T10:00:00Z"
  - name: "datadog"
    type: "api"
    data:
      - service_name: "api-gateway"
        health: "healthy"
        latency_p99: 45
        last_updated: "2025-01-24T10:05:00Z"
  - name: "package.json"
    type: "file"
    data:
      - name: "api-gateway"
        version: "2.3.1"
        dependencies: ["express", "cors"]
target_schema: "microservices_v1"
conflict_strategy: prefer_recent

Output:

integrated_data:
  schema: "microservices_v1"
  entities:
    - id: "api-gateway"
      type: "Service"
      attributes:
        name: "api-gateway"
        version: "2.3.1"
        status: "healthy"
        replicas: 3
        latency_p99: 45
        dependencies: ["express", "cors"]
      source_map:
        name:
          source: "package.json"
          original_value: null
          confidence: 1.0
        version:
          source: "package.json"
          original_value: null
          confidence: 1.0
        status:
          source: "datadog"
          original_value: "healthy"
          confidence: 0.9
        replicas:
          source: "kubernetes"
          original_value: null
          confidence: 1.0
        latency_p99:
          source: "datadog"
          original_value: null
          confidence: 0.95
        dependencies:
          source: "package.json"
          original_value: null
          confidence: 1.0
  relationships: []
integration_report:
  sources_used: ["kubernetes", "datadog", "package.json"]
  sources_excluded: []
  entity_count:
    total: 1
    from_single_source: 0
    merged: 1
  field_mapping:
    - source_field: "kubernetes.status"
      target_field: "status"
      transformation: "Mapped 'Running' to 'healthy'"
    - source_field: "datadog.service_name"
      target_field: "name"
      transformation: null
  conflicts:
    - entity_id: "api-gateway"
      field: "status"
      values:
        - source: "kubernetes"
          value: "Running"
        - source: "datadog"
          value: "healthy"
      resolution: "prefer_recent (datadog more recent)"
      resolved_value: "healthy"
      confidence: 0.85
  information_loss:
    - source: "kubernetes"
      field: "labels"
      reason: "No target schema field for Kubernetes labels"
      impact: low
  validation_results:
    passed: true
    errors: []
    warnings:
      - "Service has no APIs defined"
provenance:
  sources:
    - name: "kubernetes"
      authority: high
      freshness: "2025-01-24T10:00:00Z"
    - name: "datadog"
      authority: medium
      freshness: "2025-01-24T10:05:00Z"
    - name: "package.json"
      authority: high
      freshness: "2025-01-20T00:00:00Z"
  integration_timestamp: "2025-01-24T12:00:00Z"
  confidence: 0.9
confidence: 0.9
evidence_anchors:
  - "kubernetes_api:deployments/api-gateway-deployment"
  - "datadog:services/api-gateway"
  - "package.json:1-20"
assumptions:
  - "All sources refer to the same service"
  - "Datadog health status more accurate than K8s Running state"

Evidence pattern: Mapped fields from three sources to unified schema, resolved status conflict using recency.


Example 2: Integrating Documentation with Code Analysis

Input:

sources:
  - name: "code_analysis"
    data:
      functions:
        - name: "processOrder"
          file: "src/orders.ts"
          parameters: ["orderId", "options"]
  - name: "api_docs"
    data:
      endpoints:
        - path: "/orders/process"
          method: "POST"
          description: "Process a pending order"
          parameters:
            - name: "order_id"
              required: true
target_schema: "api_documentation"
conflict_strategy: merge

Output:

integrated_data:
  schema: "api_documentation"
  entities:
    - id: "endpoint_process_order"
      type: "Endpoint"
      attributes:
        path: "/orders/process"
        method: "POST"
        description: "Process a pending order"
        implementation: "src/orders.ts:processOrder"
        parameters:
          - name: "order_id"
            code_name: "orderId"
            required: true
            source: "body"
      source_map:
        path: { source: "api_docs" }
        method: { source: "api_docs" }
        description: { source: "api_docs" }
        implementation: { source: "code_analysis" }
        parameters: { source: "merged" }
integration_report:
  sources_used: ["code_analysis", "api_docs"]
  entity_count:
    total: 1
    merged: 1
  conflicts:
    - entity_id: "endpoint_process_order"
      field: "parameters[0].name"
      values:
        - source: "code_analysis"
          value: "orderId"
        - source: "api_docs"
          value: "order_id"
      resolution: "merge - both names preserved (code vs API)"
      resolved_value: { name: "order_id", code_name: "orderId" }
      confidence: 0.95
  information_loss: []
confidence: 0.95
evidence_anchors:
  - "src/orders.ts:processOrder"
  - "api_docs:endpoints/process"

Verification

  • All source entities are accounted for (integrated or documented as excluded)
  • Conflicts have documented resolutions
  • Output validates against target schema
  • Provenance traces every attribute to source
  • Information loss is documented

Verification tools: Schema validators, provenance checkers
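
As one concrete option for the schema check, if the target schema is available as JSON Schema, the integrated output can be validated with the Python jsonschema package. A sketch only; the file paths are placeholders, not part of this skill:

import json
from jsonschema import Draft7Validator

# Hypothetical paths to the target schema and the integrated output.
with open("schemas/microservices_v1.json") as f:
    schema = json.load(f)
with open("integrated_output.json") as f:
    output = json.load(f)

validator = Draft7Validator(schema)
errors = [e.message for e in validator.iter_errors(output["integrated_data"])]

# Mirrors the contract's validation_results block.
validation_results = {"passed": not errors, "errors": errors, "warnings": []}
print(validation_results)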

Safety Constraints

  • mutation: false
  • requires_checkpoint: false
  • requires_approval: false
  • risk: low

Capability-specific rules:

  • Never silently drop data; document all information loss
  • When confidence < 0.5 on a merge, flag for review
  • Preserve source identifiers for traceability
  • Apply identity-resolution before merging overlapping entities

Composition Patterns

Commonly follows:

  • retrieve - Get data from sources before integrating
  • search - Discover relevant sources to integrate
  • identity-resolution - Resolve entity identity before merge

Commonly precedes:

  • world-state - Integration produces unified state
  • grounding - Integrated data needs grounding
  • diff-world-state - Compare integrated states over time

Anti-patterns:

  • Never integrate without understanding source schemas
  • Avoid ignoring conflicts; always resolve or flag them

Workflow references:

  • See reference/composition_patterns.md#world-model-build for integration in model construction
  • See reference/composition_patterns.md#digital-twin-sync-loop for ongoing integration

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Use when Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) creating new spreadsheets with formulas and formatting, (2) reading or analyzing data, (3) modifying existing spreadsheets while preserving formulas, (4) data analysis and visualization in spreadsheets, or (5) recalculating formulas

data

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

data, cli

Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis

data

Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

data

Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

design, data

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

testing, document, tool

Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

design, data

Anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.

art, tool, data

Skill Information

Category: Data
Allowed Tools: Read, Grep
Last Updated: 1/30/2026