name: dataset-comparer description: Compare two datasets to find differences, added/removed rows, changed values. Use for data validation, ETL verification, or tracking changes.

Dataset Comparer

Compare two CSV/Excel datasets to identify differences, additions, deletions, and value changes.

Features

Row Comparison: Find added, removed, and matching rows
Value Changes: Detect changed values in matching rows
Column Comparison: Identify schema differences
Statistics: Summary of differences
Diff Reports: HTML, CSV, and JSON output
Flexible Matching: Compare by key columns or row position

Quick Start

from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("old_data.csv", "new_data.csv")

# Compare by key column
diff = comparer.compare(key_columns=["id"])

print(f"Added rows: {diff['added_count']}")
print(f"Removed rows: {diff['removed_count']}")
print(f"Changed rows: {diff['changed_count']}")

# Generate report
comparer.generate_report("diff_report.html")

CLI Usage

# Basic comparison
python dataset_comparer.py --old old.csv --new new.csv

# Compare by key column
python dataset_comparer.py --old old.csv --new new.csv --key id

# Multiple key columns
python dataset_comparer.py --old old.csv --new new.csv --key id,date

# Generate HTML report
python dataset_comparer.py --old old.csv --new new.csv --key id --report diff.html

# Export differences to CSV
python dataset_comparer.py --old old.csv --new new.csv --key id --output diff.csv

# JSON output
python dataset_comparer.py --old old.csv --new new.csv --key id --json

# Ignore specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --ignore updated_at,modified_date

# Compare only specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --columns name,email,status

API Reference

DatasetComparer Class

class DatasetComparer:
    def __init__(self)

    # Data loading
    def load(self, old_path: str, new_path: str) -> 'DatasetComparer'
    def load_dataframes(self, old_df: pd.DataFrame,
                       new_df: pd.DataFrame) -> 'DatasetComparer'

    # Comparison
    def compare(self, key_columns: list = None,
               ignore_columns: list = None,
               compare_columns: list = None) -> dict

    # Detailed results
    def get_added_rows(self) -> pd.DataFrame
    def get_removed_rows(self) -> pd.DataFrame
    def get_changed_rows(self) -> pd.DataFrame
    def get_unchanged_rows(self) -> pd.DataFrame

    # Schema comparison
    def compare_schema(self) -> dict

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str
    def to_dataframe(self) -> pd.DataFrame
    def summary(self) -> str

Comparison Methods

Key-Based Comparison

Compare rows by matching key columns (like primary keys):

diff = comparer.compare(key_columns=["customer_id"])

# Multiple keys for composite matching
diff = comparer.compare(key_columns=["order_id", "product_id"])

Position-Based Comparison

Compare rows by their position (row number):

diff = comparer.compare()  # No keys = positional comparison

Output Format

Comparison Result

{
    "summary": {
        "old_rows": 1000,
        "new_rows": 1050,
        "added_count": 75,
        "removed_count": 25,
        "changed_count": 50,
        "unchanged_count": 900,
        "total_differences": 150
    },
    "schema_changes": {
        "added_columns": ["new_field"],
        "removed_columns": ["old_field"],
        "type_changes": [
            {"column": "amount", "old_type": "int64", "new_type": "float64"}
        ]
    },
    "key_columns": ["id"],
    "compared_columns": ["name", "email", "status"],
    "ignored_columns": ["updated_at"]
}

Changed Row Details

changes = comparer.get_changed_rows()

# Returns DataFrame with columns:
# _key: Key value(s) for the row
# _column: Column that changed
# _old_value: Original value
# _new_value: New value

Schema Comparison

Compare column structure:

schema = comparer.compare_schema()

# Returns:
{
    "old_columns": ["id", "name", "old_field"],
    "new_columns": ["id", "name", "new_field"],
    "common_columns": ["id", "name"],
    "added_columns": ["new_field"],
    "removed_columns": ["old_field"],
    "type_changes": [
        {"column": "price", "old_type": "int64", "new_type": "float64"}
    ],
    "old_row_count": 1000,
    "new_row_count": 1050
}

Filtering Options

Ignore Columns

Skip certain columns during comparison:

diff = comparer.compare(
    key_columns=["id"],
    ignore_columns=["updated_at", "modified_by", "timestamp"]
)

Compare Specific Columns

Only compare selected columns:

diff = comparer.compare(
    key_columns=["id"],
    compare_columns=["name", "email", "status"]  # Only these columns
)

Report Formats

HTML Report

comparer.generate_report("diff_report.html", format="html")

Features:

Summary statistics
Interactive tables
Color-coded changes (green=added, red=removed, yellow=changed)
Schema comparison section

CSV Export

comparer.generate_report("diff_report.csv", format="csv")

Includes all differences in tabular format.

JSON Output

comparer.generate_report("diff_report.json", format="json")

Complete diff data in JSON format.

Example Workflows

Data Migration Validation

comparer = DatasetComparer()
comparer.load("source_data.csv", "migrated_data.csv")

diff = comparer.compare(key_columns=["id"])

if diff["summary"]["total_differences"] == 0:
    print("Migration successful - no differences!")
else:
    print(f"Found {diff['summary']['total_differences']} differences")
    comparer.generate_report("migration_issues.html")

ETL Pipeline Verification

comparer = DatasetComparer()
comparer.load("yesterday.csv", "today.csv")

diff = comparer.compare(
    key_columns=["transaction_id"],
    ignore_columns=["processing_timestamp"]
)

# Check for unexpected changes
changed = comparer.get_changed_rows()
if len(changed) > 0:
    print("Warning: Historical records changed!")
    print(changed)

Incremental Update Detection

comparer = DatasetComparer()
comparer.load("last_sync.csv", "current.csv")

diff = comparer.compare(key_columns=["customer_id"])

# Get new records for processing
new_records = comparer.get_added_rows()
print(f"New records to process: {len(new_records)}")

# Get deleted records
deleted = comparer.get_removed_rows()
print(f"Records to deactivate: {len(deleted)}")

Schema Change Detection

comparer = DatasetComparer()
comparer.load("v1_export.csv", "v2_export.csv")

schema = comparer.compare_schema()

if schema["added_columns"]:
    print(f"New columns: {schema['added_columns']}")

if schema["removed_columns"]:
    print(f"Removed columns: {schema['removed_columns']}")

if schema["type_changes"]:
    for change in schema["type_changes"]:
        print(f"Type change: {change['column']} "
              f"({change['old_type']} -> {change['new_type']})")

Large Dataset Tips

For very large datasets:

# Compare in chunks
comparer = DatasetComparer()
comparer.load("large_old.csv", "large_new.csv")

# Use key column for efficient matching
diff = comparer.compare(key_columns=["id"])

# Export only differences (not full data)
comparer.generate_report("diff_only.csv", format="csv")

Dependencies

pandas>=2.0.0
numpy>=1.24.0

Dataset Comparer

Skill Details

Repository Files

name: dataset-comparer description: Compare two datasets to find differences, added/removed rows, changed values. Use for data validation, ETL verification, or tracking changes.

Dataset Comparer

Features

Quick Start

CLI Usage

API Reference

DatasetComparer Class

Comparison Methods

Key-Based Comparison

Position-Based Comparison

Output Format

Comparison Result

Changed Row Details

Schema Comparison

Filtering Options

Ignore Columns

Compare Specific Columns

Report Formats

HTML Report

CSV Export

JSON Output

Example Workflows

Data Migration Validation

ETL Pipeline Verification

Incremental Update Detection

Schema Change Detection

Large Dataset Tips

Dependencies

Related Skills

Xlsx

Clickhouse Io

Clickhouse Io

Analyzing Financial Statements

Data Storytelling

Kpi Dashboard Design

Dbt Transformation Patterns

Sql Optimization Patterns

Anndata

Xlsx

Skill Information