# Dataset Comparer

by dkyazzentwatwa
```yaml
name: dataset-comparer
description: Compare two datasets to find differences, added/removed rows, changed values. Use for data validation, ETL verification, or tracking changes.
```
Compare two CSV/Excel datasets to identify differences, additions, deletions, and value changes.
## Features

- **Row Comparison**: Find added, removed, and matching rows
- **Value Changes**: Detect changed values in matching rows
- **Column Comparison**: Identify schema differences
- **Statistics**: Summary of differences
- **Diff Reports**: HTML, CSV, and JSON output
- **Flexible Matching**: Compare by key columns or row position
## Quick Start

```python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("old_data.csv", "new_data.csv")

# Compare by key column
diff = comparer.compare(key_columns=["id"])
print(f"Added rows: {diff['summary']['added_count']}")
print(f"Removed rows: {diff['summary']['removed_count']}")
print(f"Changed rows: {diff['summary']['changed_count']}")

# Generate report
comparer.generate_report("diff_report.html")
```
## CLI Usage

```bash
# Basic comparison
python dataset_comparer.py --old old.csv --new new.csv

# Compare by key column
python dataset_comparer.py --old old.csv --new new.csv --key id

# Multiple key columns
python dataset_comparer.py --old old.csv --new new.csv --key id,date

# Generate HTML report
python dataset_comparer.py --old old.csv --new new.csv --key id --report diff.html

# Export differences to CSV
python dataset_comparer.py --old old.csv --new new.csv --key id --output diff.csv

# JSON output
python dataset_comparer.py --old old.csv --new new.csv --key id --json

# Ignore specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --ignore updated_at,modified_date

# Compare only specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --columns name,email,status
```
## API Reference

### DatasetComparer Class

```python
class DatasetComparer:
    def __init__(self)

    # Data loading
    def load(self, old_path: str, new_path: str) -> 'DatasetComparer'
    def load_dataframes(self, old_df: pd.DataFrame,
                        new_df: pd.DataFrame) -> 'DatasetComparer'

    # Comparison
    def compare(self, key_columns: list = None,
                ignore_columns: list = None,
                compare_columns: list = None) -> dict

    # Detailed results
    def get_added_rows(self) -> pd.DataFrame
    def get_removed_rows(self) -> pd.DataFrame
    def get_changed_rows(self) -> pd.DataFrame
    def get_unchanged_rows(self) -> pd.DataFrame

    # Schema comparison
    def compare_schema(self) -> dict

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str
    def to_dataframe(self) -> pd.DataFrame
    def summary(self) -> str
```
## Comparison Methods

### Key-Based Comparison

Compare rows by matching key columns (like primary keys):

```python
diff = comparer.compare(key_columns=["customer_id"])

# Multiple keys for composite matching
diff = comparer.compare(key_columns=["order_id", "product_id"])
```
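Under the hood, key-based matching of this sort can be sketched with a plain pandas outer merge using `indicator=True`; this is an illustrative standalone example, not the library's actual implementation:

```python
import pandas as pd

def key_diff(old: pd.DataFrame, new: pd.DataFrame, keys: list) -> dict:
    """Classify rows as added/removed/common by their key columns."""
    merged = old[keys].merge(new[keys], on=keys, how="outer", indicator=True)
    return {
        "added": merged.loc[merged["_merge"] == "right_only", keys],
        "removed": merged.loc[merged["_merge"] == "left_only", keys],
        "common": merged.loc[merged["_merge"] == "both", keys],
    }

old = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
new = pd.DataFrame({"id": [2, 3, 4], "name": ["b", "c2", "d"]})

diff = key_diff(old, new, ["id"])
print(len(diff["added"]), len(diff["removed"]), len(diff["common"]))  # 1 1 2
```

A real implementation also has to handle duplicate keys and then compare the non-key columns of the `both` rows to find changed values.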
### Position-Based Comparison

Compare rows by their position (row number):

```python
diff = comparer.compare()  # No keys = positional comparison
```
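Positional matching can likewise be sketched with plain pandas: rows are paired by position, the overhang counts as added or removed, and `DataFrame.compare` flags changed cells (illustrative only, assuming both files share the same columns):

```python
import pandas as pd

old = pd.DataFrame({"name": ["a", "b", "c"]})
new = pd.DataFrame({"name": ["a", "B", "c", "d"]})

n = min(len(old), len(new))  # rows that can be paired by position
changed = old.iloc[:n].reset_index(drop=True).compare(
    new.iloc[:n].reset_index(drop=True))
added = new.iloc[n:]      # overhang in the new file
removed = old.iloc[n:]    # overhang in the old file

print(len(changed), len(added), len(removed))  # 1 1 0
```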
## Output Format

### Comparison Result

```json
{
    "summary": {
        "old_rows": 1000,
        "new_rows": 1050,
        "added_count": 75,
        "removed_count": 25,
        "changed_count": 50,
        "unchanged_count": 925,
        "total_differences": 150
    },
    "schema_changes": {
        "added_columns": ["new_field"],
        "removed_columns": ["old_field"],
        "type_changes": [
            {"column": "amount", "old_type": "int64", "new_type": "float64"}
        ]
    },
    "key_columns": ["id"],
    "compared_columns": ["name", "email", "status"],
    "ignored_columns": ["updated_at"]
}
```
### Changed Row Details

```python
changes = comparer.get_changed_rows()
# Returns a DataFrame with columns:
#   _key:       Key value(s) for the row
#   _column:    Column that changed
#   _old_value: Original value
#   _new_value: New value
```
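Because each changed cell becomes one row, this long format is easy to aggregate; for example, finding the most frequently changed column (the sample DataFrame below stands in for real comparison output):

```python
import pandas as pd

# Sample long-format change records: one row per changed cell
changes = pd.DataFrame({
    "_key": [101, 101, 102, 103],
    "_column": ["email", "status", "status", "status"],
    "_old_value": ["a@x.com", "active", "active", "trial"],
    "_new_value": ["a@y.com", "churned", "paused", "active"],
})

top = changes["_column"].value_counts()
print(top.idxmax(), int(top.max()))  # status 3
```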
## Schema Comparison

Compare column structure:

```python
schema = comparer.compare_schema()
# Returns:
{
    "old_columns": ["id", "name", "old_field"],
    "new_columns": ["id", "name", "new_field"],
    "common_columns": ["id", "name"],
    "added_columns": ["new_field"],
    "removed_columns": ["old_field"],
    "type_changes": [
        {"column": "price", "old_type": "int64", "new_type": "float64"}
    ],
    "old_row_count": 1000,
    "new_row_count": 1050
}
```
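The same schema diff can be derived with set operations over the column lists plus a dtype check; a self-contained sketch, independent of the library:

```python
import pandas as pd

old = pd.DataFrame({"id": [1], "name": ["a"], "old_field": [0], "price": [10]})
new = pd.DataFrame({"id": [1], "name": ["a"], "new_field": [0], "price": [10.5]})

common = [c for c in old.columns if c in new.columns]
schema = {
    "added_columns": [c for c in new.columns if c not in old.columns],
    "removed_columns": [c for c in old.columns if c not in new.columns],
    "type_changes": [
        {"column": c, "old_type": str(old[c].dtype), "new_type": str(new[c].dtype)}
        for c in common
        if old[c].dtype != new[c].dtype
    ],
}
print(schema["added_columns"], schema["removed_columns"])  # ['new_field'] ['old_field']
```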
## Filtering Options

### Ignore Columns

Skip certain columns during comparison:

```python
diff = comparer.compare(
    key_columns=["id"],
    ignore_columns=["updated_at", "modified_by", "timestamp"]
)
```

### Compare Specific Columns

Only compare selected columns:

```python
diff = comparer.compare(
    key_columns=["id"],
    compare_columns=["name", "email", "status"]  # Only these columns
)
```
## Report Formats

### HTML Report

```python
comparer.generate_report("diff_report.html", format="html")
```

Features:

- Summary statistics
- Interactive tables
- Color-coded changes (green = added, red = removed, yellow = changed)
- Schema comparison section

### CSV Export

```python
comparer.generate_report("diff_report.csv", format="csv")
```

Includes all differences in tabular format.

### JSON Output

```python
comparer.generate_report("diff_report.json", format="json")
```

Complete diff data in JSON format.
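The JSON report lends itself to automation, such as gating a pipeline step on the difference counts. A minimal sketch, assuming the file mirrors the comparison-result structure shown earlier (the inline string stands in for reading a real report file):

```python
import json

# Stand-in for: report = json.load(open("diff_report.json"))
report = json.loads('{"summary": {"total_differences": 150,'
                    ' "added_count": 75, "removed_count": 25,'
                    ' "changed_count": 50}}')

s = report["summary"]
if s["total_differences"] > 0:
    print(f"Datasets differ: +{s['added_count']} "
          f"-{s['removed_count']} ~{s['changed_count']}")
# Datasets differ: +75 -25 ~50
```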
## Example Workflows

### Data Migration Validation

```python
comparer = DatasetComparer()
comparer.load("source_data.csv", "migrated_data.csv")

diff = comparer.compare(key_columns=["id"])
if diff["summary"]["total_differences"] == 0:
    print("Migration successful - no differences!")
else:
    print(f"Found {diff['summary']['total_differences']} differences")
    comparer.generate_report("migration_issues.html")
```

### ETL Pipeline Verification

```python
comparer = DatasetComparer()
comparer.load("yesterday.csv", "today.csv")

diff = comparer.compare(
    key_columns=["transaction_id"],
    ignore_columns=["processing_timestamp"]
)

# Check for unexpected changes
changed = comparer.get_changed_rows()
if len(changed) > 0:
    print("Warning: Historical records changed!")
    print(changed)
```

### Incremental Update Detection

```python
comparer = DatasetComparer()
comparer.load("last_sync.csv", "current.csv")

diff = comparer.compare(key_columns=["customer_id"])

# Get new records for processing
new_records = comparer.get_added_rows()
print(f"New records to process: {len(new_records)}")

# Get deleted records
deleted = comparer.get_removed_rows()
print(f"Records to deactivate: {len(deleted)}")
```

### Schema Change Detection

```python
comparer = DatasetComparer()
comparer.load("v1_export.csv", "v2_export.csv")

schema = comparer.compare_schema()
if schema["added_columns"]:
    print(f"New columns: {schema['added_columns']}")
if schema["removed_columns"]:
    print(f"Removed columns: {schema['removed_columns']}")
for change in schema["type_changes"]:
    print(f"Type change: {change['column']} "
          f"({change['old_type']} -> {change['new_type']})")
```
## Large Dataset Tips

For very large datasets:

```python
comparer = DatasetComparer()
comparer.load("large_old.csv", "large_new.csv")

# Use a key column for efficient matching
diff = comparer.compare(key_columns=["id"])

# Export only the differences (not the full data)
comparer.generate_report("diff_only.csv", format="csv")
```
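When a file is too large to load in one piece, the added/removed key sets can be computed in a streaming pass that reads only the key column; a sketch using `read_csv` with `usecols` and `chunksize` (illustrative, not part of the library's API):

```python
import io
import pandas as pd

def key_set(csv_source, key, chunksize=100_000):
    """Stream a CSV and collect its key values without loading full rows."""
    keys = set()
    for chunk in pd.read_csv(csv_source, usecols=[key], chunksize=chunksize):
        keys.update(chunk[key].tolist())
    return keys

# In-memory stand-ins for large files on disk
old_csv = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
new_csv = io.StringIO("id,name\n2,b\n3,c\n4,d\n")

old_keys = key_set(old_csv, "id")
new_keys = key_set(new_csv, "id")
print(sorted(new_keys - old_keys), sorted(old_keys - new_keys))  # [4] [1]
```

Memory then scales with the number of keys rather than the full row width, at the cost of a second pass if changed values must also be compared.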
## Dependencies

- `pandas>=2.0.0`
- `numpy>=1.24.0`
