Data Quality
by NeverSight
Diagnose and fix data quality problems in datasets. Use when working with dirty data, finding duplicates, handling missing values, detecting outliers/anomalies, validating constraints (functional dependencies, referential integrity), profiling datasets, or cleaning data for analysis or ML. Covers the full data quality lifecycle - define, detect, clean, measure.
Skill Details
name: data-quality
description: Diagnose and fix data quality problems in datasets. Use when working with dirty data, finding duplicates, handling missing values, detecting outliers/anomalies, validating constraints (functional dependencies, referential integrity), profiling datasets, or cleaning data for analysis or ML. Covers the full data quality lifecycle - define, detect, clean, measure.
Data Quality Skill
Systematic approach to diagnosing and fixing data quality problems.
Data Quality Process
Define & Identify → Detect & Quantify → Clean & Rectify → Measure & Verify
- Define: Understand data context, business rules, quality requirements
- Detect: Profile data, find glitches (missing, duplicates, outliers, violations)
- Clean: Apply appropriate repair strategies
- Measure: Validate repairs, quantify improvement
Quick Reference
| Problem | Script | Key Function |
|---|---|---|
| Data overview | data_profiling.py | profile_dataframe(df) |
| Find quality issues | data_profiling.py | detect_glitches(df) |
| Missing values | missing_data.py | analyze_missing(df) |
| Imputation | missing_data.py | impute_mean/median/regression() |
| Duplicates | duplicate_detection.py | find_duplicates(df, cols) |
| Deduplication | duplicate_detection.py | deduplicate(df, cols) |
| Outliers | anomaly_detection.py | detect_anomalies(df) |
| Constraint check | constraint_checking.py | validate_constraints(df, rules) |
| String matching | similarity_metrics.py | jaro_winkler_similarity() |
Workflow
Step 1: Profile the Data
from scripts.data_profiling import profile_dataframe, detect_glitches, generate_quality_report
# Quick overview
print(generate_quality_report(df))
# Detailed profile
profile = profile_dataframe(df)
# Find issues
glitches = detect_glitches(df)
Step 2: Analyze Specific Issues
Missing Data:
from scripts.missing_data import analyze_missing, test_mcar
analysis = analyze_missing(df)
# Check if safe to delete rows
mcar_test = test_mcar(df, 'column_with_missing', ['other_cols'])
Duplicates:
from scripts.duplicate_detection import find_duplicates, cluster_duplicates
matches = find_duplicates(df, ['name', 'email'], threshold=0.85)
clusters = cluster_duplicates(matches)
Outliers:
from scripts.anomaly_detection import detect_anomalies, iqr_outliers
# Multi-column summary
anomalies = detect_anomalies(df, method='iqr')
# Single column detail
result = iqr_outliers(df, 'price', multiplier=1.5)
Constraints:
from scripts.constraint_checking import validate_constraints
constraints = [
    {'type': 'unique', 'columns': ['id']},
    {'type': 'not_null', 'columns': ['name', 'email']},
    {'type': 'fd', 'determinant': ['id'], 'dependent': ['name']},
    {'type': 'domain', 'column': 'age', 'min_value': 0, 'max_value': 150},
]
results = validate_constraints(df, constraints)
Step 3: Clean the Data
Handle Missing:
from scripts.missing_data import impute_median, impute_regression, listwise_deletion
# Simple: median for numeric
df_clean = impute_median(df, 'age')
# Better: regression-based
df_clean = impute_regression(df, 'income', ['age', 'education'])
# If MCAR confirmed
df_clean = listwise_deletion(df)
Remove Duplicates:
from scripts.duplicate_detection import deduplicate
df_clean, summary = deduplicate(
    df,
    columns=['name', 'email', 'address'],
    threshold=0.8,
    merge_strategy='most_complete'
)
print(f"Reduced from {summary['original_rows']} to {summary['final_rows']} rows")
Handle Outliers:
# Cap extreme values
q01, q99 = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(q01, q99)
# Or remove: drop the rows flagged as outliers by their index
outlier_idx = detect_anomalies(df)['col']['outlier_indices']
df_clean = df.drop(index=outlier_idx)
Step 4: Validate
Re-run profiling and constraint checks on cleaned data to verify improvements.
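A minimal re-check sketch using the same helpers from Steps 1 and 2 (the exact shape of their return values is an assumption here; adapt the prints to what they actually produce):
from scripts.data_profiling import generate_quality_report
from scripts.constraint_checking import validate_constraints
# Re-profile and compare against the pre-cleaning report
print(generate_quality_report(df))        # before cleaning
print(generate_quality_report(df_clean))  # after cleaning
# Re-run the same constraint set; fewer violations means the repairs worked
results_after = validate_constraints(df_clean, constraints)
print(results_after)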
References
For deeper understanding:
- references/dimensions.md: Data quality dimensions (accuracy, completeness, etc.)
- references/glitch_taxonomy.md: Types of data glitches and detection approaches
- references/repair_strategies.md: Detailed repair and cleaning strategies
Key Concepts
Data Quality = Fit for Use
- Free of defects
- Has features needed for the task
- Right information, right place, right time
Missing Data Mechanisms:
- MCAR: Missing Completely At Random (safe to delete)
- MAR: Missing At Random (imputation may work)
- MNAR: Missing Not At Random (most problematic)
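A rough, self-contained illustration of the MCAR-vs-MAR distinction (not the skill's test_mcar implementation; the data is made up): if rows missing 'income' differ systematically on 'age', the missingness is likely MAR rather than MCAR, so deleting those rows would bias the sample.
import pandas as pd
from scipy import stats
# Hypothetical frame where 'income' has gaps
df = pd.DataFrame({
    'age': [25, 31, 47, 52, 38, 61, 29, 44],
    'income': [40e3, None, 82e3, None, 55e3, None, 38e3, 70e3],
})
# Compare 'age' between rows with and without 'income'
missing = df['income'].isna()
t, p = stats.ttest_ind(df.loc[missing, 'age'], df.loc[~missing, 'age'])
# Large p: no evidence missingness depends on age (consistent with MCAR, deletion safer)
# Small p: missingness related to age (MAR/MNAR), prefer imputation over deletion
print(f"t={t:.2f}, p={p:.3f}")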
Constraints:
- Functional Dependency: X → Y means X uniquely determines Y
- Referential Integrity: foreign keys reference valid primary keys
- Domain Constraints: values within allowed set/range
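These three checks can also be spot-checked in plain pandas before reaching for validate_constraints. A sketch, assuming df, orders, and customers DataFrames are already loaded and using hypothetical column names:
import pandas as pd
# Functional dependency id -> name: each id should map to exactly one name
fd_counts = df.groupby('id')['name'].nunique()
print(fd_counts[fd_counts > 1])  # ids with conflicting names violate the FD
# Referential integrity: every orders.customer_id must exist in customers.id
orphans = orders[~orders['customer_id'].isin(customers['id'])]
print(len(orphans), "orphaned foreign keys")
# Domain constraint: age within [0, 150]
out_of_range = df[~df['age'].between(0, 150)]
print(len(out_of_range), "out-of-range ages")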
Entity Resolution:
- Blocking reduces O(n²) to O(n·window)
- Similarity metrics: Jaro-Winkler (names), Levenshtein (typos), Jaccard (sets)
- Cluster by transitive closure, merge by strategy
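A sketch of the blocking idea (sorted-neighborhood with a sliding window), independent of how duplicate_detection.py actually implements it:
# Sorted-neighborhood blocking: sort on a cheap key, compare only nearby rows
def candidate_pairs(df, key_col, window=5):
    # Blocking key: lowercased, stripped value of key_col
    order = df.assign(_key=df[key_col].str.lower().str.strip()) \
              .sort_values('_key').index.tolist()
    pairs = []
    for i, idx_a in enumerate(order):
        for idx_b in order[i + 1:i + window]:
            pairs.append((idx_a, idx_b))  # O(n * window) instead of O(n^2)
    return pairs
# Each candidate pair is then scored with a similarity metric (e.g., Jaro-Winkler
# on names) and matches are clustered by transitive closure before merging.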
Similarity Metrics Comparison
| Metric | Best For | Example |
|---|---|---|
| Jaro-Winkler | Names, short strings | "Robert" vs "Rupert" |
| Levenshtein | Typos, edit distance | "recieve" vs "receive" |
| Jaccard | Token/word comparison | "John Doe" vs "Doe, John" |
| Q-gram | Fuzzy substring matching | Partial matches |
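A quick comparison sketch: jaro_winkler_similarity comes from the skill's similarity_metrics.py (the two-string call signature is assumed), while the token-Jaccard helper below is an inline illustration, not part of that script:
from scripts.similarity_metrics import jaro_winkler_similarity
def jaccard_tokens(a, b):
    # Jaccard similarity on word sets: |A ∩ B| / |A ∪ B|
    sa = set(a.lower().replace(',', '').split())
    sb = set(b.lower().replace(',', '').split())
    return len(sa & sb) / len(sa | sb)
print(jaro_winkler_similarity("Robert", "Rupert"))  # high: similar short names
print(jaccard_tokens("John Doe", "Doe, John"))      # 1.0: same tokens, different order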
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis.
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
