Data_Cleaning

by benchflow-ai

data

Clean messy tabular datasets with deduplication, missing value imputation, outlier handling, and text processing. Use when dealing with dirty data that has duplicates, nulls, or inconsistent formatting.


Data Cleaning Framework

Comprehensive, modular data cleaning framework for tabular datasets. Provides strategy-based cleaning operations including deduplication, missing value imputation, text processing, and outlier handling through a configurable pipeline.

Core Components

CleaningStrategies

Collection of static methods for specific cleaning operations:

Deduplication

  • remove_duplicates(df, subset=None) - Remove duplicate rows, optionally considering only a subset of columns. Given the prompt and the dataset's context, carefully evaluate whether duplicates are legitimate for the downstream analysis. Always inspect the data first and remove duplicates based on the logic of the metadata and the downstream goal (see the sketch below).
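
The concrete implementation ships in the skill's script; a minimal sketch of the likely pandas shape (the printed row count is an illustrative assumption, not the skill's exact logging):

import pandas as pd

def remove_duplicates(df: pd.DataFrame, subset=None) -> pd.DataFrame:
    # Keep the first occurrence; subset=None compares entire rows.
    before = len(df)
    out = df.drop_duplicates(subset=subset, keep="first").reset_index(drop=True)
    print(f"remove_duplicates: dropped {before - len(out)} duplicate rows")
    return out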

Missing Value Handling

Logic: If the column is not critical to the downstream task and dropping the whole row would be an unnecessary cost, try to impute the missing values; otherwise, drop the rows. A sketch of the median and KNN variants follows the list.

  • drop_missing(df, columns) - Drop rows with missing values in the specified columns; use for critical fields whose absence would hurt the downstream task's performance.
  • impute_median(df, columns) - Impute numerical columns with their median.
  • impute_knn(df, target_features, n_neighbors=5) - KNN-based imputation for a target column whose value can plausibly be inferred from other columns. Carefully assess the semantic and logical relationships between columns in real-life terms before choosing this method, since imputation quality depends on those relationships.
  • impute_mode(df, columns) - Impute categorical columns with their mode; appropriate when the target column is relatively independent of the others and just needs an acceptable fill value.
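
The skill file holds the real implementations; a minimal sketch consistent with the signatures above, assuming pandas plus scikit-learn's KNNImputer as the backing libraries:

import pandas as pd
from sklearn.impute import KNNImputer

def impute_median(df, columns):
    # Fill numeric columns with their column median.
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out

def impute_knn(df, target_features, n_neighbors=5):
    # target_features maps a target column to the feature columns that predict it,
    # e.g. {'income': {'features': ['age', 'education'], 'type': 'numeric'}}.
    # KNNImputer needs numeric inputs, so encode categorical features first.
    out = df.copy()
    imputer = KNNImputer(n_neighbors=n_neighbors)
    for target, spec in target_features.items():
        cols = spec['features'] + [target]
        out[cols] = imputer.fit_transform(out[cols])
    return out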

Text Processing

  • process_text(df, columns, operation) - Text operations: some columns may need text processing to extract numbers or strip unnecessary content. Preview the column values against the downstream goal to determine whether this is needed. Supported operations: extract_numbers, clean_whitespace, extract_email, lowercase, remove_special. A dispatch sketch follows this bullet.
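
A sketch of how these operations might be dispatched, assuming pandas string methods; the regexes are illustrative, not necessarily the skill's exact patterns:

import pandas as pd

def process_text(df, columns, operation):
    ops = {
        'extract_numbers': lambda s: s.str.extract(r'(\d+\.?\d*)', expand=False).astype(float),
        'clean_whitespace': lambda s: s.str.strip().str.replace(r'\s+', ' ', regex=True),
        'extract_email': lambda s: s.str.extract(r'([\w.+-]+@[\w-]+\.[\w.-]+)', expand=False),
        'lowercase': lambda s: s.str.lower(),
        'remove_special': lambda s: s.str.replace(r'[^\w\s]', '', regex=True),
    }
    out = df.copy()
    for col in columns:
        # astype(str) renders missing values as 'nan'; handle NaNs first if that matters.
        out[col] = ops[operation](out[col].astype(str))
    return out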

Outlier Handling

Logic: Again, weigh whether removing outliers is costly in terms of sample size or the critical columns under investigation. Either keep the rows by capping outlier values at more reasonable bounds, or remove the whole rows when doing so barely affects the analysis. A sketch of the capping variant follows the list.

  • cap_outliers_iqr(df, columns, multiplier=1.5) - Winsorize using IQR bounds, preserving the rest of the row's signal
  • remove_outliers_iqr(df, columns, multiplier=1.5) - Remove outliers using IQR
  • remove_outliers_zscore(df, columns, threshold=3.0) - Remove outliers using Z-score
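
A sketch of the capping variant, assuming pandas; clip() performs the winsorization at the IQR fences:

def cap_outliers_iqr(df, columns, multiplier=1.5):
    # Cap values outside [Q1 - m*IQR, Q3 + m*IQR] at the fence instead of dropping rows.
    out = df.copy()
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - multiplier * iqr, upper=q3 + multiplier * iqr)
    return out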

DataCleaningPipeline

Orchestrates multiple cleaning steps with logging and progress tracking.
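
The actual class ships in the skill's script; one plausible shape consistent with the usage below (the log fields and print format are assumptions):

import pandas as pd

class DataCleaningPipeline:
    def __init__(self, name='pipeline'):
        self.name = name
        self.steps = []
        self._log = []

    def add_step(self, func, description='', **kwargs):
        # Store the function and its kwargs; return self to allow method chaining.
        self.steps.append((func, kwargs, description))
        return self

    def execute(self, df, verbose=False):
        self._log = []
        for func, kwargs, desc in self.steps:
            before = len(df)
            df = func(df, **kwargs)
            self._log.append({'step': func.__name__, 'description': desc,
                              'rows_before': before, 'rows_after': len(df)})
            if verbose:
                print(f'[{self.name}] {func.__name__}: {before} -> {len(df)} rows ({desc})')
        return df

    def get_log(self):
        return pd.DataFrame(self._log)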

Usage

from data_cleaning import CleaningStrategies, DataCleaningPipeline

# Create pipeline
pipeline = DataCleaningPipeline(name="Transaction Data")

# Add cleaning steps
pipeline.add_step(
    CleaningStrategies.remove_duplicates,
    subset=['customer_id', 'order_date'],
    description="Remove duplicate orders"
).add_step(
    CleaningStrategies.drop_missing,
    columns=['customer_id', 'amount'],
    description="Drop rows with missing critical fields"
).add_step(
    CleaningStrategies.impute_knn,
    target_features={
        'income': {'features': ['age', 'education'], 'type': 'numeric'}
    },
    description="KNN imputation for income"
).add_step(
    CleaningStrategies.cap_outliers_iqr,
    columns=['amount'],
    multiplier=1.5,
    description="Cap price outliers"
)

# Execute pipeline
df_clean = pipeline.execute(raw_df, verbose=True)

# Get execution log
log_df = pipeline.get_log()

Input

  • Any tabular DataFrame requiring cleaning

Output

  • Cleaned DataFrame with execution log available

Best Practices

  • Chain multiple steps using method chaining
  • Use verbose=True to monitor pipeline progress
  • Review execution log for quality assurance
  • Extend CleaningStrategies with domain-specific methods as needed; a hypothetical example follows
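
For the last point, any callable that takes a DataFrame first and returns a DataFrame plugs into add_step. The function below (normalize_currency) is invented here purely for illustration:

def normalize_currency(df, columns, symbol='$'):
    # Hypothetical domain-specific step: strip a currency symbol and
    # thousands separators, then cast the column to float.
    out = df.copy()
    for col in columns:
        out[col] = (out[col].astype(str)
                    .str.replace(symbol, '', regex=False)
                    .str.replace(',', '', regex=False)
                    .astype(float))
    return out

pipeline.add_step(normalize_currency, columns=['amount'],
                  description="Normalize currency strings")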

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Use when Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas

data

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

data, cli


Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis

data

Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

data

Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

design, data

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

testing, document, tool

Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

design, data

Anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.

art, tool, data

Xlsx

Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.

tool, data

Skill Information

Category: Data
Last Updated: 1/29/2026