Data Cleaner
by SPIRAL-EDWIN
Automated data cleaning and preprocessing for MCM/ICM competitions. Use when loading raw CSV/Excel data that needs missing value handling, outlier detection, normalization, and quality reporting. Produces clean datasets and statistical summaries ready for modeling.
Data-Cleaner
Systematically clean and preprocess raw data to ensure quality inputs for mathematical models.
When to Use
- After downloading or web scraping raw data
- Before any modeling or analysis
- When data has missing values, outliers, or inconsistent formats
- When you need to document data quality for the paper
Standard Cleaning Pipeline
Step 1: Load and Inspect
```python
import pandas as pd
import numpy as np

def load_data(filepath):
    """Load data with automatic format detection."""
    # Detect file type by extension
    if filepath.endswith('.csv'):
        df = pd.read_csv(filepath)
    elif filepath.endswith(('.xls', '.xlsx')):
        df = pd.read_excel(filepath)
    elif filepath.endswith('.json'):
        df = pd.read_json(filepath)
    else:
        raise ValueError(f"Unsupported file format: {filepath}")

    # Initial inspection
    print(f"Data shape: {df.shape}")
    print(f"\nColumn types:\n{df.dtypes}")
    print(f"\nFirst few rows:\n{df.head()}")
    print(f"\nMissing values:\n{df.isnull().sum()}")
    print(f"\nBasic statistics:\n{df.describe()}")
    return df
```
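The inspection above does not flag duplicate rows, which the Quality Checklist below calls for; a two-line addition inside or after `load_data`:

```python
# Count exact duplicate rows as part of the initial inspection
print(f"\nDuplicate rows: {df.duplicated().sum()}")
```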
Step 2: Handle Missing Values
```python
def handle_missing_values(df, strategy='auto', threshold=0.5):
    """
    Handle missing data with an appropriate strategy.

    Parameters:
    - strategy: 'auto', 'drop', 'mean', 'median', 'forward_fill', 'interpolate'
    - threshold: drop a column if its missing fraction exceeds this value
    """
    # Record initial state
    initial_shape = df.shape
    missing_report = df.isnull().sum()

    # Drop columns with too many missing values
    missing_fraction = df.isnull().sum() / len(df)
    cols_to_drop = missing_fraction[missing_fraction > threshold].index
    if len(cols_to_drop) > 0:
        print(f"Dropping columns with >{threshold*100:.0f}% missing: {list(cols_to_drop)}")
        df = df.drop(columns=cols_to_drop)

    # Handle remaining missing values
    for col in df.columns:
        if df[col].isnull().any():
            if strategy == 'auto':
                # Auto-select strategy based on data type
                if pd.api.types.is_numeric_dtype(df[col]):
                    # Numeric: use median (robust to outliers)
                    fill_value = df[col].median()
                    df[col] = df[col].fillna(fill_value)
                    print(f"  {col}: filled {missing_report[col]} values with median ({fill_value:.2f})")
                else:
                    # Categorical: use mode
                    fill_value = df[col].mode()[0]
                    df[col] = df[col].fillna(fill_value)
                    print(f"  {col}: filled {missing_report[col]} values with mode ('{fill_value}')")
            elif strategy == 'mean':
                df[col] = df[col].fillna(df[col].mean())
            elif strategy == 'median':
                df[col] = df[col].fillna(df[col].median())
            elif strategy == 'forward_fill':
                # fillna(method='ffill') is deprecated in recent pandas; use .ffill()
                df[col] = df[col].ffill()
            elif strategy == 'interpolate':
                df[col] = df[col].interpolate(method='linear')
            elif strategy == 'drop':
                df = df.dropna(subset=[col])

    print(f"\nShape after missing value handling: {initial_shape} → {df.shape}")
    return df
```
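For time-ordered data, interpolation usually preserves trends better than a global median. A short usage sketch (`df_ts` is a hypothetical time-series frame, rows assumed to be in time order):

```python
# Cross-sectional data: per-column median/mode imputation
df = handle_missing_values(df, strategy='auto')

# Time series: linear interpolation between neighboring observations
df_ts = handle_missing_values(df_ts, strategy='interpolate')
```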
Step 3: Detect and Handle Outliers
```python
def detect_outliers(df, method='iqr', threshold=3):
    """
    Detect outliers using the IQR or z-score method.

    Parameters:
    - method: 'iqr' (interquartile range, fixed 1.5*IQR fences) or 'zscore'
    - threshold: z-score cutoff (3 is standard); ignored for 'iqr'
    """
    outlier_report = {}
    for col in df.select_dtypes(include=[np.number]).columns:
        if method == 'iqr':
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        elif method == 'zscore':
            z_scores = np.abs((df[col] - df[col].mean()) / df[col].std())
            outliers = df[z_scores > threshold]
        else:
            raise ValueError(f"Unknown method: {method}")

        if len(outliers) > 0:
            outlier_report[col] = {
                'count': len(outliers),
                'percentage': len(outliers) / len(df) * 100,
                'values': outliers[col].tolist()
            }
            print(f"  {col}: {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
    return outlier_report


def handle_outliers(df, method='cap', outlier_report=None):
    """
    Handle detected outliers.

    Parameters:
    - method: 'cap' (winsorize), 'remove', or 'keep'
    """
    if method == 'cap':
        # Winsorize: cap at the 1st and 99th percentiles
        for col in df.select_dtypes(include=[np.number]).columns:
            lower = df[col].quantile(0.01)
            upper = df[col].quantile(0.99)
            df[col] = df[col].clip(lower, upper)
        print("Outliers capped at 1st/99th percentiles")
    elif method == 'remove':
        # Remove rows containing any outlier (use cautiously!)
        initial_len = len(df)
        for col in df.select_dtypes(include=[np.number]).columns:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            df = df[(df[col] >= Q1 - 1.5*IQR) & (df[col] <= Q3 + 1.5*IQR)]
        print(f"Removed {initial_len - len(df)} rows with outliers")
    elif method == 'keep':
        print("Outliers kept (no action taken)")
    return df
```
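Pairing the two functions (report first, then treat) keeps an audit trail for the paper; a usage sketch with the defaults defined above:

```python
# Report outliers before deciding how to treat them
report = detect_outliers(df, method='iqr')

# Winsorizing keeps every row but limits the influence of extremes
df = handle_outliers(df, method='cap', outlier_report=report)
```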
Step 4: Normalize and Standardize
```python
def normalize_data(df, method='standard', columns=None):
    """
    Normalize numeric columns.

    Parameters:
    - method: 'standard' (z-score), 'minmax' (0-1), or 'robust'
    - columns: list of columns to normalize (None = all numeric)
    """
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns

    if method == 'standard':
        scaler = StandardScaler()
    elif method == 'minmax':
        scaler = MinMaxScaler()
    elif method == 'robust':
        scaler = RobustScaler()  # robust to outliers
    else:
        raise ValueError(f"Unknown method: {method}")

    df[columns] = scaler.fit_transform(df[columns])
    print(f"Normalized {len(columns)} columns using {method} scaling")
    return df, scaler
```
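Because `normalize_data` returns the fitted scaler, model outputs can be mapped back to original units for reporting; sklearn scalers expose `inverse_transform` for exactly this. A sketch, assuming the default all-numeric column selection:

```python
# Fit-and-transform a copy so the original units survive
df_norm, scaler = normalize_data(df.copy(), method='standard')

# Recover original units later (e.g., for tables in the paper)
num_cols = df.select_dtypes(include=[np.number]).columns
restored = scaler.inverse_transform(df_norm[num_cols])
```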
Step 5: Data Type Conversion
```python
def fix_data_types(df):
    """Convert columns to appropriate data types."""
    for col in df.columns:
        # Try to convert string dates to datetime
        if df[col].dtype == 'object':
            try:
                df[col] = pd.to_datetime(df[col])
                print(f"  {col}: converted to datetime")
            except (ValueError, TypeError):
                pass  # not a date column; leave as-is

        # Convert object columns with few unique values to category
        if df[col].dtype == 'object':
            n_unique = df[col].nunique()
            if n_unique < 50:  # heuristic threshold
                df[col] = df[col].astype('category')
                print(f"  {col}: converted to category ({n_unique} levels)")
    return df
```
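`fix_data_types` covers dates and categories but not numbers stored as strings, which raw exports often contain. A hedged extension using pandas' `to_numeric`; the helper name and the 90% threshold are assumptions, not part of the original pipeline:

```python
def coerce_numeric_strings(df, min_valid=0.9):
    """Convert object columns that are mostly numeric strings (hypothetical helper)."""
    for col in df.select_dtypes(include=['object']).columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        # Accept the conversion only if at least min_valid of values parsed
        if converted.notna().mean() >= min_valid:
            df[col] = converted
            print(f"  {col}: converted to numeric")
    return df
```

Run this before missing-value handling, since `errors='coerce'` turns unparseable entries into NaN.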
Complete Cleaning Function
```python
def clean_data(filepath, output_path='data/processed.csv',
               missing_strategy='auto', outlier_method='cap',
               normalize=False):
    """
    Complete data cleaning pipeline.

    Returns: cleaned DataFrame and cleaning report
    """
    print("="*60)
    print("DATA CLEANING PIPELINE")
    print("="*60)

    # Load
    print("\n[1/6] Loading data...")
    df = load_data(filepath)
    original_shape = df.shape  # record before any cleaning

    # Missing values
    print("\n[2/6] Handling missing values...")
    df = handle_missing_values(df, strategy=missing_strategy)

    # Data types
    print("\n[3/6] Fixing data types...")
    df = fix_data_types(df)

    # Outliers
    print("\n[4/6] Detecting outliers...")
    outlier_report = detect_outliers(df, method='iqr')

    print("\n[5/6] Handling outliers...")
    df = handle_outliers(df, method=outlier_method)

    # Normalize (optional)
    if normalize:
        print("\n[6/6] Normalizing data...")
        df, scaler = normalize_data(df, method='standard')
    else:
        print("\n[6/6] Skipping normalization")
        scaler = None

    # Save
    df.to_csv(output_path, index=False)
    print(f"\nCleaned data saved to: {output_path}")

    # Generate report (original_shape captured at load time, not after cleaning)
    report = {
        'original_shape': original_shape,
        'final_shape': df.shape,
        'columns': list(df.columns),
        'dtypes': df.dtypes.to_dict(),
        'missing_values': df.isnull().sum().to_dict(),
        'outliers': outlier_report,
        'summary_stats': df.describe().to_dict()
    }

    # Save report alongside the cleaned data
    import json
    with open(output_path.replace('.csv', '_report.json'), 'w') as f:
        json.dump(report, f, indent=2, default=str)

    print("\nCleaning complete!")
    print("="*60)
    return df, report
```
Usage Example
```python
# Clean data with default settings
df_clean, report = clean_data(
    filepath='data/raw_data.csv',
    output_path='data/processed.csv',
    missing_strategy='auto',
    outlier_method='cap',
    normalize=False
)

# Access cleaned data
print(df_clean.head())
print(df_clean.info())
```
Output Files
Save to data/:
- `processed.csv` - Cleaned dataset
- `processed_report.json` - Cleaning statistics and decisions
- `raw_data.csv` - Original data (never modify!)
Quality Checklist
Before using cleaned data:
- No missing values remain (or documented why kept)
- Outliers handled appropriately (not just deleted blindly)
- Data types are correct (dates as datetime, categories as category)
- Column names are clean (no spaces, lowercase)
- Duplicates checked and removed if appropriate (see the helper sketch after this list)
- Ranges are reasonable (no negative ages, etc.)
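Two of the items above (clean column names, duplicates) are not handled by the pipeline itself; a minimal helper sketch, where the function name is an assumption rather than part of the original skill:

```python
def tidy_columns_and_dedupe(df):
    """Hypothetical helper for the naming and duplicate checklist items."""
    # Strip, lowercase, and snake_case the column names
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
    # Drop exact duplicate rows
    before = len(df)
    df = df.drop_duplicates()
    print(f"Removed {before - len(df)} duplicate rows")
    return df
```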
Common Pitfalls
- Deleting too much: Removing outliers can lose real information
- Wrong imputation: Using mean for skewed data (use median)
- No documentation: Must record what cleaning was done
- Normalizing too early: Clean first, normalize last
- Ignoring domain knowledge: Some "outliers" are real (billionaires exist!)
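A two-line illustration of the skewed-imputation pitfall, with synthetic numbers for intuition only:

```python
# One extreme value drags the mean far more than the median
incomes = pd.Series([30_000, 35_000, 40_000, 45_000, 5_000_000])
print(incomes.mean())    # 1030000.0 -> a poor "typical" fill value
print(incomes.median())  # 40000.0   -> robust to the extreme
```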
