Data Preprocessing
by AmnadTaowsoam
Comprehensive guide for data preprocessing patterns in ML, covering data cleaning, feature engineering, normalization, and pipeline creation.
Overview
Data preprocessing is a critical step in machine learning pipelines that transforms raw data into a format suitable for model training. This skill covers data cleaning, feature engineering, normalization, encoding categorical variables, scaling, augmentation, pipeline creation, and preprocessing for different data types.
Prerequisites
- Understanding of Python programming
- Knowledge of pandas and NumPy
- Familiarity with scikit-learn
- Understanding of machine learning concepts
- Basic knowledge of statistics
Key Concepts
Data Cleaning
- Missing Values: Handling null/NaN values through imputation or removal
- Outliers: Detecting and handling extreme values
- Duplicates: Identifying and removing duplicate records
- Data Validation: Ensuring data integrity and consistency
Feature Engineering
- Polynomial Features: Creating higher-order and interaction features
- Date/Time Features: Extracting temporal patterns from datetime columns
- Text Features: Converting text to numerical representations
- Ratio Features: Creating derived features from combinations
Normalization and Scaling
- Standardization: Z-score normalization (mean=0, std=1)
- Min-Max Scaling: Scaling to fixed range [0, 1]
- Robust Scaling: Using median and IQR for outlier-resistant scaling
- Normalization: Scaling individual samples to unit norm
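The four scaling definitions above can be checked by hand on a tiny array. The sketch below is illustrative only: the toy values are an assumption chosen so that one extreme point shows how each method reacts, and it uses plain NumPy rather than the scikit-learn scalers used later in this guide.

```python
import numpy as np

# Toy feature with one extreme value (illustrative data, not from the guide).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: subtract the mean, divide by the population standard deviation.
standardized = (x - x.mean()) / x.std()

# Min-max scaling: map the observed minimum to 0 and the observed maximum to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: center on the median, divide by the interquartile range (IQR).
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# Unit-norm normalization works per sample (row), not per feature (column).
row = np.array([3.0, 4.0])
unit_norm = row / np.linalg.norm(row)

print(standardized, min_max, robust, unit_norm)
```

With min-max scaling the extreme value squeezes the four ordinary points into roughly the bottom 3% of the range, while robust scaling keeps them evenly spaced around 0 and leaves the extreme value visibly far away.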
Encoding Categorical Variables
- Label Encoding: Converting categories to integers
- One-Hot Encoding: Creating binary columns for each category
- Target Encoding: Using target mean for encoding
- Ordinal Encoding: Preserving order in categorical data
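Ordinal encoding is listed above but has no implementation later in this guide, so here is a minimal sketch using scikit-learn's `OrdinalEncoder`. The `size` column, its category order, and the choice of `-1` for unseen values are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column whose categories have a natural order.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Pass the order explicitly; otherwise categories are sorted alphabetically,
# which would put "large" before "medium".
encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # assumed sentinel for categories not seen during fit
)
df["size_ordinal"] = encoder.fit_transform(df[["size"]]).ravel()

# A category that was not seen at fit time maps to the sentinel instead of raising.
unseen = encoder.transform(pd.DataFrame({"size": ["huge"]}))

print(df)
print(unseen)
```

Here `small`, `medium`, and `large` map to 0, 1, and 2, so the integer values carry the intended order.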
Implementation Guide
Data Cleaning
Missing Values
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer


class MissingValueHandler:
    """Handle missing values in datasets."""

    def __init__(self, strategy='mean', numeric_strategy='mean', categorical_strategy='most_frequent'):
        # Note: `strategy` is stored but not used; the per-type strategies drive imputation
        self.strategy = strategy
        self.numeric_strategy = numeric_strategy
        self.categorical_strategy = categorical_strategy
        self.numeric_imputer = None
        self.categorical_imputer = None

    def fit(self, X):
        """Fit imputers on data."""
        if isinstance(X, pd.DataFrame):
            numeric_cols = X.select_dtypes(include=[np.number]).columns
            categorical_cols = X.select_dtypes(exclude=[np.number]).columns
        else:
            # Assume all columns are numeric for numpy arrays
            numeric_cols = list(range(X.shape[1]))
            categorical_cols = []
        if len(numeric_cols) > 0:
            self.numeric_imputer = SimpleImputer(strategy=self.numeric_strategy)
            if isinstance(X, pd.DataFrame):
                self.numeric_imputer.fit(X[numeric_cols])
            else:
                self.numeric_imputer.fit(X[:, numeric_cols])
        if len(categorical_cols) > 0:
            self.categorical_imputer = SimpleImputer(strategy=self.categorical_strategy)
            self.categorical_imputer.fit(X[categorical_cols])
        return self

    def transform(self, X):
        """Transform data using fitted imputers."""
        X_transformed = X.copy()
        if isinstance(X, pd.DataFrame):
            numeric_cols = X.select_dtypes(include=[np.number]).columns
            categorical_cols = X.select_dtypes(exclude=[np.number]).columns
            if self.numeric_imputer is not None and len(numeric_cols) > 0:
                X_transformed[numeric_cols] = self.numeric_imputer.transform(X[numeric_cols])
            if self.categorical_imputer is not None and len(categorical_cols) > 0:
                X_transformed[categorical_cols] = self.categorical_imputer.transform(X[categorical_cols])
        else:
            if self.numeric_imputer is not None:
                X_transformed = self.numeric_imputer.transform(X)
        return X_transformed

    def fit_transform(self, X):
        """Fit and transform in one step."""
        return self.fit(X).transform(X)


# Usage
handler = MissingValueHandler(numeric_strategy='mean', categorical_strategy='most_frequent')
X_clean = handler.fit_transform(X_train)

# KNN imputation for more sophisticated handling (numeric data only)
knn_imputer = KNNImputer(n_neighbors=5)
X_knn = knn_imputer.fit_transform(X_train)
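The Key Concepts list mentions removal as an alternative to imputation, which the handler above does not cover. A minimal pandas sketch follows. The toy frame and the 50% column threshold are assumptions, not recommendations from this guide.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (illustrative data).
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", "y", None, "z"],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Drop columns where more than half of the values are missing (assumed threshold).
df_cols = df.loc[:, missing_share <= 0.5]

# Then drop the rows that still contain any missing value.
df_rows = df_cols.dropna(axis=0, how="any")

print(missing_share)
print(df_rows)
```

Removal is simple but discards information: in this toy frame half of the rows are lost, so it is usually worth comparing against imputation before committing to it.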
Outliers
import numpy as np
import pandas as pd
from scipy import stats


class OutlierHandler:
    """Detect and handle outliers."""

    @staticmethod
    def z_score_detection(X, threshold=3):
        """Detect outliers using Z-score."""
        if isinstance(X, pd.DataFrame):
            numeric_cols = X.select_dtypes(include=[np.number]).columns
            z_scores = np.abs(stats.zscore(X[numeric_cols], nan_policy='omit'))
            outliers = (z_scores > threshold).any(axis=1)
        else:
            z_scores = np.abs(stats.zscore(X, nan_policy='omit'))
            outliers = (z_scores > threshold).any(axis=1)
        return outliers

    @staticmethod
    def iqr_detection(X, multiplier=1.5):
        """Detect outliers using IQR method."""
        if isinstance(X, pd.DataFrame):
            numeric_cols = X.select_dtypes(include=[np.number]).columns
            outliers = pd.Series(False, index=X.index)
            for col in numeric_cols:
                Q1 = X[col].quantile(0.25)
                Q3 = X[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - multiplier * IQR
                upper_bound = Q3 + multiplier * IQR
                outliers |= (X[col] < lower_bound) | (X[col] > upper_bound)
        else:
            Q1 = np.percentile(X, 25, axis=0)
            Q3 = np.percentile(X, 75, axis=0)
            IQR = Q3 - Q1
            lower_bound = Q1 - multiplier * IQR
            upper_bound = Q3 + multiplier * IQR
            outliers = ((X < lower_bound) | (X > upper_bound)).any(axis=1)
        return outliers

    @staticmethod
    def cap_outliers(X, method='iqr', multiplier=1.5):
        """Cap outliers to bounds instead of removing."""
        # Fail early on an unknown method instead of hitting undefined bounds later
        if method not in ('iqr', 'zscore'):
            raise ValueError(f"Unknown capping method: {method}")
        if isinstance(X, pd.DataFrame):
            X_capped = X.copy()
            numeric_cols = X.select_dtypes(include=[np.number]).columns
            for col in numeric_cols:
                if method == 'iqr':
                    Q1 = X[col].quantile(0.25)
                    Q3 = X[col].quantile(0.75)
                    IQR = Q3 - Q1
                    lower_bound = Q1 - multiplier * IQR
                    upper_bound = Q3 + multiplier * IQR
                elif method == 'zscore':
                    # Here `multiplier` is the number of standard deviations
                    mean = X[col].mean()
                    std = X[col].std()
                    lower_bound = mean - multiplier * std
                    upper_bound = mean + multiplier * std
                X_capped[col] = X[col].clip(lower_bound, upper_bound)
            return X_capped
        else:
            if method == 'iqr':
                Q1 = np.percentile(X, 25, axis=0)
                Q3 = np.percentile(X, 75, axis=0)
                IQR = Q3 - Q1
                lower_bound = Q1 - multiplier * IQR
                upper_bound = Q3 + multiplier * IQR
            elif method == 'zscore':
                mean = np.mean(X, axis=0)
                std = np.std(X, axis=0)
                lower_bound = mean - multiplier * std
                upper_bound = mean + multiplier * std
            return np.clip(X, lower_bound, upper_bound)


# Usage
outlier_handler = OutlierHandler()

# Detect outliers
outliers = outlier_handler.z_score_detection(X_train, threshold=3)

# Cap outliers
X_capped = outlier_handler.cap_outliers(X_train, method='iqr')

# Remove outliers (apply the same mask to features and target)
X_clean = X_train[~outliers]
y_clean = y_train[~outliers]
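`cap_outliers` computes its bounds from whatever data it receives, so calling it separately on train and test data would use different bounds. One way to reuse training bounds is a small fit/transform wrapper. The sketch below is an assumption about how that could look; `IQRCapper` is a hypothetical name, not part of this guide or of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class IQRCapper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: learn IQR bounds in fit, clip to them in transform."""

    def __init__(self, multiplier=1.5):
        self.multiplier = multiplier

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # nanpercentile ignores missing values when estimating the quartiles
        q1 = np.nanpercentile(X, 25, axis=0)
        q3 = np.nanpercentile(X, 75, axis=0)
        iqr = q3 - q1
        self.lower_ = q1 - self.multiplier * iqr
        self.upper_ = q3 + self.multiplier * iqr
        return self

    def transform(self, X):
        # Clip every column to the bounds learned from the training data
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)


# Toy data: bounds come from the training split only.
X_train_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test_demo = np.array([[0.0], [3.0], [50.0]])

capper = IQRCapper().fit(X_train_demo)
X_test_capped = capper.transform(X_test_demo)
print(capper.lower_, capper.upper_, X_test_capped.ravel())
```

Because it follows the scikit-learn transformer interface, a class like this can also be placed inside a `Pipeline` step.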
Duplicates
class DuplicateHandler:
    """Handle duplicate rows."""

    @staticmethod
    def find_duplicates(X, subset=None):
        """Find duplicate rows."""
        if isinstance(X, pd.DataFrame):
            duplicates = X.duplicated(subset=subset, keep='first')
        else:
            # For numpy arrays: mark everything except the first occurrence of each row
            _, indices = np.unique(X, axis=0, return_index=True)
            duplicates = np.ones(len(X), dtype=bool)
            duplicates[indices] = False
        return duplicates

    @staticmethod
    def remove_duplicates(X, y=None, subset=None):
        """Remove duplicate rows."""
        if isinstance(X, pd.DataFrame):
            if y is not None:
                # With subset=None the target is part of the comparison, so rows
                # with identical features but different targets are both kept
                df = X.copy()
                df['target'] = y
                df_clean = df.drop_duplicates(subset=subset, keep='first')
                return df_clean.drop(columns=['target']), df_clean['target']
            else:
                return X.drop_duplicates(subset=subset, keep='first')
        else:
            if y is not None:
                combined = np.column_stack([X, y])
                _, indices = np.unique(combined, axis=0, return_index=True)
                # np.unique orders indices by row value; sort to keep the original row order
                indices = np.sort(indices)
                return X[indices], y[indices]
            else:
                _, indices = np.unique(X, axis=0, return_index=True)
                indices = np.sort(indices)
                return X[indices]


# Usage
dup_handler = DuplicateHandler()

# Find duplicates
duplicates = dup_handler.find_duplicates(X_train)

# Remove duplicates
X_clean, y_clean = dup_handler.remove_duplicates(X_train, y_train)
Feature Engineering
Polynomial Features
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer
import numpy as np


class FeatureEngineer:
    """Create new features from existing ones."""

    def __init__(self):
        self.polynomial_features = None

    def create_polynomial_features(self, X, degree=2, include_bias=False):
        """Create polynomial features."""
        self.polynomial_features = PolynomialFeatures(
            degree=degree,
            include_bias=include_bias,
            interaction_only=False
        )
        return self.polynomial_features.fit_transform(X)

    def create_interaction_features(self, X):
        """Create pairwise interaction features."""
        # Positional indexing below needs an array, so convert DataFrames first
        X = np.asarray(X)
        n_features = X.shape[1]
        interaction_features = []
        for i in range(n_features):
            for j in range(i + 1, n_features):
                interaction_features.append(X[:, i] * X[:, j])
        return np.column_stack([X] + interaction_features)

    def create_ratio_features(self, X, pairs):
        """Create ratio features from column pairs."""
        # Work in float so integer inputs do not break the division into `out`
        X = np.asarray(X, dtype=float)
        ratio_features = X.copy()
        for col1, col2 in pairs:
            # Avoid division by zero: rows with a zero denominator get 0.0
            ratio = np.divide(X[:, col1], X[:, col2],
                              out=np.zeros_like(X[:, col1]),
                              where=X[:, col2] != 0)
            ratio_features = np.column_stack([ratio_features, ratio])
        return ratio_features

    def create_bin_features(self, X, bins=10, strategy='uniform'):
        """Create binned features."""
        discretizer = KBinsDiscretizer(
            n_bins=bins,
            encode='onehot',
            strategy=strategy
        )
        return discretizer.fit_transform(X)


# Usage
engineer = FeatureEngineer()

# Polynomial features
X_poly = engineer.create_polynomial_features(X_train, degree=2)

# Interaction features
X_interaction = engineer.create_interaction_features(X_train)

# Ratio features (pairs are positional column indices)
X_ratio = engineer.create_ratio_features(X_train, pairs=[(0, 1), (0, 2)])
Date/Time Features
import numpy as np
import pandas as pd


class DateTimeFeatureEngineer:
    """Extract features from datetime columns."""

    @staticmethod
    def extract_features(X, datetime_cols):
        """Extract features from datetime columns."""
        if isinstance(X, pd.DataFrame):
            X_features = X.copy()
            for col in datetime_cols:
                if col in X.columns:
                    # Convert to datetime if not already
                    if not pd.api.types.is_datetime64_any_dtype(X[col]):
                        X_features[col] = pd.to_datetime(X[col])
                    # Extract features
                    X_features[f'{col}_year'] = X_features[col].dt.year
                    X_features[f'{col}_month'] = X_features[col].dt.month
                    X_features[f'{col}_day'] = X_features[col].dt.day
                    X_features[f'{col}_dayofweek'] = X_features[col].dt.dayofweek
                    X_features[f'{col}_dayofyear'] = X_features[col].dt.dayofyear
                    X_features[f'{col}_weekofyear'] = X_features[col].dt.isocalendar().week
                    X_features[f'{col}_hour'] = X_features[col].dt.hour
                    X_features[f'{col}_minute'] = X_features[col].dt.minute
                    X_features[f'{col}_is_weekend'] = (X_features[col].dt.dayofweek >= 5).astype(int)
                    # These two flags mark the first 7 days and days 24 onward,
                    # not only the first and last calendar day of the month
                    X_features[f'{col}_is_month_start'] = (X_features[col].dt.day <= 7).astype(int)
                    X_features[f'{col}_is_month_end'] = (X_features[col].dt.day >= 24).astype(int)
                    # Cyclical features
                    X_features[f'{col}_month_sin'] = np.sin(2 * np.pi * X_features[col].dt.month / 12)
                    X_features[f'{col}_month_cos'] = np.cos(2 * np.pi * X_features[col].dt.month / 12)
                    X_features[f'{col}_dayofweek_sin'] = np.sin(2 * np.pi * X_features[col].dt.dayofweek / 7)
                    X_features[f'{col}_dayofweek_cos'] = np.cos(2 * np.pi * X_features[col].dt.dayofweek / 7)
            # Drop original datetime columns (names that are absent are skipped)
            X_features = X_features.drop(columns=datetime_cols, errors='ignore')
            return X_features
        else:
            raise ValueError("DateTimeFeatureEngineer only supports pandas DataFrames")


# Usage
dt_engineer = DateTimeFeatureEngineer()
X_dt_features = dt_engineer.extract_features(X_train, datetime_cols=['date_column'])
Text Features
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import numpy as np
import re


class TextFeatureEngineer:
    """Extract features from text columns."""

    def __init__(self):
        self.tfidf_vectorizer = None
        self.count_vectorizer = None

    def create_tfidf_features(self, texts, max_features=1000, ngram_range=(1, 2)):
        """Create TF-IDF features."""
        self.tfidf_vectorizer = TfidfVectorizer(
            max_features=max_features,
            ngram_range=ngram_range,
            stop_words='english',
            lowercase=True
        )
        return self.tfidf_vectorizer.fit_transform(texts)

    def create_count_features(self, texts, max_features=1000, ngram_range=(1, 1)):
        """Create count features."""
        self.count_vectorizer = CountVectorizer(
            max_features=max_features,
            ngram_range=ngram_range,
            stop_words='english',
            lowercase=True
        )
        return self.count_vectorizer.fit_transform(texts)

    def create_basic_features(self, texts):
        """Create basic text features."""
        features = []
        for text in texts:
            # Length features
            features.append([
                len(text),  # Character count
                len(text.split()),  # Word count
                len(text.splitlines()),  # Line count (splitlines counts lines, not sentences)
                sum(1 for c in text if c.isupper()),  # Uppercase count
                sum(1 for c in text if c.islower()),  # Lowercase count
                sum(1 for c in text if c.isdigit()),  # Digit count
                sum(1 for c in text if c in '.,!?;:'),  # Punctuation count
                text.count(' '),  # Space count
                len(set(text.split())) / max(len(text.split()), 1)  # Unique word ratio
            ])
        return np.array(features)

    def clean_text(self, text):
        """Clean text."""
        # Remove special characters (this also removes digits)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        # Convert to lowercase
        text = text.lower()
        # Remove extra whitespace
        text = ' '.join(text.split())
        return text


# Usage
text_engineer = TextFeatureEngineer()

# TF-IDF features
X_tfidf = text_engineer.create_tfidf_features(text_data)

# Basic features
X_basic = text_engineer.create_basic_features(text_data)
Data Normalization/Standardization
Standardization
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler


class DataScaler:
    """Scale and normalize data."""

    def __init__(self, method='standard'):
        self.method = method
        self.scaler = None
        if method == 'standard':
            self.scaler = StandardScaler()
        elif method == 'minmax':
            self.scaler = MinMaxScaler()
        elif method == 'robust':
            self.scaler = RobustScaler()
        elif method == 'maxabs':
            self.scaler = MaxAbsScaler()
        else:
            raise ValueError(f"Unknown scaling method: {method}")

    def fit(self, X):
        """Fit scaler on data."""
        self.scaler.fit(X)
        return self

    def transform(self, X):
        """Transform data using fitted scaler."""
        return self.scaler.transform(X)

    def fit_transform(self, X):
        """Fit and transform in one step."""
        return self.scaler.fit_transform(X)

    def inverse_transform(self, X):
        """Inverse transform scaled data."""
        return self.scaler.inverse_transform(X)


# Usage
scaler = DataScaler(method='standard')
X_scaled = scaler.fit_transform(X_train)

# Transform test data with the statistics learned from the training data
X_test_scaled = scaler.transform(X_test)
Normalization
from sklearn.preprocessing import Normalizer


class DataNormalizer:
    """Normalize samples individually."""

    def __init__(self, norm='l2'):
        self.norm = norm
        self.normalizer = Normalizer(norm=norm)

    def fit(self, X):
        """Fit normalizer (no-op for Normalizer)."""
        return self

    def transform(self, X):
        """Transform data using normalizer."""
        return self.normalizer.transform(X)

    def fit_transform(self, X):
        """Fit and transform in one step."""
        return self.normalizer.fit_transform(X)


# Usage
normalizer = DataNormalizer(norm='l2')
X_normalized = normalizer.fit_transform(X_train)
Encoding Categorical Variables
Label Encoding
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd


class CategoricalEncoder:
    """Encode categorical variables."""

    def __init__(self, method='label'):
        self.method = method
        self.encoders = {}
        self.onehot_encoder = None
        self.categorical_cols = []

    def fit(self, X, categorical_cols=None):
        """Fit encoders on data."""
        if isinstance(X, pd.DataFrame):
            if categorical_cols is None:
                categorical_cols = X.select_dtypes(include=['object', 'category']).columns
            # Remember the columns so transform can find them for both methods
            self.categorical_cols = list(categorical_cols)
            if self.method == 'label':
                for col in self.categorical_cols:
                    encoder = LabelEncoder()
                    encoder.fit(X[col].astype(str))
                    self.encoders[col] = encoder
            elif self.method == 'onehot' and self.categorical_cols:
                # Fit one encoder on all categorical columns at once
                # (the sparse_output argument requires scikit-learn >= 1.2)
                self.onehot_encoder = OneHotEncoder(
                    sparse_output=False,
                    handle_unknown='ignore'
                )
                self.onehot_encoder.fit(X[self.categorical_cols])
        else:
            raise ValueError("CategoricalEncoder only supports pandas DataFrames")
        return self

    def transform(self, X):
        """Transform data using fitted encoders."""
        if isinstance(X, pd.DataFrame):
            X_transformed = X.copy()
            if self.method == 'label':
                # LabelEncoder raises on categories that were not seen during fit
                for col, encoder in self.encoders.items():
                    X_transformed[col] = encoder.transform(X[col].astype(str))
            elif self.method == 'onehot' and self.onehot_encoder is not None:
                categorical_cols = self.categorical_cols
                onehot_features = self.onehot_encoder.transform(X[categorical_cols])
                feature_names = self.onehot_encoder.get_feature_names_out(categorical_cols)
                # Drop original categorical columns
                X_transformed = X_transformed.drop(columns=categorical_cols)
                # Add one-hot encoded columns
                onehot_df = pd.DataFrame(onehot_features, columns=feature_names, index=X.index)
                X_transformed = pd.concat([X_transformed, onehot_df], axis=1)
            return X_transformed
        else:
            raise ValueError("CategoricalEncoder only supports pandas DataFrames")

    def fit_transform(self, X, categorical_cols=None):
        """Fit and transform in one step."""
        return self.fit(X, categorical_cols).transform(X)


# Usage
encoder = CategoricalEncoder(method='label')
X_encoded = encoder.fit_transform(X_train, categorical_cols=['category_col'])

# One-hot encoding
encoder_onehot = CategoricalEncoder(method='onehot')
X_onehot = encoder_onehot.fit_transform(X_train, categorical_cols=['category_col'])
Target Encoding
import numpy as np
import pandas as pd


class TargetEncoder:
    """Target encoding for categorical variables."""

    def __init__(self, smoothing=1.0, min_samples_leaf=1):
        self.smoothing = smoothing
        self.min_samples_leaf = min_samples_leaf
        self.encodings = {}
        self.global_mean = None

    def fit(self, X, y, categorical_cols=None):
        """Fit target encoder."""
        if isinstance(X, pd.DataFrame):
            if categorical_cols is None:
                categorical_cols = X.select_dtypes(include=['object', 'category']).columns
            # groupby below needs a Series aligned with X
            if not isinstance(y, pd.Series):
                y = pd.Series(y, index=X.index)
            self.global_mean = y.mean()
            for col in categorical_cols:
                # Calculate mean target per category
                category_means = y.groupby(X[col]).mean()
                category_counts = X[col].value_counts()
                # Apply smoothing: rare categories are pulled toward the global mean
                smoothing_factor = 1 / (1 + np.exp(-(category_counts - self.min_samples_leaf) / self.smoothing))
                smoothed_means = self.global_mean * (1 - smoothing_factor) + category_means * smoothing_factor
                self.encodings[col] = smoothed_means
        else:
            raise ValueError("TargetEncoder only supports pandas DataFrames")
        return self

    def transform(self, X):
        """Transform data using fitted encoder."""
        if isinstance(X, pd.DataFrame):
            X_transformed = X.copy()
            for col, encoding in self.encodings.items():
                # Unseen categories fall back to the global mean
                X_transformed[f'{col}_encoded'] = X[col].map(encoding).fillna(self.global_mean)
            return X_transformed
        else:
            raise ValueError("TargetEncoder only supports pandas DataFrames")

    def fit_transform(self, X, y, categorical_cols=None):
        """Fit and transform in one step."""
        # Note: each training row is encoded with a mean that includes its own target
        return self.fit(X, y, categorical_cols).transform(X)


# Usage
target_encoder = TargetEncoder(smoothing=1.0, min_samples_leaf=10)
X_encoded = target_encoder.fit_transform(X_train, y_train, categorical_cols=['category_col'])
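When the encoder above is fitted and applied to the same training rows, each row's encoding includes its own target value, which can leak label information into the feature. A common mitigation is to encode each row from the other folds only. The sketch below shows that idea under stated assumptions: `out_of_fold_target_encode` is a hypothetical helper, it omits the smoothing used above, and the toy data is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(categories, y, n_splits=5, seed=42):
    """Hypothetical helper: encode each row with target means from the other folds."""
    categories = pd.Series(categories).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True).astype(float)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(categories):
        # Category means computed without the rows being encoded
        fold_means = y.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        # Categories absent from the fitting folds fall back to the fold's overall mean
        fallback = y.iloc[fit_idx].mean()
        encoded.iloc[enc_idx] = (
            categories.iloc[enc_idx].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded


# Toy data (illustrative only)
cats = ["a", "a", "a", "b", "b", "b", "c", "c"]
target = [1, 0, 1, 0, 0, 1, 1, 1]
encoded = out_of_fold_target_encode(cats, target, n_splits=4)
print(encoded)
```

For test data there is no target to leak, so an encoder fitted on the full training set, such as the `TargetEncoder` class above, can be applied with `transform` as usual.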
Feature Scaling
Min-Max Scaling
import numpy as np


class MinMaxScalerCustom:
    """Custom Min-Max scaling."""

    def __init__(self, feature_range=(0, 1)):
        self.feature_range = feature_range
        self.min_ = None
        self.max_ = None
        self.scale_ = None

    def fit(self, X):
        """Fit scaler on data."""
        self.min_ = np.min(X, axis=0)
        self.max_ = np.max(X, axis=0)
        data_range = self.max_ - self.min_
        data_range[data_range == 0] = 1  # Avoid division by zero for constant columns
        self.scale_ = (self.feature_range[1] - self.feature_range[0]) / data_range
        return self

    def transform(self, X):
        """Transform data."""
        X_scaled = (X - self.min_) * self.scale_
        X_scaled += self.feature_range[0]
        return X_scaled

    def fit_transform(self, X):
        """Fit and transform in one step."""
        return self.fit(X).transform(X)


# Usage
scaler = MinMaxScalerCustom(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_train)
Robust Scaling
import numpy as np


class RobustScalerCustom:
    """Robust scaling using median and IQR."""

    def __init__(self, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0)):
        self.with_centering = with_centering
        self.with_scaling = with_scaling
        self.quantile_range = quantile_range
        self.center_ = None
        self.scale_ = None

    def fit(self, X):
        """Fit scaler on data."""
        if self.with_centering:
            self.center_ = np.median(X, axis=0)
        if self.with_scaling:
            q_min, q_max = self.quantile_range
            q1 = np.percentile(X, q_min, axis=0)
            q3 = np.percentile(X, q_max, axis=0)
            iqr = q3 - q1
            iqr[iqr == 0] = 1  # Avoid division by zero
            self.scale_ = iqr
        return self

    def transform(self, X):
        """Transform data."""
        # Copy into a float array so the in-place operations also work for integer input
        X_scaled = np.array(X, dtype=float)
        if self.with_centering:
            X_scaled -= self.center_
        if self.with_scaling:
            X_scaled /= self.scale_
        return X_scaled

    def fit_transform(self, X):
        """Fit and transform in one step."""
        return self.fit(X).transform(X)


# Usage
robust_scaler = RobustScalerCustom()
X_scaled = robust_scaler.fit_transform(X_train)
Pipeline Creation
Scikit-learn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer


def create_preprocessing_pipeline(numeric_features, categorical_features):
    """Create sklearn preprocessing pipeline."""
    # Numeric preprocessing
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])
    # Categorical preprocessing
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    # Column transformer
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )
    return preprocessor


# Usage
numeric_features = ['age', 'income', 'score']
categorical_features = ['gender', 'education', 'city']
preprocessor = create_preprocessing_pipeline(numeric_features, categorical_features)

# Fit and transform
X_processed = preprocessor.fit_transform(X_train)

# Transform test data
X_test_processed = preprocessor.transform(X_test)
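A preprocessor like the one above can also be chained with an estimator so that cross-validation refits the imputers, scaler, and encoder inside each training fold. The sketch below is one possible setup, not the guide's prescribed workflow: the synthetic data, the column names, and the choice of `LogisticRegression` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (illustrative only): the label depends on age.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "income": rng.normal(50_000, 15_000, size=n),
    "city": rng.choice(["north", "south", "east"], size=n),
})
demo.loc[::10, "income"] = np.nan  # inject some missing values
labels = (demo["age"] > 40).astype(int)

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

# Preprocessing and model in one estimator: each CV fold fits the
# preprocessing on its own training portion only.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(model, demo, labels, cv=5)
print(scores.mean())
```

Keeping the preprocessing inside the estimator means the held-out fold never influences the imputation values or scaling statistics.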
Custom Pipeline
from sklearn.base import BaseEstimator, TransformerMixin


class CustomPreprocessor(BaseEstimator, TransformerMixin):
    """Custom preprocessing pipeline."""

    def __init__(self, steps=None):
        self.steps = steps or []

    def add_step(self, name, transformer):
        """Add preprocessing step."""
        self.steps.append((name, transformer))
        return self

    def fit(self, X, y=None):
        """Fit all transformers, each on the output of the previous step."""
        self.fit_transform(X, y)
        return self

    def transform(self, X):
        """Transform data through all steps."""
        X_transformed = X.copy()
        for name, transformer in self.steps:
            X_transformed = transformer.transform(X_transformed)
        return X_transformed

    def fit_transform(self, X, y=None):
        """Fit and transform in one pass, chaining step outputs."""
        X_transformed = X.copy()
        for name, transformer in self.steps:
            # The handlers defined earlier take fit(X) only, so y is not forwarded;
            # target-dependent steps such as TargetEncoder need separate handling
            X_transformed = transformer.fit(X_transformed).transform(X_transformed)
        return X_transformed


# Usage: encode before scaling so the scaler only sees numeric columns
preprocessor = CustomPreprocessor()
preprocessor.add_step('missing_values', MissingValueHandler())
preprocessor.add_step('encoder', CategoricalEncoder(method='label'))
preprocessor.add_step('scaler', DataScaler(method='standard'))
X_processed = preprocessor.fit_transform(X_train)
Preprocessing for Different Data Types
Image Preprocessing
import torch
import torchvision.transforms as transforms
from PIL import Image
import numpy as np


class ImagePreprocessor:
    """Preprocess images for ML."""

    def __init__(self, image_size=(224, 224), normalize=True, augment=False):
        self.image_size = image_size
        self.normalize = normalize
        self.augment = augment
        # Base transforms
        transform_list = [
            transforms.Resize(image_size),
            transforms.ToTensor()
        ]
        # Add normalization (ImageNet channel statistics)
        if normalize:
            transform_list.append(
                transforms.Normalize(
                    mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]
                )
            )
        # Add augmentation between Resize and ToTensor
        if augment:
            transform_list.insert(1, transforms.RandomHorizontalFlip(p=0.5))
            transform_list.insert(2, transforms.RandomRotation(degrees=15))
            transform_list.insert(3, transforms.ColorJitter(
                brightness=0.2, contrast=0.2, saturation=0.2
            ))
        self.transform = transforms.Compose(transform_list)

    def preprocess(self, image):
        """Preprocess single image."""
        if isinstance(image, str):
            image = Image.open(image).convert('RGB')
        elif isinstance(image, np.ndarray):
            image = Image.fromarray(image)
        return self.transform(image)

    def preprocess_batch(self, images):
        """Preprocess batch of images."""
        return torch.stack([self.preprocess(img) for img in images])


# Usage
preprocessor = ImagePreprocessor(image_size=(224, 224), augment=True)
processed_image = preprocessor.preprocess("image.jpg")
Text Preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


class TextPreprocessor:
    """Preprocess text for ML."""

    def __init__(self, remove_stopwords=True, lemmatize=True, lowercase=True):
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        self.lowercase = lowercase
        if remove_stopwords:
            nltk.download('stopwords')
            self.stop_words = set(stopwords.words('english'))
        if lemmatize:
            nltk.download('wordnet')
            self.lemmatizer = WordNetLemmatizer()

    def clean_text(self, text):
        """Clean text."""
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Remove email addresses
        text = re.sub(r'\S*@\S*', '', text)
        # Remove special characters
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        # Remove extra whitespace
        text = ' '.join(text.split())
        return text

    def tokenize(self, text):
        """Tokenize text."""
        return text.split()

    def preprocess(self, text):
        """Complete preprocessing pipeline."""
        # Clean text
        text = self.clean_text(text)
        # Lowercase
        if self.lowercase:
            text = text.lower()
        # Tokenize
        tokens = self.tokenize(text)
        # Remove stopwords
        if self.remove_stopwords:
            tokens = [token for token in tokens if token not in self.stop_words]
        # Lemmatize
        if self.lemmatize:
            tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        return ' '.join(tokens)


# Usage
preprocessor = TextPreprocessor(remove_stopwords=True, lemmatize=True)
processed_text = preprocessor.preprocess("This is a sample text for preprocessing!")
Tabular Preprocessing
import pandas as pd
import numpy as np


class TabularPreprocessor:
    """Preprocess tabular data for ML."""

    def __init__(self):
        self.numeric_features = None
        self.categorical_features = None
        self.datetime_features = None

    def identify_features(self, X):
        """Identify feature types."""
        if isinstance(X, pd.DataFrame):
            self.numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
            self.categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()
            self.datetime_features = X.select_dtypes(include=['datetime64']).columns.tolist()
        return {
            'numeric': self.numeric_features,
            'categorical': self.categorical_features,
            'datetime': self.datetime_features
        }

    def preprocess(self, X, handle_missing=True, scale=True, encode=True):
        """Complete preprocessing pipeline (fits on X, so use it on training data)."""
        X_processed = X.copy()
        # Identify features
        feature_types = self.identify_features(X)
        # Handle missing values
        if handle_missing:
            handler = MissingValueHandler()
            X_processed = handler.fit_transform(X_processed)
        # Encode categorical variables
        if encode and self.categorical_features:
            encoder = CategoricalEncoder(method='onehot')
            X_processed = encoder.fit_transform(X_processed, self.categorical_features)
        # Scale numeric features
        if scale:
            scaler = DataScaler(method='standard')
            if isinstance(X_processed, pd.DataFrame):
                if self.numeric_features:
                    # Scale only the original numeric columns, not the one-hot indicators
                    X_processed[self.numeric_features] = scaler.fit_transform(
                        X_processed[self.numeric_features]
                    )
            else:
                X_processed = scaler.fit_transform(X_processed)
        return X_processed


# Usage
preprocessor = TabularPreprocessor()
X_processed = preprocessor.preprocess(X_train)
Reproducibility
Random Seed Setting
import random
import numpy as np
import torch


def set_seed(seed=42):
    """Set random seed for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Usage
set_seed(42)
Deterministic Preprocessing
class DeterministicPreprocessor:
    """Deterministic preprocessing for reproducibility."""

    def __init__(self, seed=42):
        self.seed = seed
        set_seed(seed)

    def train_test_split(self, X, y, test_size=0.2, random_state=None):
        """Deterministic, stratified train-test split."""
        from sklearn.model_selection import train_test_split
        return train_test_split(
            X, y,
            test_size=test_size,
            # Compare against None so that random_state=0 is respected
            random_state=self.seed if random_state is None else random_state,
            stratify=y
        )

    def kfold_split(self, X, y, n_splits=5, random_state=None):
        """Deterministic K-Fold split."""
        from sklearn.model_selection import StratifiedKFold
        kfold = StratifiedKFold(
            n_splits=n_splits,
            shuffle=True,
            random_state=self.seed if random_state is None else random_state
        )
        return kfold.split(X, y)


# Usage
preprocessor = DeterministicPreprocessor(seed=42)
X_train, X_test, y_train, y_test = preprocessor.train_test_split(X, y)
Testing Preprocessing
Unit Tests
import unittest
import pandas as pd
import numpy as np


class TestPreprocessing(unittest.TestCase):
    """Unit tests for preprocessing."""

    def setUp(self):
        """Set up test data."""
        self.X = pd.DataFrame({
            'numeric': [1, 2, 3, 4, 5],
            'categorical': ['A', 'B', 'A', 'B', 'A'],
            'missing': [1, np.nan, 3, np.nan, 5]
        })

    def test_missing_value_handler(self):
        """Test missing value handler."""
        handler = MissingValueHandler()
        X_clean = handler.fit_transform(self.X)
        # Check no missing values
        self.assertFalse(X_clean.isnull().any().any())

    def test_data_scaler(self):
        """Test data scaler."""
        scaler = DataScaler(method='standard')
        X_scaled = scaler.fit_transform(self.X[['numeric']])
        # Check mean is approximately 0
        self.assertAlmostEqual(X_scaled.mean(), 0, places=5)
        # Check std is approximately 1
        self.assertAlmostEqual(X_scaled.std(), 1, places=5)

    def test_categorical_encoder(self):
        """Test categorical encoder."""
        encoder = CategoricalEncoder(method='label')
        X_encoded = encoder.fit_transform(self.X, ['categorical'])
        # Check categorical column is numeric
        self.assertTrue(pd.api.types.is_numeric_dtype(X_encoded['categorical']))


if __name__ == '__main__':
    unittest.main()
Best Practices
1. Understand Your Data
   - Perform exploratory data analysis (EDA)
   - Check data types and distributions
   - Identify missing values and outliers
   - Understand feature relationships
2. Handle Missing Values Appropriately
   - Use mean/median imputation for numeric data
   - Use most frequent imputation for categorical data
   - Consider KNN imputation for complex patterns
   - Document imputation strategy
3. Detect and Handle Outliers
   - Use Z-score for normally distributed data
   - Use IQR for non-normal distributions
   - Consider domain knowledge before removing outliers
   - Cap outliers instead of removing when appropriate
4. Choose Right Encoding Method
   - Ordinal (integer) encoding with an explicit category order for ordinal variables
   - One-hot encoding for nominal variables
   - Target encoding for high-cardinality categorical variables
   - Avoid creating too many one-hot features
5. Scale Features Consistently
   - Fit scalers on training data only
   - Use same scaler for train and test data
   - Consider robust scaling for data with outliers
   - Store scaler parameters for inference
6. Create Reusable Pipelines
   - Build modular preprocessing steps
   - Use sklearn Pipeline for reproducibility
   - Document each preprocessing step
   - Version control preprocessing code
7. Ensure Reproducibility
   - Set random seeds consistently
   - Use deterministic algorithms when possible
   - Save preprocessing parameters
   - Log preprocessing steps
8. Test Preprocessing
   - Write unit tests for preprocessing functions
   - Validate output shapes and types
   - Check for data leakage
   - Monitor preprocessing performance
9. Handle Different Data Types
   - Use appropriate preprocessing for each data type
   - Consider multimodal data preprocessing
   - Preserve information during transformation
   - Document data type handling
10. Monitor and Iterate
    - Track preprocessing impact on model performance
    - A/B test different preprocessing strategies
    - Monitor preprocessing pipeline performance
    - Continuously improve based on results
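The advice to store scaler parameters and save preprocessing parameters can be followed by persisting the fitted object itself. The sketch below uses `joblib`, which is installed alongside scikit-learn. The file name and the temporary directory are assumptions for the example; in practice the artifact would live next to the model and be versioned with it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit on training data only (toy values for illustration).
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler().fit(X_fit)

# Persist the fitted scaler and load it back, as an inference service would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")  # hypothetical artifact name
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# The restored object applies the training-time mean and scale to new data.
X_new = np.array([[2.0, 20.0]])
restored_output = restored.transform(X_new)
print(restored.mean_, restored_output)
```

Loading an artifact with a different scikit-learn version than the one that saved it is not guaranteed to work, so recording the library version with the artifact is a sensible precaution.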
