Data Profiler
by benchflow-ai
Profile datasets to understand schema, quality, and characteristics. Use when analyzing data files (CSV, JSON, Parquet), discovering dataset properties, assessing data quality, or when user mentions data profiling, schema detection, data analysis, or quality metrics. Provides basic and intermediate profiling including distributions, uniqueness, and pattern detection.
name: data-profiler
description: Profile datasets to understand schema, quality, and characteristics. Use when analyzing data files (CSV, JSON, Parquet), discovering dataset properties, assessing data quality, or when user mentions data profiling, schema detection, data analysis, or quality metrics. Provides basic and intermediate profiling including distributions, uniqueness, and pattern detection.
version: 1.0.0
Data Profiler Skill
Overview
This skill provides comprehensive data profiling capabilities at Basic and Intermediate levels. It analyzes datasets to extract schema information, statistical summaries, data quality metrics, distributions, uniqueness characteristics, and pattern detection.
Profiling Levels
Basic Level
- Row count and column count
- Column names and data types
- Null/missing value counts and percentages
- Min, max, mean, median for numeric columns
- String length statistics for text columns
Intermediate Level
- Distributions: Value frequencies, histograms, percentiles
- Uniqueness: Distinct counts, cardinality, duplicate detection
- Pattern Detection: Date/time formats, regex patterns, data format identification
- Quality Metrics: Completeness scores, consistency indicators
Supported Formats
- CSV (Comma-Separated Values)
- JSON (JSON Lines and standard JSON)
- Parquet (Apache Parquet columnar format)
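Each of these formats loads into a pandas DataFrame before profiling. A minimal loading sketch (file names are placeholders, and JSON Lines is assumed for the .jsonl example):

import pandas as pd

# File names below are placeholders
df_csv = pd.read_csv("data.csv")
df_jsonl = pd.read_json("data.jsonl", lines=True)  # JSON Lines; drop lines=True for standard JSON
df_parquet = pd.read_parquet("data.parquet")       # requires pyarrow or fastparquet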
When to Use
- User requests data profiling or data analysis
- Need to understand dataset structure and quality
- Discovering schema for unknown datasets
- Assessing data quality before ingestion
- Identifying data issues (nulls, duplicates, anomalies)
- Creating dataset specifications or documentation
Core Profiling Functions
1. Schema Detection
import pandas as pd

def detect_schema(df: pd.DataFrame) -> dict:
    """
    Detect schema information from DataFrame.

    Returns:
        Dictionary with column names, types, nullable status
    """
    schema = {'columns': []}
    row_count = len(df)
    for col in df.columns:
        null_count = int(df[col].isnull().sum())
        col_info = {
            'name': col,
            'type': str(df[col].dtype),
            'nullable': bool(df[col].isnull().any()),
            'null_count': null_count,
            'null_percentage': float(null_count / row_count * 100) if row_count > 0 else 0.0
        }
        # Infer semantic type; check bool before numeric because bool dtypes also pass is_numeric_dtype
        if pd.api.types.is_bool_dtype(df[col]):
            col_info['semantic_type'] = 'boolean'
        elif pd.api.types.is_numeric_dtype(df[col]):
            col_info['semantic_type'] = 'numeric'
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            col_info['semantic_type'] = 'datetime'
        else:
            col_info['semantic_type'] = 'string'
        schema['columns'].append(col_info)
    return schema
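For a quick sanity check, here is an illustrative call on a small hypothetical DataFrame (column names and values are made up for the example):

sample = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003'],
    'revenue': [120.5, None, 98.0],
})
for column in detect_schema(sample)['columns']:
    print(column['name'], column['semantic_type'], f"{column['null_percentage']:.1f}% null")
# customer_id string 0.0% null
# revenue numeric 33.3% null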
2. Basic Statistics
def get_basic_statistics(df: pd.DataFrame) -> dict:
    """
    Get basic statistical summary.

    Returns:
        Dictionary with row count, column count, and column stats
    """
    stats = {
        'row_count': len(df),
        'column_count': len(df.columns),
        'columns': {}
    }
    for col in df.columns:
        col_stats = {}
        if pd.api.types.is_numeric_dtype(df[col]):
            has_values = not df[col].isna().all()
            col_stats = {
                'min': float(df[col].min()) if has_values else None,
                'max': float(df[col].max()) if has_values else None,
                'mean': float(df[col].mean()) if has_values else None,
                'median': float(df[col].median()) if has_values else None,
                'std': float(df[col].std()) if has_values else None
            }
        elif pd.api.types.is_string_dtype(df[col]) or df[col].dtype == object:
            non_null = df[col].dropna()
            if len(non_null) > 0:
                lengths = non_null.astype(str).str.len()
                col_stats = {
                    'min_length': int(lengths.min()),
                    'max_length': int(lengths.max()),
                    'avg_length': float(lengths.mean())
                }
        stats['columns'][col] = col_stats
    return stats
3. Distribution Analysis
def analyze_distributions(df: pd.DataFrame, max_unique: int = 50) -> dict:
    """
    Analyze value distributions for each column.

    Args:
        df: DataFrame to analyze
        max_unique: Max unique values to show frequencies for

    Returns:
        Dictionary with distribution info per column
    """
    distributions = {}
    for col in df.columns:
        dist_info = {}
        # Value counts (NaN excluded by default)
        value_counts = df[col].value_counts()
        unique_count = len(value_counts)
        dist_info['unique_count'] = unique_count
        dist_info['unique_percentage'] = (unique_count / len(df)) * 100 if len(df) > 0 else 0.0
        # Show top values if not too many unique
        if unique_count <= max_unique:
            dist_info['value_frequencies'] = {
                str(k): int(v) for k, v in value_counts.head(20).items()
            }
        # Percentiles for numeric columns
        if pd.api.types.is_numeric_dtype(df[col]):
            percentiles = df[col].quantile([0.25, 0.50, 0.75, 0.90, 0.95, 0.99])
            dist_info['percentiles'] = {
                f'p{int(p * 100)}': float(v) for p, v in percentiles.items()
            }
        distributions[col] = dist_info
    return distributions
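The Intermediate level also lists histograms, which the function above does not produce. A minimal sketch of an equal-width histogram helper using numpy.histogram (the default of 10 bins is an arbitrary assumption):

import numpy as np

def numeric_histogram(series: pd.Series, bins: int = 10) -> dict:
    """Equal-width histogram for a numeric column, ignoring nulls."""
    if not pd.api.types.is_numeric_dtype(series):
        return {}
    values = series.dropna()
    if values.empty:
        return {}
    counts, edges = np.histogram(values, bins=bins)
    return {
        'bin_edges': [float(e) for e in edges],
        'bin_counts': [int(c) for c in counts]
    }

If histogram output is wanted in the profile, this could be called per numeric column inside analyze_distributions.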
4. Uniqueness Analysis
def analyze_uniqueness(df: pd.DataFrame) -> dict:
    """
    Analyze uniqueness characteristics of columns.

    Returns:
        Dictionary with uniqueness metrics per column
    """
    uniqueness = {}
    for col in df.columns:
        unique_info = {}
        total_count = len(df[col])
        non_null_count = int(df[col].notna().sum())
        unique_count = int(df[col].nunique())
        # Count duplicates among non-null values (nunique ignores nulls)
        duplicate_count = non_null_count - unique_count
        unique_info['distinct_count'] = unique_count
        unique_info['duplicate_count'] = duplicate_count
        unique_info['uniqueness_ratio'] = unique_count / total_count if total_count > 0 else 0
        unique_info['is_unique_key'] = (unique_count == non_null_count)
        unique_info['has_duplicates'] = (duplicate_count > 0)
        # Find duplicated values (top 10)
        if duplicate_count > 0:
            duplicates = df[col].value_counts()
            duplicates = duplicates[duplicates > 1].head(10)
            unique_info['top_duplicates'] = {
                str(k): int(v) for k, v in duplicates.items()
            }
        uniqueness[col] = unique_info
    return uniqueness
5. Pattern Detection
import re

def detect_patterns(df: pd.DataFrame) -> dict:
    """
    Detect common patterns in string columns.

    Returns:
        Dictionary with detected patterns per column
    """
    patterns = {}
    # Common regex patterns
    pattern_regexes = {
        'email': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
        'phone_us': r'^\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$',
        'url': r'^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b',
        'ipv4': r'^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$',
        'date_iso': r'^\d{4}-\d{2}-\d{2}',
        'date_us': r'^\d{1,2}\/\d{1,2}\/\d{2,4}$',
        'uuid': r'^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$',
        'numeric': r'^\d+(\.\d+)?$',
        'alphanumeric': r'^[a-zA-Z0-9]+$'
    }
    for col in df.columns:
        if pd.api.types.is_string_dtype(df[col]) or df[col].dtype == object:
            col_patterns = {}
            non_null = df[col].dropna().astype(str)
            if len(non_null) == 0:
                continue
            # Test each pattern against every non-null value
            for pattern_name, pattern_regex in pattern_regexes.items():
                matches = non_null.str.match(pattern_regex, flags=re.IGNORECASE)
                match_count = int(matches.sum())
                match_percentage = (match_count / len(non_null)) * 100
                if match_percentage > 50:  # If >50% match, consider it a pattern
                    col_patterns[pattern_name] = {
                        'match_count': match_count,
                        'match_percentage': float(match_percentage)
                    }
            if col_patterns:
                patterns[col] = col_patterns
    return patterns
6. Data Quality Metrics
def calculate_quality_metrics(df: pd.DataFrame) -> dict:
    """
    Calculate overall data quality metrics.

    Returns:
        Dictionary with quality scores and metrics
    """
    quality = {
        'overall_completeness': 0.0,
        'columns': {}
    }
    total_cells = len(df) * len(df.columns)
    non_null_cells = int(df.notna().sum().sum())
    quality['overall_completeness'] = (non_null_cells / total_cells) * 100 if total_cells > 0 else 0
    for col in df.columns:
        col_quality = {}
        # Completeness
        total_count = len(df[col])
        non_null_count = int(df[col].notna().sum())
        col_quality['completeness'] = (non_null_count / total_count) * 100 if total_count > 0 else 0
        # Consistency (e.g., no mixed types)
        if pd.api.types.is_numeric_dtype(df[col]):
            col_quality['type_consistency'] = 100.0  # All numeric
        elif pd.api.types.is_string_dtype(df[col]) or df[col].dtype == object:
            # Check whether all non-null values share the same Python type
            non_null = df[col].dropna()
            if len(non_null) > 0:
                types = non_null.apply(type)
                col_quality['type_consistency'] = float((types.value_counts().max() / len(non_null)) * 100)
        quality['columns'][col] = col_quality
    return quality
Complete Profiling Example
import pandas as pd
from typing import Dict, Any

def profile_dataset(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Complete dataset profiling with all metrics.

    Args:
        df: pandas DataFrame to profile

    Returns:
        Comprehensive profiling results dictionary
    """
    profile = {
        'metadata': {
            'row_count': len(df),
            'column_count': len(df.columns),
            'memory_usage_mb': float(df.memory_usage(deep=True).sum() / (1024 * 1024))
        },
        'schema': detect_schema(df),
        'statistics': get_basic_statistics(df),
        'distributions': analyze_distributions(df),
        'uniqueness': analyze_uniqueness(df),
        'patterns': detect_patterns(df),
        'quality': calculate_quality_metrics(df)
    }
    return profile

# Usage example
if __name__ == "__main__":
    # Load dataset
    df = pd.read_csv("data.csv")

    # Profile
    profile_results = profile_dataset(df)

    # Print summary
    print("Dataset Profile:")
    print(f"  Rows: {profile_results['metadata']['row_count']}")
    print(f"  Columns: {profile_results['metadata']['column_count']}")
    print(f"  Completeness: {profile_results['quality']['overall_completeness']:.2f}%")
Best Practices
- Sample Large Datasets: For datasets >1M rows, profile a representative sample (see the sketch after this list)
- Handle Missing Data: Check for nulls before calculating statistics
- Type Inference: Use semantic types, not just Python/pandas types
- Memory Management: Be cautious with very wide datasets (many columns)
- Percentiles: Percentiles describe a distribution more faithfully than the mean alone
- Pattern Detection: Test patterns on non-null values only
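A minimal sampling sketch for the first practice above (the 1M-row threshold and 100k sample size are assumptions to tune per dataset):

def sample_for_profiling(df: pd.DataFrame, max_rows: int = 1_000_000,
                         sample_size: int = 100_000, seed: int = 42) -> pd.DataFrame:
    """Return the full frame when it is small enough, otherwise a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=sample_size, random_state=seed)

# profile_results = profile_dataset(sample_for_profiling(df))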
Output Format for Personas
When using this skill, output results in a structured format:
dataset_profile:
  metadata:
    row_count: 10000
    column_count: 15
    memory_usage_mb: 2.5
  schema:
    columns:
      - name: customer_id
        type: string
        nullable: false
        null_count: 0
        semantic_type: string
      - name: revenue
        type: float64
        nullable: true
        null_count: 250
        null_percentage: 2.5
        semantic_type: numeric
  quality:
    overall_completeness: 95.5
    columns:
      customer_id:
        completeness: 100.0
        type_consistency: 100.0
      revenue:
        completeness: 97.5
        type_consistency: 100.0
  recommendations:
    - "customer_id has 98% uniqueness - suitable as primary key"
    - "revenue has 2.5% nulls - consider default value or filtering"
    - "date column matches ISO 8601 format - ready for datetime conversion"
Integration with Other Skills
- azure-blob-storage: Load data from Azure before profiling
- databricks-query: Profile query results from Unity Catalog
- Use profiling results to inform the Data Ingestion Agent about required transformations
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
Xlsx
Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.
