---
name: clustering-analyzer
description: Cluster data using K-Means, DBSCAN, hierarchical clustering. Use for customer segmentation, pattern discovery, or data grouping.
---
Clustering Analyzer
Analyze and cluster data using multiple algorithms with visualization and evaluation.
Features
- K-Means: Partition-based clustering with elbow method
- DBSCAN: Density-based clustering for arbitrary shapes
- Hierarchical: Agglomerative clustering with dendrograms
- Evaluation: Silhouette scores, cluster statistics
- Visualization: 2D/3D plots, dendrograms, elbow curves
- Export: Labeled data, cluster summaries
Quick Start
from clustering_analyzer import ClusteringAnalyzer
analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv")
# K-Means clustering
result = analyzer.kmeans(n_clusters=3)
print(f"Silhouette Score: {result['silhouette_score']:.3f}")
# Visualize
analyzer.plot_clusters("clusters.png")
CLI Usage
# K-Means clustering
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3
# Find optimal clusters (elbow method)
python clustering_analyzer.py --input data.csv --method kmeans --find-optimal
# DBSCAN clustering
python clustering_analyzer.py --input data.csv --method dbscan --eps 0.5 --min-samples 5
# Hierarchical clustering
python clustering_analyzer.py --input data.csv --method hierarchical --clusters 4
# Generate plots
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --plot clusters.png
# Export labeled data
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --output labeled.csv
# Select specific columns
python clustering_analyzer.py --input data.csv --columns age,income,spending --method kmeans --clusters 3
API Reference
ClusteringAnalyzer Class
class ClusteringAnalyzer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, columns: list = None) -> 'ClusteringAnalyzer'
    def load_dataframe(self, df: pd.DataFrame, columns: list = None) -> 'ClusteringAnalyzer'

    # Clustering methods
    def kmeans(self, n_clusters: int, **kwargs) -> dict
    def dbscan(self, eps: float = 0.5, min_samples: int = 5) -> dict
    def hierarchical(self, n_clusters: int, linkage: str = "ward") -> dict

    # Optimal clusters
    def find_optimal_clusters(self, max_k: int = 10) -> dict
    def elbow_plot(self, output: str, max_k: int = 10) -> str

    # Evaluation
    def silhouette_score(self) -> float
    def cluster_statistics(self) -> dict

    # Visualization
    def plot_clusters(self, output: str, dimensions: list = None) -> str
    def plot_dendrogram(self, output: str) -> str
    def plot_silhouette(self, output: str) -> str

    # Export
    def get_labels(self) -> list
    def to_dataframe(self) -> pd.DataFrame
    def save_labeled(self, output: str) -> str
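Because the loaders return the analyzer instance, loading and clustering can be chained. A minimal sketch (the file name and column names are illustrative):
from clustering_analyzer import ClusteringAnalyzer
# load_csv returns the analyzer itself, so the calls chain
analyzer = ClusteringAnalyzer()
result = analyzer.load_csv("customers.csv", columns=["age", "income"]).kmeans(n_clusters=3)
print(result["cluster_sizes"])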
Clustering Methods
K-Means
Best for roughly spherical clusters when the number of groups is known in advance:
result = analyzer.kmeans(n_clusters=3)
# Returns:
{
    "labels": [0, 1, 2, 0, ...],
    "n_clusters": 3,
    "silhouette_score": 0.65,
    "inertia": 1234.56,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "centroids": [[...], [...], [...]]
}
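A short sketch of reading the returned dictionary (assumes data has already been loaded):
result = analyzer.kmeans(n_clusters=3)
for cluster_id, size in result["cluster_sizes"].items():
    print(f"Cluster {cluster_id}: {size} points")
print(f"Inertia: {result['inertia']:.1f}, silhouette: {result['silhouette_score']:.3f}")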
DBSCAN
Best for arbitrarily shaped clusters; points that do not fall in any dense region are labeled as noise (-1):
result = analyzer.dbscan(eps=0.5, min_samples=5)
# Returns:
{
    "labels": [0, 0, 1, -1, ...],  # -1 = noise
    "n_clusters": 3,
    "n_noise": 15,
    "silhouette_score": 0.58,
    "cluster_sizes": {0: 150, 1: 200, 2: 100}
}
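DBSCAN results are sensitive to eps. One option is to sweep a few candidate values and keep the one with the best silhouette, using only the dbscan() call above (a sketch; the eps range is illustrative):
best = None
for eps in [0.2, 0.3, 0.5, 0.8, 1.0]:
    result = analyzer.dbscan(eps=eps, min_samples=5)
    # Skip degenerate runs (everything noise or a single cluster)
    if result["n_clusters"] < 2:
        continue
    if best is None or result["silhouette_score"] > best["silhouette_score"]:
        best = {**result, "eps": eps}
if best:
    print(f"Best eps: {best['eps']} with {best['n_clusters']} clusters")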
Hierarchical (Agglomerative)
Best for understanding cluster hierarchy:
result = analyzer.hierarchical(n_clusters=4, linkage="ward")
# Returns:
{
    "labels": [0, 1, 2, 3, ...],
    "n_clusters": 4,
    "silhouette_score": 0.62,
    "cluster_sizes": {0: 100, 1: 150, 2: 120, 3: 80}
}
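The linkage strategy changes how clusters merge. A small sketch comparing the standard scikit-learn linkages by silhouette score (assuming the linkage argument is passed through to the underlying agglomerative model):
for linkage in ["ward", "complete", "average", "single"]:
    result = analyzer.hierarchical(n_clusters=4, linkage=linkage)
    print(f"{linkage:>8}: silhouette = {result['silhouette_score']:.3f}")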
Finding Optimal Clusters
Elbow Method
optimal = analyzer.find_optimal_clusters(max_k=10)
# Returns:
{
    "optimal_k": 4,
    "inertias": [1000, 800, 500, 300, 280, ...],
    "silhouettes": [0.5, 0.55, 0.6, 0.65, 0.63, ...]
}
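The returned lists can also be charted directly with matplotlib for a custom report. A sketch, assuming the entries correspond to k = 2..max_k in order:
import matplotlib.pyplot as plt
optimal = analyzer.find_optimal_clusters(max_k=10)
ks = range(2, 2 + len(optimal["inertias"]))  # assumed k range
fig, ax1 = plt.subplots()
ax1.plot(ks, optimal["inertias"], marker="o")
ax1.set_xlabel("k")
ax1.set_ylabel("inertia")
ax2 = ax1.twinx()
ax2.plot(ks, optimal["silhouettes"], marker="s", color="tab:orange")
ax2.set_ylabel("silhouette")
ax1.axvline(optimal["optimal_k"], linestyle="--", color="gray")
fig.savefig("k_selection.png")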
Elbow Plot
analyzer.elbow_plot("elbow.png", max_k=10)
Generates a plot of inertia versus the number of clusters; the "elbow" where the curve flattens suggests a reasonable k.
Cluster Statistics
stats = analyzer.cluster_statistics()
# Returns:
{
    "n_clusters": 3,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "cluster_means": {
        0: {"age": 25.5, "income": 45000, ...},
        1: {"age": 45.2, "income": 75000, ...},
        2: {"age": 35.1, "income": 55000, ...}
    },
    "cluster_std": {
        0: {"age": 5.2, "income": 8000, ...},
        ...
    },
    "overall_silhouette": 0.65
}
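For side-by-side comparison, the nested cluster_means dictionary converts cleanly to a pandas DataFrame (a sketch, based on the layout shown above):
import pandas as pd
stats = analyzer.cluster_statistics()
means = pd.DataFrame(stats["cluster_means"]).T  # one row per cluster, one column per feature
means["size"] = pd.Series(stats["cluster_sizes"])
print(means.sort_values("size", ascending=False))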
Visualization
Cluster Plot
# 2D plot (uses first 2 features or PCA)
analyzer.plot_clusters("clusters_2d.png")
# Specify dimensions
analyzer.plot_clusters("clusters.png", dimensions=["age", "income"])
Dendrogram
# For hierarchical clustering
analyzer.hierarchical(n_clusters=4)
analyzer.plot_dendrogram("dendrogram.png")
Silhouette Plot
analyzer.plot_silhouette("silhouette.png")
Shows the silhouette coefficient for each sample.
Export Results
Get Labels
labels = analyzer.get_labels()
# [0, 1, 2, 0, 1, ...]
Save Labeled Data
analyzer.save_labeled("labeled_data.csv")
# Original data + cluster_label column
Get Full DataFrame
df = analyzer.to_dataframe()
# DataFrame with cluster_label column
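Because the DataFrame carries a cluster_label column, standard pandas group-bys cover per-cluster analysis and export (a sketch; file names are illustrative):
df = analyzer.to_dataframe()
# Per-cluster feature means straight from pandas
print(df.groupby("cluster_label").mean(numeric_only=True))
# Write one CSV per cluster
for label, group in df.groupby("cluster_label"):
    group.to_csv(f"cluster_{label}.csv", index=False)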
Example Workflows
Customer Segmentation
analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv", columns=["age", "income", "spending_score"])
# Find optimal number of segments
optimal = analyzer.find_optimal_clusters(max_k=8)
print(f"Optimal segments: {optimal['optimal_k']}")
# Cluster with optimal k
result = analyzer.kmeans(n_clusters=optimal['optimal_k'])
# Get segment characteristics
stats = analyzer.cluster_statistics()
for cluster_id, means in stats["cluster_means"].items():
    print(f"\nSegment {cluster_id}:")
    for feature, value in means.items():
        print(f"  {feature}: {value:.2f}")
# Save segmented data
analyzer.save_labeled("customer_segments.csv")
Anomaly Detection with DBSCAN
analyzer = ClusteringAnalyzer()
analyzer.load_csv("transactions.csv", columns=["amount", "frequency"])
# DBSCAN identifies noise points as potential anomalies
result = analyzer.dbscan(eps=0.3, min_samples=10)
print(f"Found {result['n_noise']} potential anomalies")
# Get anomalous records
df = analyzer.to_dataframe()
anomalies = df[df["cluster_label"] == -1]
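A short follow-up that quantifies and exports the flagged rows (the output path is illustrative):
rate = len(anomalies) / len(df)
print(f"Anomaly rate: {rate:.1%}")
anomalies.to_csv("anomalies.csv", index=False)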
Document Clustering
# After TF-IDF transformation
analyzer = ClusteringAnalyzer()
analyzer.load_dataframe(tfidf_matrix)
# Hierarchical clustering to see document relationships
result = analyzer.hierarchical(n_clusters=5)
analyzer.plot_dendrogram("doc_dendrogram.png")
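The tfidf_matrix above is assumed to be a dense DataFrame of TF-IDF features. One way to build it with scikit-learn's TfidfVectorizer (a sketch; the corpus and feature cap are illustrative):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
documents = ["first example document", "second example document", "third example document"]
vectorizer = TfidfVectorizer(max_features=100, stop_words="english")
tfidf = vectorizer.fit_transform(documents)  # sparse matrix
tfidf_matrix = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())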
Data Preprocessing
The analyzer automatically:
- Handles missing values (imputation)
- Scales features (standardization)
- Reduces dimensions for visualization (PCA)
For custom preprocessing:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Preprocess manually
df = pd.read_csv("data.csv")
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Load preprocessed data
analyzer.load_dataframe(df_scaled)
Dependencies
- scikit-learn>=1.3.0
- pandas>=2.0.0
- numpy>=1.24.0
- matplotlib>=3.7.0
- scipy>=1.10.0