# Teradata K-Means Clustering

by teradata-labs
| Skill Name | Teradata K-Means Clustering |
|---|---|
| Description | K-means clustering for customer segmentation and data grouping |
| Category | Clustering Analytics |
| Function | TD_KMeans |
## Core Capabilities
- Complete analytical workflow from data exploration to model deployment
- Automated preprocessing including scaling, encoding, and train-test splitting
- Advanced TD_KMeans implementation with parameter optimization
- Comprehensive evaluation metrics and model validation
- Production-ready SQL generation with proper table management
- Error handling and data quality checks throughout the pipeline
- Business-focused interpretation of analytical results
## Table Analysis Workflow
This skill automatically analyzes your provided table to generate optimized SQL workflows. Here's how it works:
### 1. Table Structure Analysis
- Column Detection: Automatically identifies all columns and their data types
- Data Type Classification: Distinguishes between numeric, categorical, and text columns
- Primary Key Identification: Detects unique identifier columns
- Missing Value Assessment: Analyzes data completeness
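The structure and completeness checks above can be run directly in Teradata; a minimal sketch (all database, table, and column names are placeholders):

```sql
-- Inspect column names and data types
HELP TABLE your_database.your_table;

-- Assess completeness of a candidate feature column
SELECT COUNT(*)                              AS total_rows,
       COUNT(your_feature_column)            AS non_null_rows,
       COUNT(*) - COUNT(your_feature_column) AS null_rows
FROM your_database.your_table;
```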
### 2. Feature Engineering Recommendations
- Numeric Features: Identifies columns suitable for scaling and normalization
- Categorical Features: Detects columns requiring encoding (one-hot, label encoding)
- Clustering Features: Helps select the columns passed to TD_KMeans (clustering is unsupervised, so no dependent variable is required)
- Feature Selection: Recommends relevant features based on data types
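These recommendations map onto ClearScape Analytics preprocessing functions; a hedged sketch using TD_ScaleFit/TD_ScaleTransform plus plain-SQL one-hot encoding (argument names follow recent Vantage releases and may differ in yours; all table, column, and category names are placeholders):

```sql
-- Fit standard-scaling (z-score) parameters on the numeric features
SELECT * FROM TD_ScaleFit (
    ON your_database.your_table AS InputTable
    OUT TABLE OutputTable (your_database.scale_fit_tbl)
    USING
        TargetColumns ('feature_1', 'feature_2')
        ScaleMethod ('STD')
) AS dt;

-- Apply the fitted scaling, carrying the row identifier through
CREATE TABLE your_database.scaled_tbl AS (
    SELECT * FROM TD_ScaleTransform (
        ON your_database.your_table AS InputTable
        ON your_database.scale_fit_tbl AS FitTable DIMENSION
        USING
            Accumulate ('your_id_column')
    ) AS dt
) WITH DATA;

-- Simple one-hot encoding of a categorical column in plain SQL
SELECT your_id_column,
       CASE WHEN segment = 'retail'    THEN 1 ELSE 0 END AS seg_retail,
       CASE WHEN segment = 'wholesale' THEN 1 ELSE 0 END AS seg_wholesale
FROM your_database.your_table;
```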
### 3. SQL Generation Process
- Dynamic Column Lists: Generates column lists based on your table structure
- Parameterized Queries: Creates flexible SQL templates using your table schema
- Table Name Integration: Replaces placeholders with your actual table names
- Database Context: Adapts to your database and schema naming conventions
## How to Use This Skill

1. Provide your table information:
   "Analyze table: database_name.table_name" or "Use table: my_data with target column: target_var"
2. The skill will:
   - Query your table structure using `HELP TABLE table_name`
   - Analyze data types and suggest appropriate preprocessing
   - Generate a complete SQL workflow with your specific column names
   - Provide optimized parameters based on your data characteristics
## Input Requirements

### Data Requirements
- Source table: Teradata table with analytical data
- Feature columns: variables used to form the clusters (numeric, plus categorical columns after encoding); K-means is unsupervised, so no dependent "target" variable is needed, and the `TargetColumns` argument of TD_KMeans simply names the clustering features
- ID column: Unique identifier for record tracking
- Minimum sample size: 100+ records for reliable clustering results
### Technical Requirements
- Teradata Vantage with ClearScape Analytics enabled
- Database permissions: CREATE, DROP, SELECT on working database
- Function access: TD_KMeans, TD_KMeansPredict
## Output Formats

### Generated Tables
- Preprocessed data tables with proper scaling and encoding
- Train/test split tables for model validation
- Model table containing trained TD_KMeans parameters
- Prediction results with confidence metrics
- Evaluation metrics table with performance statistics
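The train/test split tables can be produced with Teradata's `SAMPLE` clause; a minimal sketch (table and column names are placeholders, and the `NOT IN` anti-join assumes the id column is non-null):

```sql
-- ~80% of rows for training
CREATE TABLE your_database.train_tbl AS (
    SELECT * FROM your_database.scaled_tbl SAMPLE 0.80
) WITH DATA;

-- Remaining rows for testing
CREATE TABLE your_database.test_tbl AS (
    SELECT t.*
    FROM your_database.scaled_tbl t
    WHERE t.your_id_column NOT IN (SELECT your_id_column FROM your_database.train_tbl)
) WITH DATA;
```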
### SQL Scripts
- Complete workflow scripts ready for execution
- Parameterized queries for different datasets
- Table management with proper cleanup procedures
## Clustering Use Cases Supported
- Customer segmentation: group customers by purchasing behavior, demographics, or engagement
- Market analysis: identify natural groupings in market, product, or pricing data
- Data grouping: partition any numeric dataset into homogeneous clusters
## Best Practices Applied
- Data validation before analysis execution
- Proper feature scaling and categorical encoding
- Train-test splitting so cluster assignments can be checked on held-out data
- Systematic choice of the number of clusters (e.g. comparing within-cluster sum of squares across candidate values of k)
- Cluster diagnostics such as cluster sizes and within-cluster variance
- Business interpretation of statistical results
- Documentation of methodology and assumptions
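One systematic approach to choosing k is the elbow method: train TD_KMeans for several values of k, compare the within-cluster sum of squares each run reports with its model output, and pick the k where the curve flattens. A hedged sketch (argument names follow recent Vantage releases; the exact names of the fit-summary columns in the model output vary by version, so inspect them before automating):

```sql
-- Train a candidate model for k = 3 and inspect the fit summary
SELECT * FROM TD_KMeans (
    ON your_database.train_tbl AS InputTable
    OUT TABLE ModelTable (your_database.kmeans_model_k3)
    USING
        IdColumn ('your_id_column')
        TargetColumns ('feature_1', 'feature_2')
        NumClusters (3)
        Seed (42)
) AS dt;
-- Repeat with NumClusters(4), NumClusters(5), ... into separate model tables,
-- then compare the reported within-cluster sum of squares and choose the k
-- where further increases stop paying off.
```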
## Example Usage

```sql
-- Example workflow for Teradata K-Means Clustering
-- Replace 'your_table' with the actual table name

-- 1. Data exploration and validation
SELECT COUNT(*)                        AS row_cnt,
       COUNT(DISTINCT your_id_column)  AS distinct_ids,
       AVG(your_feature_column)        AS feature_mean,
       STDDEV_SAMP(your_feature_column) AS feature_stddev
FROM your_database.your_table;

-- 2. Execute the complete clustering workflow
-- (Detailed SQL provided by the skill)
```
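A hedged end-to-end sketch of the training and scoring calls behind step 2, using TD_KMeans and TD_KMeansPredict (argument names follow recent ClearScape Analytics releases and may differ in yours; all table and column names are placeholders):

```sql
-- 2a. Train: fit k-means on the training split and persist the model table
SELECT * FROM TD_KMeans (
    ON your_database.train_tbl AS InputTable
    OUT TABLE ModelTable (your_database.kmeans_model)
    USING
        IdColumn ('your_id_column')
        TargetColumns ('feature_1', 'feature_2')
        NumClusters (4)
        StopThreshold (0.01)
        MaxIterNum (50)
        Seed (42)
) AS dt;

-- 2b. Score: assign held-out rows to the trained clusters
CREATE TABLE your_database.cluster_assignments AS (
    SELECT * FROM TD_KMeansPredict (
        ON your_database.test_tbl AS InputTable
        ON your_database.kmeans_model AS ModelTable DIMENSION
        USING
            OutputDistance ('true')       -- include distance to the assigned centroid
            Accumulate ('your_id_column')
    ) AS dt
) WITH DATA;
```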
## Scripts Included

### Core Analytics Scripts
- `preprocessing.sql`: Data preparation and feature engineering
- `table_analysis.sql`: Automatic table structure analysis
- `complete_workflow_template.sql`: End-to-end workflow template
- `model_training.sql`: TD_KMeans training procedures
- `prediction.sql`: TD_KMeansPredict execution
- `evaluation.sql`: Model validation and metrics calculation
### Utility Scripts
- `data_quality_checks.sql`: Comprehensive data validation
- `parameter_tuning.sql`: Systematic parameter optimization
- `diagnostic_queries.sql`: Model diagnostics and interpretation
## Limitations and Disclaimers
- Data quality: Results depend on input data quality and completeness
- Sample size: Minimum sample size requirements for reliable results
- Feature selection: Manual feature engineering may be required
- Computational resources: Large datasets may require optimization
- Business context: Statistical results require domain expertise for interpretation
- Model assumptions: Understand underlying mathematical assumptions
## Quality Checks

### Automated Validations
- Data completeness verification before analysis
- Statistical assumptions testing where applicable
- Model convergence monitoring during training
- Prediction quality assessment using validation data
- Performance metrics calculation and interpretation
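Prediction quality on the validation data can be summarized with a silhouette score; ClearScape Analytics ships a TD_Silhouette function for this. The sketch below assumes the predicted-cluster column is named as in recent releases (check the actual output column names of your TD_KMeansPredict run first; all other names are placeholders):

```sql
-- Average silhouette score over the scored test rows
SELECT * FROM TD_Silhouette (
    ON your_database.cluster_assignments AS InputTable
    USING
        IdColumn ('your_id_column')
        ClusterIdColumn ('td_clusterid_kmeans')  -- assumed name of the predicted cluster column
        TargetColumns ('feature_1', 'feature_2')
        OutputType ('SCORE')
) AS dt;
```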
### Manual Review Points
- Feature selection appropriateness for business problem
- Model interpretation alignment with domain knowledge
- Results validation against business expectations
- Documentation completeness for reproducibility
## Updates and Maintenance
- Version compatibility: Tested with latest Teradata Vantage releases
- Performance optimization: Regular query performance reviews
- Best practices: Updated based on analytics community feedback
- Documentation: Maintained with latest ClearScape Analytics features
- Examples: Updated with real-world use cases and scenarios
This skill provides production-ready clustering analytics using Teradata ClearScape Analytics TD_KMeans with comprehensive data science best practices.
## Related Skills

### Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) creating new spreadsheets with formulas and formatting, (2) reading or analyzing data, (3) modifying existing spreadsheets while preserving formulas, (4) data analysis and visualization in spreadsheets, or (5) recalculating formulas.

### Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

### Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis.

### Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

### Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

### Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

### Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

### Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
