PCA Analyzer

by SPIRAL-EDWIN


Dimensionality reduction technique using Principal Component Analysis. Extracts key features, removes multicollinearity, and visualizes high-dimensional data.



name: pca-analyzer
description: Dimensionality reduction technique using Principal Component Analysis. Extracts key features, removes multicollinearity, and visualizes high-dimensional data.

Principal Component Analysis (PCA)

Transforms a large set of variables into a smaller one that still contains most of the information in the large set.

When to Use

  • Dimensionality Reduction: When you have too many variables (e.g., > 10) relative to your sample size.
  • Multicollinearity: When independent variables are highly correlated (which destabilizes regression coefficient estimates).
  • Feature Extraction: To create new, uncorrelated indices (Principal Components) for ranking or clustering.
  • Visualization: To plot high-dimensional data in 2D or 3D.
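
Before reaching for PCA, it can help to confirm that multicollinearity is actually present. The sketch below uses a hypothetical helper, `high_correlation_pairs` (not part of this skill), to flag variable pairs whose absolute correlation exceeds a threshold:

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.8):
    """Return (var_i, var_j, |corr|) for pairs above the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs

# Synthetic check: x2 is a noisy copy of x1, x3 is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "x1": a,
    "x2": a * 0.9 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})
print(high_correlation_pairs(df))  # the x1/x2 pair should appear
```

If the helper returns several pairs, PCA (or dropping redundant variables) is worth considering before fitting a regression.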

Algorithm Steps

  1. Standardization: Scale data to have mean=0 and variance=1 (Critical step, as PCA is sensitive to scale).
  2. Covariance Matrix: Compute the relationship between variables.
  3. Eigendecomposition: Calculate eigenvalues and eigenvectors of the covariance matrix.
  4. Selection: Sort eigenvalues and keep the top $k$ components that explain sufficient variance (e.g., > 85%).
  5. Projection: Transform the original data onto the new principal component axes.
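
The five steps above can be sketched directly in NumPy. This is an illustrative re-derivation, not the library implementation (scikit-learn's `PCA` computes the same result via SVD):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))

# 1. Standardization: mean 0, variance 1 per column
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(Xs, rowvar=False)

# 3. Eigendecomposition (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort descending and keep components covering >= 85% of variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
cum_ratio = np.cumsum(eigvals) / eigvals.sum()
k = int(np.argmax(cum_ratio >= 0.85)) + 1

# 5. Projection: transform the data onto the top-k principal axes
scores = Xs @ eigvecs[:, :k]
print(scores.shape)
```

The resulting `scores` columns are the principal components; their sample covariance matrix is (up to floating-point error) diagonal, with the retained eigenvalues on the diagonal.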

Implementation Template

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_analyzer(df, n_components=None, variance_threshold=0.85):
    """
    Perform PCA analysis.
    
    Args:
        df (pd.DataFrame): Numerical data features.
        n_components (int): Number of components to keep. If None, uses variance_threshold.
        variance_threshold (float): Keep components until cumulative variance > threshold.
        
    Returns:
        dict: {
            'model': sklearn PCA object,
            'transformed_data': pd.DataFrame (the principal components),
            'loadings': pd.DataFrame (correlations between vars and PCs),
            'explained_variance_ratio': np.array
        }
    """
    # 1. Standardization (Z-score normalization)
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(df)
    
    # 2. Fit PCA
    # If n_components is None, fit all first to check variance
    pca_full = PCA()
    pca_full.fit(data_scaled)
    
    # Determine n_components based on threshold if not specified
    if n_components is None:
        cumsum = np.cumsum(pca_full.explained_variance_ratio_)
        # +1 because index starts at 0
        n_components = np.argmax(cumsum >= variance_threshold) + 1
        print(f"Selected {n_components} components to explain {variance_threshold*100}% variance.")
    
    # Refit with chosen n_components
    pca = PCA(n_components=n_components)
    data_pca = pca.fit_transform(data_scaled)
    
    # 3. Create Result DataFrame
    pc_columns = [f'PC{i+1}' for i in range(n_components)]
    df_pca = pd.DataFrame(data_pca, columns=pc_columns, index=df.index)
    
    # 4. Calculate Loadings (Eigenvectors * sqrt(Eigenvalues))
    # Loadings represent the correlation between original variables and PCs
    loadings = pd.DataFrame(
        pca.components_.T * np.sqrt(pca.explained_variance_), 
        columns=pc_columns, 
        index=df.columns
    )
    
    return {
        'model': pca,
        'transformed_data': df_pca,
        'loadings': loadings,
        'explained_variance_ratio': pca.explained_variance_ratio_
    }

def plot_pca_results(pca_result):
    """Visualize Scree Plot and Loadings Heatmap"""
    pca = pca_result['model']
    loadings = pca_result['loadings']
    
    # A. Scree Plot (Pareto Chart style)
    var_ratio = pca.explained_variance_ratio_
    cum_var_ratio = np.cumsum(var_ratio)
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Scree Plot
    x = range(1, len(var_ratio) + 1)
    ax1.bar(x, var_ratio, alpha=0.6, align='center', label='Individual explained variance')
    ax1.step(x, cum_var_ratio, where='mid', label='Cumulative explained variance')
    ax1.set_ylabel('Explained variance ratio')
    ax1.set_xlabel('Principal components')
    ax1.set_title('Scree Plot')
    ax1.legend(loc='best')
    ax1.grid(True, alpha=0.3)
    
    # B. Loadings Heatmap
    sns.heatmap(loadings, annot=True, cmap='coolwarm', center=0, ax=ax2)
    ax2.set_title('Factor Loadings (Correlations)')
    
    plt.tight_layout()
    plt.savefig('pca_analysis.png', dpi=300)
    plt.show()

# --- Usage Example ---
if __name__ == "__main__":
    # Mock Data: 5 correlated variables
    np.random.seed(42)
    n_samples = 100
    # VarA/VarB highly correlated; VarC/VarD highly correlated
    X = np.random.rand(n_samples, 5)
    X[:, 1] = X[:, 0] * 0.8 + np.random.normal(0, 0.1, n_samples)
    X[:, 3] = X[:, 2] * 0.8 + np.random.normal(0, 0.1, n_samples)
    
    df = pd.DataFrame(X, columns=['VarA', 'VarB', 'VarC', 'VarD', 'VarE'])
    
    # Run PCA
    result = pca_analyzer(df, variance_threshold=0.90)
    
    print("Transformed Data Head:")
    print(result['transformed_data'].head())
    
    print("\nLoadings (Interpretation):")
    print(result['loadings'])
    
    # Visualize
    plot_pca_results(result)

Output Interpretation

  • Scree Plot: The "elbow" point indicates the optimal number of components.
  • Loadings:
    • High absolute value (>0.5) = Strong relationship.
    • Positive/Negative sign = Direction of correlation.
    • Use these to name your components (e.g., if PC1 correlates with GDP and Income, call it "Economic Factor").
  • Transformed Data: Use these new PC1, PC2 columns as inputs for regression or clustering (K-Means).

Integration Workflow

  • Input: Use data-cleaner first; PCA cannot handle missing values.
  • Downstream:
    • Clustering: Feed transformed_data into K-Means or Hierarchical Clustering.
    • Regression: Use PCs as independent variables to avoid multicollinearity (Principal Component Regression).
    • Evaluation: Sometimes PC1 score is used as a comprehensive evaluation index itself.
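
As one concrete downstream path, the clustering hand-off can be sketched as follows. This is a minimal, self-contained example (it builds its own toy data rather than reusing `pca_analyzer`'s output): standardize, project onto 2 components, then cluster the scores with K-Means.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two well-separated blobs in 5-D so the clustering has structure to find
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(50, 5)),
    rng.normal(4, 1, size=(50, 5)),
])

# Standardize, then project onto the first 2 principal components
Xs = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(Xs)

# Cluster in the reduced 2-D space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print(pd.Series(labels).value_counts())
```

Clustering in PC space rather than the original feature space reduces noise and redundancy; the same `scores` array could equally serve as regressors in a Principal Component Regression.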

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas.

data

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

data, cli


Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis.

data

Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

data

Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

design, data

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

testing, document, tool

Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

design, data

Anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.

art, tool, data


Skill Information

Category: Data
Last Updated: 1/28/2026