PCA Analyzer
by SPIRAL-EDWIN
Dimensionality reduction technique using Principal Component Analysis. Extracts key features, removes multicollinearity, and visualizes high-dimensional data.
Principal Component Analysis (PCA)
Transforms a large set of variables into a smaller one that still contains most of the information in the large set.
When to Use
- Dimensionality Reduction: When you have too many variables (e.g., > 10) relative to your sample size.
- Multicollinearity: When independent variables are highly correlated, which destabilizes regression coefficient estimates (a quick correlation check is sketched after this list).
- Feature Extraction: To create new, uncorrelated indices (Principal Components) for ranking or clustering.
- Visualization: To plot high-dimensional data in 2D or 3D.
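A quick way to test the multicollinearity criterion before committing to PCA is to scan the correlation matrix. This is a minimal sketch; the helper name high_correlation_pairs and the 0.8 cutoff are illustrative choices, not part of this skill's files.
import pandas as pd
import numpy as np

def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and self-correlations are dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)
If this returns several pairs, PCA (or dropping one variable per pair) is worth considering.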
Algorithm Steps
- Standardization: Scale data to have mean=0 and variance=1 (Critical step, as PCA is sensitive to scale).
- Covariance Matrix: Compute the relationship between variables.
- Eigendecomposition: Calculate eigenvalues and eigenvectors of the covariance matrix.
- Selection: Sort eigenvalues and keep the top $k$ components that explain sufficient variance (e.g., > 85%).
- Projection: Transform the original data onto the new principal component axes. (These five steps are sketched in NumPy below.)
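Before reaching for scikit-learn, the five steps can be verified by hand with NumPy. This is a minimal sketch for intuition; variable names such as X_std and eig_vecs are illustrative only.
import numpy as np

# Toy data: 50 samples, 4 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# 1. Standardization: mean=0, variance=1 per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh suits symmetric matrices; eigenvalues come back ascending)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# 4. Selection: sort descending and keep the components covering >= 85% of variance
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
explained = eig_vals / eig_vals.sum()
k = np.searchsorted(np.cumsum(explained), 0.85) + 1

# 5. Projection onto the top-k principal component axes
scores = X_std @ eig_vecs[:, :k]
print(f"Kept {k} components explaining {explained[:k].sum():.1%} of variance")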
Implementation Template
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
def pca_analyzer(df, n_components=None, variance_threshold=0.85):
    """
    Perform PCA analysis.

    Args:
        df (pd.DataFrame): Numerical data features.
        n_components (int): Number of components to keep. If None, uses variance_threshold.
        variance_threshold (float): Keep components until cumulative variance > threshold.

    Returns:
        dict: {
            'model': sklearn PCA object,
            'transformed_data': pd.DataFrame (the principal components),
            'loadings': pd.DataFrame (correlations between vars and PCs),
            'explained_variance_ratio': np.array
        }
    """
    # 1. Standardization (Z-score normalization)
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(df)

    # 2. Fit PCA
    # If n_components is None, fit all components first to inspect the variance
    pca_full = PCA()
    pca_full.fit(data_scaled)

    # Determine n_components from the threshold if not specified
    if n_components is None:
        cumsum = np.cumsum(pca_full.explained_variance_ratio_)
        # +1 because the index starts at 0
        n_components = np.argmax(cumsum >= variance_threshold) + 1
        print(f"Selected {n_components} components to explain {variance_threshold:.0%} of variance.")

    # Refit with the chosen n_components
    pca = PCA(n_components=n_components)
    data_pca = pca.fit_transform(data_scaled)

    # 3. Create result DataFrame
    pc_columns = [f'PC{i+1}' for i in range(n_components)]
    df_pca = pd.DataFrame(data_pca, columns=pc_columns, index=df.index)

    # 4. Calculate loadings (eigenvectors * sqrt(eigenvalues))
    # Loadings represent the correlation between original variables and PCs
    loadings = pd.DataFrame(
        pca.components_.T * np.sqrt(pca.explained_variance_),
        columns=pc_columns,
        index=df.columns
    )

    return {
        'model': pca,
        'transformed_data': df_pca,
        'loadings': loadings,
        'explained_variance_ratio': pca.explained_variance_ratio_
    }
def plot_pca_results(pca_result):
    """Visualize Scree Plot and Loadings Heatmap"""
    pca = pca_result['model']
    loadings = pca_result['loadings']

    # A. Scree Plot (Pareto chart style)
    var_ratio = pca.explained_variance_ratio_
    cum_var_ratio = np.cumsum(var_ratio)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Scree Plot
    x = range(1, len(var_ratio) + 1)
    ax1.bar(x, var_ratio, alpha=0.6, align='center', label='Individual explained variance')
    ax1.step(x, cum_var_ratio, where='mid', label='Cumulative explained variance')
    ax1.set_ylabel('Explained variance ratio')
    ax1.set_xlabel('Principal components')
    ax1.set_title('Scree Plot')
    ax1.legend(loc='best')
    ax1.grid(True, alpha=0.3)

    # B. Loadings Heatmap
    sns.heatmap(loadings, annot=True, cmap='coolwarm', center=0, ax=ax2)
    ax2.set_title('Factor Loadings (Correlations)')

    plt.tight_layout()
    plt.savefig('pca_analysis.png', dpi=300)
    plt.show()
# --- Usage Example ---
if __name__ == "__main__":
    # Mock data: 5 variables, with VarA/VarB and VarC/VarD as highly correlated pairs
    np.random.seed(42)
    n_samples = 100
    X = np.random.rand(n_samples, 5)
    X[:, 1] = X[:, 0] * 0.8 + np.random.normal(0, 0.1, n_samples)
    X[:, 3] = X[:, 2] * 0.8 + np.random.normal(0, 0.1, n_samples)
    df = pd.DataFrame(X, columns=['VarA', 'VarB', 'VarC', 'VarD', 'VarE'])

    # Run PCA
    result = pca_analyzer(df, variance_threshold=0.90)
    print("Transformed Data Head:")
    print(result['transformed_data'].head())
    print("\nLoadings (Interpretation):")
    print(result['loadings'])

    # Visualize
    plot_pca_results(result)
Output Interpretation
- Scree Plot: The "elbow" point indicates the optimal number of components.
- Loadings:
- High absolute value (>0.5) = Strong relationship.
- Positive/Negative sign = Direction of correlation.
- Use these to name your components (e.g., if PC1 correlates with GDP and Income, call it "Economic Factor"); see the snippet after this list.
- Transformed Data: Use these new PC1, PC2, ... columns as inputs for regression or clustering (e.g., K-Means).
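To help name a component, list the variables with the strongest loadings. This short sketch assumes the result dict returned by pca_analyzer in the template above.
# Variables driving PC1, strongest first (assumes `result` from pca_analyzer above)
pc1 = result['loadings']['PC1']
print(pc1.reindex(pc1.abs().sort_values(ascending=False).index))

# Strong loadings (|loading| > 0.5) are the candidates for naming the component
print(pc1[pc1.abs() > 0.5])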
Integration Workflow
- Input: Run data-cleaner first; PCA cannot handle missing values.
- Downstream (a minimal sketch follows this list):
  - Clustering: Feed transformed_data into K-Means or Hierarchical Clustering.
  - Regression: Use the PCs as independent variables to avoid multicollinearity (Principal Component Regression).
- Evaluation: The PC1 score is sometimes used on its own as a composite evaluation index.
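A minimal sketch of the downstream steps, assuming the result dict from pca_analyzer above; the target y used in the regression branch is a hypothetical pandas Series aligned with df.index.
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

pcs = result['transformed_data']

# Clustering on the uncorrelated components
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(pcs)

# Principal Component Regression: PCs as predictors for a target y
# (y is assumed to be a pandas Series aligned with df.index)
pcr = LinearRegression().fit(pcs, y)
print("R^2 on PCs:", pcr.score(pcs, y))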
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modifying existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
