PCA Decomposition
by benchflow-ai
Reduce dimensionality of multivariate data using PCA with varimax rotation. Use when you have many correlated variables and need to identify underlying factors or reduce collinearity.
name: pca-decomposition
description: Reduce dimensionality of multivariate data using PCA with varimax rotation. Use when you have many correlated variables and need to identify underlying factors or reduce collinearity.
license: MIT
PCA Decomposition Guide
Overview
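Principal Component Analysis (PCA) reduces many correlated variables to a smaller set of uncorrelated components. Varimax rotation makes those components easier to interpret: it rotates them to maximize the variance of the squared loadings within each component, so each variable tends to load strongly on one component and weakly on the others.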
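As a rough illustration of the first claim, the sketch below runs an unrotated PCA by hand with NumPy and checks that the component scores come out uncorrelated. The synthetic data, the `mixing` matrix, and the choice of two components are assumptions made for the example, not part of this skill.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 4 correlated variables driven by 2 hidden signals
hidden = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 0.8, 0.0, 0.2],
                   [0.0, 0.3, 1.0, 0.9]])
X = hidden @ mixing + 0.2 * rng.normal(size=(300, 4))

# Standardize, then eigendecompose the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]             # reorder to largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the first 2 components and project the standardized data onto them
scores = Z @ eigvecs[:, :2]

# The off-diagonal correlation between the two score columns is ~0
score_corr = np.corrcoef(scores, rowvar=False)
explained = eigvals[:2].sum() / eigvals.sum()
print(np.round(score_corr, 3))
print(f"share of variance kept by 2 components: {explained:.2f}")
```

Four correlated inputs collapse to two uncorrelated columns while most of the variance is retained, which is the property the rest of this guide relies on.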
When to Use PCA
- Many correlated predictor variables
- Need to identify underlying factor groups
- Reduce multicollinearity before regression
- Exploratory data analysis
Basic PCA with Varimax Rotation
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer
# Standardize data first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
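# PCA with varimax rotation
# method='principal' requests principal-component extraction;
# the default ('minres') fits a factor-analysis model instead
fa = FactorAnalyzer(n_factors=4, rotation='varimax', method='principal')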
fa.fit(X_scaled)
# Get factor loadings
loadings = fa.loadings_
# Get component scores for each observation
scores = fa.transform(X_scaled)
Workflow for Attribution Analysis
When using PCA for contribution analysis with predefined categories:
- Combine ALL variables first, then do PCA together:
# Include all variables from all categories in one matrix
all_vars = ['AirTemp', 'NetRadiation', 'Precip', 'Inflow', 'Outflow',
            'WindSpeed', 'DevelopedArea', 'AgricultureArea']
X = df[all_vars].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# PCA on ALL variables together (method='principal' selects PCA extraction)
fa = FactorAnalyzer(n_factors=4, rotation='varimax', method='principal')
fa.fit(X_scaled)
scores = fa.transform(X_scaled)
- Interpret loadings to map factors to categories (optional, for understanding)
- Use factor scores directly for R² decomposition
Important: Do NOT run separate PCA for each category. Run one global PCA on all variables, then use the resulting factor scores for contribution analysis.
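The guide does not spell out the R² decomposition step. One common approach, assumed here, relies on the factor scores being mutually uncorrelated: in that case the R² of an ordinary least squares fit on the scores equals the sum of the squared correlations between the response and each score, so each squared correlation can be read as that factor's contribution. The helper name `r2_contributions` and the synthetic data are hypothetical; if your scores are noticeably correlated, the identity holds only approximately.

```python
import numpy as np

def r2_contributions(scores, y):
    """Per-factor share of R^2; exact only when score columns are uncorrelated."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=float)
    # Squared correlation between y and each score column
    contrib = np.array([np.corrcoef(scores[:, j], y)[0, 1] ** 2
                        for j in range(scores.shape[1])])
    # Full-model R^2 from ordinary least squares with an intercept
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - resid.var() / y.var()
    return contrib, r2

# Illustrative stand-in for factor scores: centered columns made orthogonal via QR
rng = np.random.default_rng(1)
raw = rng.normal(size=(200, 3))
raw -= raw.mean(axis=0)
scores, _ = np.linalg.qr(raw)                 # columns are exactly uncorrelated
y = 2.0 * scores[:, 0] + 0.5 * scores[:, 1] + 0.05 * rng.normal(size=200)

contrib, r2 = r2_contributions(scores, y)
print("per-factor contribution:", np.round(contrib, 3))
print("sum of contributions:", round(float(contrib.sum()), 3), "model R^2:", round(float(r2), 3))
```

With real data, pass the `scores` array from the global PCA in place of the synthetic one, and compare `contrib.sum()` with `r2` to see how closely the additive split holds.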
Interpreting Factor Loadings
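Loadings show the correlation between each original variable and each component. Judge strength by the absolute value of the loading; the sign only gives the direction of the relationship:
| Absolute loading | Interpretation |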
|---|---|
| > 0.7 | Strong association |
| 0.4 - 0.7 | Moderate association |
| < 0.4 | Weak association |
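The thresholds in the table can be applied mechanically. The small helper below is a sketch, not part of any library: `describe_loading` is a hypothetical name, it uses the absolute value because a negative loading indicates an inverse relationship rather than a weak one, and the example loadings are invented.

```python
def describe_loading(loading):
    """Map a loading to the strength labels from the table, using its magnitude."""
    size = abs(loading)
    if size > 0.7:
        return "strong"
    if size >= 0.4:
        return "moderate"
    return "weak"

# Hypothetical loadings of one variable on three rotated components
example = {"RC1": 0.82, "RC2": -0.45, "RC3": 0.10}
for component, value in example.items():
    print(component, value, describe_loading(value))
```

A variable is usually assigned to the component where its loading is strong; several moderate loadings on different components suggest it does not belong cleanly to any one factor.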
Example: Economic Indicators
import pandas as pd
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer
# Variables: gdp, unemployment, inflation, interest_rate, exports, imports
df = pd.read_csv('economic_data.csv')
variables = ['gdp', 'unemployment', 'inflation',
             'interest_rate', 'exports', 'imports']
X = df[variables].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# method='principal' selects PCA extraction (the default is 'minres')
fa = FactorAnalyzer(n_factors=3, rotation='varimax', method='principal')
fa.fit(X_scaled)
# View loadings
loadings_df = pd.DataFrame(
    fa.loadings_,
    index=variables,
    columns=['RC1', 'RC2', 'RC3']
)
print(loadings_df.round(2))
Choosing Number of Factors
Option 1: Kaiser Criterion
# Eigenvalues of the correlation matrix (fa must already be fitted)
eigenvalues, _ = fa.get_eigenvalues()
# Keep factors with eigenvalue > 1
n_factors = int((eigenvalues > 1).sum())
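The Kaiser rule only needs the eigenvalues of the correlation matrix, so it can also be checked before fitting anything. The NumPy sketch below does this on synthetic data; the data layout (six variables built from two independent signals) is an assumption chosen so the expected answer is obvious.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (illustrative only): columns 0-2 follow signal 0, columns 3-5 follow signal 1
signals = rng.normal(size=(500, 2))
X = np.repeat(signals, 3, axis=1) + 0.3 * rng.normal(size=(500, 6))

# Eigenvalues of the correlation matrix, sorted largest first
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Kaiser criterion: keep components whose eigenvalue exceeds 1
n_factors = int((eigenvalues > 1).sum())
print(np.round(eigenvalues, 2))
print("factors to keep:", n_factors)
```

The eigenvalues of a correlation matrix sum to the number of variables, so an eigenvalue above 1 means the component carries more variance than a single standardized variable would on its own.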
Option 2: Domain Knowledge
If you know how many categories your variables should group into, specify directly:
# Example: health data with 3 expected categories (lifestyle, genetics, environment)
fa = FactorAnalyzer(n_factors=3, rotation='varimax', method='principal')
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| Loadings all similar | Too few factors | Increase n_factors |
| Negative loadings | Inverse relationship | Normal, interpret direction |
| Low variance explained | Data not suitable for PCA | Check correlations first |
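For the last row, "check correlations first" can be done with a quick summary of the off-diagonal correlations. The sketch below is one way to do it; the helper name `correlation_summary` and the 0.3 cutoff are assumptions (a rule of thumb, not a value taken from this guide).

```python
import numpy as np

def correlation_summary(X, threshold=0.3):
    """Summarize the absolute pairwise correlations between the columns of X."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # Upper triangle without the diagonal: each variable pair counted once
    pairs = np.abs(corr[np.triu_indices_from(corr, k=1)])
    return {
        "mean_abs_r": float(pairs.mean()),
        "max_abs_r": float(pairs.max()),
        "share_above_threshold": float((pairs > threshold).mean()),
    }

rng = np.random.default_rng(7)

# Independent columns: little shared structure for PCA to find
independent = rng.normal(size=(400, 5))

# Columns sharing one signal: strongly correlated, a good candidate for PCA
shared = rng.normal(size=(400, 1))
correlated = shared + 0.5 * rng.normal(size=(400, 5))

print(correlation_summary(independent))
print(correlation_summary(correlated))
```

If almost no pairs exceed the cutoff, the variables are close to independent and a few components cannot be expected to explain much variance.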
Best Practices
- Always standardize data before PCA
- Use varimax rotation for interpretability
- Check factor loadings to name components
- Use Kaiser criterion or domain knowledge for n_factors
- For attribution analysis, run ONE global PCA on all variables
