Grey Relation
by SPIRAL-EDWIN
Grey Relational Analysis for MCM/ICM competitions. Use for small-sample correlation analysis and factor importance ranking. Does not require large samples, normal distribution, or linear relationships. Ideal for analyzing which factors most influence outcomes when data is limited.
Grey Relational Analysis (GRA)
Analyze correlation between factors and outcomes using grey relational grade, especially for small samples.
When to Use
- Small sample size: 4-20 data points (unlike Pearson correlation, which typically needs >30)
- Factor analysis: Which factors most influence the outcome?
- Non-linear relationships: No assumption of linearity
- Multi-factor ranking: Rank importance of multiple influencing factors
- Incomplete information: "Grey" = between black (unknown) and white (known)
Key difference from correlation coefficient:
- No requirement for large samples
- No assumption of linear relationship
- No requirement for normal distribution
This is NOT forecasting: Use grey-forecaster for prediction, use grey-relation for correlation analysis.
Quick Start
Core Concepts
- Reference sequence (母序列, "mother sequence") $X_0$: the outcome/target variable. Example: student test scores.
- Comparison sequences (子序列, "child sequences") $X_i$: the influencing factors. Example: study hours, sleep hours, exercise hours.
- Grey relational coefficient $\xi_i(k)$: similarity at each data point $$\xi_i(k) = \frac{\min_i \min_k |x_0(k) - x_i(k)| + \rho \max_i \max_k |x_0(k) - x_i(k)|}{|x_0(k) - x_i(k)| + \rho \max_i \max_k |x_0(k) - x_i(k)|}$$ where $\rho$ is the resolution coefficient (usually 0.5).
- Grey relational grade $\gamma_i$: overall correlation (average of the coefficients) $$\gamma_i = \frac{1}{n} \sum_{k=1}^n \xi_i(k)$$
- Ranking: higher $\gamma_i$ means stronger correlation with the outcome.
Python Implementation
```python
import numpy as np

# Example: Which factor most affects test scores?
# Columns: [Test Score, Study Hours, Sleep Hours, Exercise Hours]
data = np.array([
    [55, 24, 10, 5],
    [65, 38, 22, 8],
    [75, 40, 18, 6],
    [100, 50, 20, 10]
])

# Step 1: Normalization (dimensionless)
mean = np.mean(data, axis=0)
data_norm = data / mean

# Step 2: Reference sequence (outcome)
Y = data_norm[:, 0]  # Test scores

# Step 3: Comparison sequences (factors)
X = data_norm[:, 1:]  # Study, Sleep, Exercise

# Step 4: Absolute differences |X0 - Xi|
abs_diff = np.abs(X - Y.reshape(-1, 1))

# Step 5: Two-level minimum and maximum
a = np.min(abs_diff)
b = np.max(abs_diff)

# Step 6: Resolution coefficient
rho = 0.5

# Step 7: Grey relational coefficient
gamma_matrix = (a + rho * b) / (abs_diff + rho * b)

# Step 8: Grey relational grade (average over samples)
gamma = np.mean(gamma_matrix, axis=0)

# Step 9: Rank factors
factors = ['Study Hours', 'Sleep Hours', 'Exercise Hours']
ranking = sorted(zip(factors, gamma), key=lambda x: x[1], reverse=True)

print("Grey Relational Grade:")
for factor, grade in ranking:
    print(f"{factor}: {grade:.4f}")
print(f"\nMost important factor: {ranking[0][0]}")
```
MATLAB Implementation
```matlab
% Data: [Test Score, Study Hours, Sleep Hours, Exercise Hours]
A = [55, 24, 10, 5;
     65, 38, 22, 8;
     75, 40, 18, 6;
     100, 50, 20, 10];

% Step 1: Normalization
Mean = mean(A, 1);
A_norm = A ./ Mean;

% Step 2: Reference sequence
Y = A_norm(:, 1);

% Step 3: Comparison sequences
X = A_norm(:, 2:end);

% Step 4: Absolute differences
absX0_Xi = abs(X - repmat(Y, 1, size(X, 2)));

% Step 5: Two-level minimum and maximum
a = min(absX0_Xi(:));
b = max(absX0_Xi(:));

% Step 6: Resolution coefficient
rho = 0.5;

% Step 7: Grey relational coefficient
gamma_matrix = (a + rho * b) ./ (absX0_Xi + rho * b);

% Step 8: Grey relational grade
gamma = mean(gamma_matrix, 1);

% Step 9: Display results
factors = {'Study Hours', 'Sleep Hours', 'Exercise Hours'};
fprintf('Grey Relational Grade:\n');
for i = 1:length(factors)
    fprintf('%s: %.4f\n', factors{i}, gamma(i));
end
[~, idx] = max(gamma);
fprintf('\nMost important factor: %s\n', factors{idx});
```
Standard Procedure
Step 1: Data Preparation
Organize data into matrix:
- Rows: Samples (observations)
- Columns: [Outcome, Factor1, Factor2, ..., FactorN]
Step 2: Normalization (Dimensionless)
Three common methods:
1. Mean normalization (most common): $$x_i'(k) = \frac{x_i(k)}{\bar{x_i}}$$
2. Min-max normalization: $$x_i'(k) = \frac{x_i(k) - \min x_i}{\max x_i - \min x_i}$$
3. Initial value normalization: $$x_i'(k) = \frac{x_i(k)}{x_i(1)}$$
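All three methods are one-liners in NumPy. A minimal sketch (the helper name `normalize` and its `method` keyword are illustrative, not functions from this skill's files):

```python
import numpy as np

def normalize(data, method="mean"):
    """Make each column dimensionless; rows are samples, columns are sequences."""
    if method == "mean":      # x'(k) = x(k) / mean(x)
        return data / data.mean(axis=0)
    if method == "minmax":    # x'(k) = (x(k) - min x) / (max x - min x)
        return (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
    if method == "initial":   # x'(k) = x(k) / x(1)
        return data / data[0, :]
    raise ValueError(f"Unknown method: {method}")
```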
Step 3: Separate Reference and Comparison
- Reference sequence $X_0$: The outcome variable
- Comparison sequences $X_1, X_2, ..., X_m$: The factors
Step 4: Calculate Differences
$$\Delta_i(k) = |x_0(k) - x_i(k)|$$
Step 5: Find Two-Level Extrema
$$a = \min_i \min_k \Delta_i(k)$$ $$b = \max_i \max_k \Delta_i(k)$$
Step 6: Calculate Grey Relational Coefficient
$$\xi_i(k) = \frac{a + \rho b}{\Delta_i(k) + \rho b}$$
where $\rho \in [0, 1]$ is the resolution coefficient (usually 0.5)
Step 7: Calculate Grey Relational Grade
$$\gamma_i = \frac{1}{n} \sum_{k=1}^n \xi_i(k)$$
Step 8: Rank Factors
Sort factors by $\gamma_i$ in descending order.
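Steps 2-7 collapse into a few vectorized lines. A sketch assuming mean normalization (the function name `grey_relational_grade` is ours, not from the skill's example files):

```python
import numpy as np

def grey_relational_grade(data, rho=0.5):
    """data: rows = samples; column 0 = outcome, columns 1.. = factors.
    Returns one grey relational grade per factor."""
    norm = data / data.mean(axis=0)          # Step 2: mean normalization
    y, x = norm[:, 0], norm[:, 1:]           # Step 3: reference vs. comparison
    delta = np.abs(x - y[:, None])           # Step 4: |x0(k) - xi(k)|
    a, b = delta.min(), delta.max()          # Step 5: two-level extrema
    xi = (a + rho * b) / (delta + rho * b)   # Step 6: relational coefficient
    return xi.mean(axis=0)                   # Step 7: relational grade

# Step 8: rank factors by grade, descending
# order = np.argsort(-grey_relational_grade(data))
```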
Examples in Skill
See examples/ folder:
- code1.py / code1.m: Basic grey relational analysis with normalization
- code2.py / code2.m: Alternative implementation
- Inter2Max.m, Mid2Max.m, Min2Max.m: Data preprocessing functions (convert to "larger is better")
- Positivization.m: Positive transformation for indicators
Competition Tips
Problem Recognition (Day 1)
Grey relational analysis indicators:
- "Which factors most influence..."
- "Rank importance of factors"
- Small sample size (< 30 observations)
- No clear linear relationship
- Multiple factors, need to find key drivers
Use GRA when:
- Sample size is small (4-20 points)
- Don't know if relationship is linear
- Want to rank factor importance
- Data doesn't meet assumptions for regression/correlation
Formulation Steps (Day 1-2)
- Define outcome variable: What are you trying to explain?
- Identify factors: What might influence the outcome?
- Collect data: Organize into matrix
- Normalize: Make data dimensionless
- Compute GRA: Follow 8-step procedure
- Interpret: Rank factors by grey relational grade
Implementation Template
```python
import numpy as np

# 1. Prepare data (rows = samples, cols = [outcome, factor1, factor2, ...])
data = np.array([
    [y1, x1_1, x1_2, ...],
    [y2, x2_1, x2_2, ...],
    # ...
])

# 2. Normalize (mean normalization)
mean = np.mean(data, axis=0)
data_norm = data / mean

# 3. Separate reference and comparison sequences
Y = data_norm[:, 0]
X = data_norm[:, 1:]

# 4. Absolute differences
abs_diff = np.abs(X - Y.reshape(-1, 1))

# 5. Two-level extrema
a = np.min(abs_diff)
b = np.max(abs_diff)

# 6. Grey relational coefficient
rho = 0.5
gamma_matrix = (a + rho * b) / (abs_diff + rho * b)

# 7. Grey relational grade
gamma = np.mean(gamma_matrix, axis=0)

# 8. Rank and display
factors = ['Factor1', 'Factor2', 'Factor3']
for factor, grade in sorted(zip(factors, gamma),
                            key=lambda x: x[1], reverse=True):
    print(f"{factor}: {grade:.4f}")
```
Validation (Day 3)
- Check normalization: All sequences should be dimensionless
- Verify ranking: Does it match intuition/domain knowledge?
- Sensitivity to $\rho$: try $\rho = 0.3, 0.5, 0.7$; the ranking should remain stable (see the sketch after this list)
- Compare with correlation: For validation, compute Pearson correlation
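One way to run the $\rho$ sensitivity check on the Quick Start data (a sketch reusing the example numbers above):

```python
import numpy as np

data = np.array([[55, 24, 10, 5],
                 [65, 38, 22, 8],
                 [75, 40, 18, 6],
                 [100, 50, 20, 10]], dtype=float)
factors = ['Study Hours', 'Sleep Hours', 'Exercise Hours']

norm = data / data.mean(axis=0)
delta = np.abs(norm[:, 1:] - norm[:, [0]])

# Recompute the grades at each rho and print the resulting ranking
for rho in (0.3, 0.5, 0.7):
    xi = (delta.min() + rho * delta.max()) / (delta + rho * delta.max())
    order = [factors[i] for i in np.argsort(-xi.mean(axis=0))]
    print(f"rho = {rho}: {' > '.join(order)}")
```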
Common Pitfalls
- Confusing with grey forecasting: GRA is for correlation, not prediction
  - Fix: use grey-forecaster for time series prediction
- Too large a sample: if n > 50, use standard correlation/regression
  - GRA's advantage is specifically for small samples
- Forgot normalization: different units make the comparison meaningless
  - Fix: always normalize first
- Wrong reference sequence: a factor was used as the reference instead of the outcome
  - Fix: reference = outcome variable, comparisons = factors
- Over-interpreting: a high $\gamma_i$ does not imply causation
  - GRA shows correlation, not causation
Advanced: Data Preprocessing
Indicator Direction Transformation
Convert all indicators to "larger is better":
1. Cost-type (smaller is better) → Benefit-type: $$x_i'(k) = \max x_i - x_i(k)$$
2. Interval-type (optimal range $[a, b]$): $$x_i'(k) = \begin{cases} 1 - \dfrac{a - x_i(k)}{M}, & x_i(k) < a \\ 1, & a \leq x_i(k) \leq b \\ 1 - \dfrac{x_i(k) - b}{M}, & x_i(k) > b \end{cases}$$
where $M = \max\{a - \min x_i,\ \max x_i - b\}$
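Python counterparts to the Min2Max.m and Inter2Max.m helpers listed under Examples might look like this (a sketch; the MATLAB files in examples/ remain the reference implementations):

```python
import numpy as np

def min_to_max(x):
    """Cost-type (smaller is better) -> benefit-type: x' = max(x) - x."""
    return x.max() - x

def interval_to_max(x, a, b):
    """Interval-type with optimal range [a, b] -> benefit-type,
    following the piecewise formula above."""
    m = max(a - x.min(), x.max() - b)
    out = np.ones_like(x, dtype=float)       # values inside [a, b] score 1
    out[x < a] = 1 - (a - x[x < a]) / m      # penalize values below the range
    out[x > b] = 1 - (x[x > b] - b) / m      # penalize values above the range
    return out
```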
Comparison with Other Methods
| Method | Sample Size | Assumptions | When to Use |
|---|---|---|---|
| Grey Relational Analysis | Small (4-20) | None | Small sample, non-linear |
| Pearson Correlation | Large (>30) | Linear, normal | Large sample, linear |
| Spearman Correlation | Medium (>10) | Monotonic | Non-linear but monotonic |
| Regression | Large (>30) | Linear, normal | Prediction, causation |
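To run the comparison suggested in the validation step, classical correlations can be computed on the same data with SciPy; at n = 4 the p-values carry no weight, so treat this purely as a sanity check (a sketch):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

data = np.array([[55, 24, 10, 5],
                 [65, 38, 22, 8],
                 [75, 40, 18, 6],
                 [100, 50, 20, 10]], dtype=float)
y = data[:, 0]

for name, x in zip(['Study Hours', 'Sleep Hours', 'Exercise Hours'], data[:, 1:].T):
    r, _ = pearsonr(x, y)   # assumes linearity; fragile at n = 4
    s, _ = spearmanr(x, y)  # rank-based, monotonic only
    print(f"{name}: Pearson r = {r:.3f}, Spearman rho = {s:.3f}")
```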
References
- references/5-灰色关联分析【微信公众号丨B站:数模加油站】.pdf: complete tutorial (in Chinese)
- Deng, J. (1982). "Control problems of grey systems." Systems & Control Letters, 1(5), 288-294.
Time Budget
- Data preparation: 15-30 min
- Normalization: 10-15 min
- Implementation: 15-25 min
- Validation: 20-30 min
- Total: 1-1.5 hours
Related Skills
- grey-forecaster: grey prediction (GM(1,1)) for time series
- pca-analyzer: alternative for factor analysis (large samples)
- entropy-weight-method: objective weighting based on data variation
- fuzzy-evaluation: for qualitative factor evaluation