---
name: data-analyst
description: A professional analysis workflow that uses the Python ecosystem (Jupyter, Pandas, Scikit-learn) to derive deep insights from data.
---

Data Analyst Workflow

A professional analysis workflow that uses the Python ecosystem (Jupyter, Pandas, Scikit-learn) to derive deep insights from data. It follows the OSEMN methodology and the standards in SKILL.md.

Step 0: Core Principles

  1. Explanation First: Before writing any code, describe in Markdown what you are about to analyze and why.
  2. Interpretation Mandatory: Below every code cell's output, add a Markdown cell that interprets in detail what the statistics or graphs mean.
    • Bad: a graph left standing on its own.
    • Good: "The histogram's long right tail (skewness > 1) suggests that a log transform is needed."
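
A minimal illustration of the Good pattern above; the `price` column and the data are hypothetical:

```python
# Hypothetical example: check skewness, then interpret it in the
# Markdown cell below this output.
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.lognormal(mean=3.0, sigma=1.0, size=1000)})

skew = df["price"].skew()
print(f"Skewness: {skew:.2f}")
if skew > 1:  # strong right tail -> log transform is warranted
    df["log_price"] = np.log1p(df["price"])
    print(f"Skewness after log1p: {df['log_price'].skew():.2f}")
```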

Step 1: Environment & Goal Definition

  1. Context Loading:
    • Load this document and review the **'Core Principles'**.
    • Methodology Screening:
      • Scan the **'Methodology Master List'** in SKILL.md.
      • Compare the dataset's characteristics (type, shape, quality) against the analysis goal and filter the list down to three or more candidates.
      • [Check]: Is the reason for selecting each candidate stated in one line? (e.g., "Many missing values, so Random Forest is adopted")
  2. Define Objective:
    • Use resources/plan-template.md to write docs/plans/ANALYSIS_[topic].md.
    • Define **business impact (KPIs)** as the success criteria, not just raw statistics or accuracy.
  3. Notebook Setup:
    • Create or open docs/notebooks/[Topic]_Analysis.ipynb.
    • AI Readability: all statistics (describe, p-values) and graphs must remain in the notebook's output cells (No Hiding).

Step 2: Methodology Screening

"The Right Tool for the Job"

Select the best-suited techniques from the 'Methodology Master List' in SKILL.md based on the data's characteristics, and record them in the plan.

  • Metric Selection: choose evaluation metrics based on the business goal and whether the classes are imbalanced. (e.g., imbalanced → F1/Recall/PR-AUC; regression → MAPE/RMSE)
  • Preprocessing:
    • Outliers: with many outliers, choose RobustScaler.
    • High Cardinality: with too many category levels, consider Target Encoding.
    • Imbalance: decide whether to apply SMOTE or class weights.
  • Modeling:
    • Baseline: Logistic/Linear Regression, Decision Tree.
    • Advanced: choose among CatBoost (categorical), XGBoost/LightGBM (large data), Isolation Forest (anomaly detection), etc.
  • Validation Strategy (see the sketch after this list):
    • Time series: always use Time Series Split.
    • Imbalanced: Stratified K-Fold is mandatory.
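
A minimal sketch of the validation-strategy rule; the helper name choose_splitter is ours, not part of the skill:

```python
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

def choose_splitter(is_time_series: bool, n_splits: int = 5):
    """Time-ordered data must never see the future; imbalanced labels
    need stratification so every fold contains the minority class."""
    if is_time_series:
        return TimeSeriesSplit(n_splits=n_splits)
    return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
```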

Step 3: Data Loading & Quality Validation (Obtain & Scrub)

This step secures data reliability to prevent "garbage in, garbage out."

  1. Smart Loading:
    • Excel: use openpyxl when formulas must be preserved, and take care not to hardcode calculated values.
    • CSV: use automatic detection for the encoding (utf-8, cp949) and the delimiter.
  2. Data Profiling (see the sketch after this list):
    • Check basic statistics (describe()), missing values (isnull().sum()), and data types (dtypes).
    • [Markdown]: summarize the observed missing values and data types in text.
  3. Deep Sanity Check:
    • Inspect the data closely against the 'Logical Failures' items in SKILL.md.
    • Impossible-value detection: identify data that defies common sense, such as negative ages (age < 0) or future dates.
  4. Strategic Cleaning:
    • When handling missing values and outliers, apply the imputation methodology screened earlier (Simple, KNN, Iterative).
    • State the handling criterion (drop vs. impute) and its rationale.
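
A sketch of steps 2-3 above; the path and the columns (age, event_date) are hypothetical placeholders:

```python
import pandas as pd

df = pd.read_csv("data/raw.csv")  # hypothetical path

# Step 2: Data Profiling — keep every output visible (No Hiding)
print(df.describe())
print(df.isnull().sum())
print(df.dtypes)

# Step 3: Deep Sanity Check — flag logically impossible values
negative_ages = df[df["age"] < 0]
future_dates = df[pd.to_datetime(df["event_date"]) > pd.Timestamp.now()]
print(f"{len(negative_ages)} negative ages, {len(future_dates)} future dates")
```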

Step 4: Hypothesis-Driven EDA

Avoid a mere parade of plots; keep to a Question -> Visualization -> Finding flow.

  1. Univariate Analysis: check each variable's distribution, skewness, and kurtosis.
    • [Markdown]: describe each variable's distribution and whether it contains outliers.
  2. Bivariate Analysis: visualize relationships between variables with a correlation matrix and scatter plots.
    • [Markdown]: interpret the linear/nonlinear relationships between variables.
  3. Statistical Validation (see the sketch after this list):
    • Confirm visual findings with statistical tests (normality test, t-test, ANOVA) to show they are not coincidences.
    • [Markdown]: state whether the null hypothesis is rejected, citing the p-value.
  4. Insight Logging:
    • Immediately record how any important mid-analysis finding would affect the business KPIs.
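
A sketch of the Statistical Validation step; `df` stands for the frame loaded earlier, and the `group` and `value` columns are hypothetical:

```python
from scipy import stats

a = df.loc[df["group"] == "A", "value"]
b = df.loc[df["group"] == "B", "value"]

# Normality first (Shapiro-Wilk); the t-test below assumes roughly normal data.
print("Shapiro p-values:", stats.shapiro(a).pvalue, stats.shapiro(b).pvalue)

# Welch's t-test for a difference in group means.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# The Markdown cell below should state whether p < 0.05 rejects the
# null hypothesis of equal means.
```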

Step 5: Model & Interpret [Optional]

  1. Baseline Model:
    • Set a performance floor with the simplest model (Dummy, Linear) and verify that later models beat it.
  2. Feature Engineering:
    • Apply techniques from the SKILL.md list (encoding, scaling, PCA).
  3. Advanced Modeling & Tuning:
    • Model Selection: train the selected high-performance models (XGBoost, CatBoost, RF, etc.).
    • Hyperparameter Tuning: go beyond plain Grid Search and optimize efficiently with tools such as Optuna.
    • Overfitting Check: compare the cross-validation score against the hold-out test score. (Warn if the gap exceeds 5%.)
    • Clustering: when using K-Means or similar, determine the optimal number of clusters (K) with the Elbow Method and the Silhouette Score.
  4. Rigorous Validation:
    • Use Stratified K-Fold to prevent overfitting and evaluate generalization performance.
  5. Interpretation & Error Analysis (see the sketch after this list):
    • Feature Importance: explain the model's reasoning with SHAP, Permutation Importance, etc.
    • Error Analysis: manually inspect the 'Top 10 Worst Errors' (the ten samples the model got most wrong) to find their cause.
  6. Reporting:
    • Translate technical metrics (accuracy, MSE) into business language (expected revenue, risk reduction).
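
A sketch of step 5 for a regression model; model, X_test, and y_test are hypothetical objects from earlier cells, with y_test assumed to be a pandas Series:

```python
import numpy as np
from sklearn.inspection import permutation_importance

# Which features drive the predictions?
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X_test.columns[idx]}: {result.importances_mean[idx]:.4f}")

# Top 10 Worst Errors: inspect the samples the model got most wrong
errors = np.abs(y_test - model.predict(X_test))
print(X_test.loc[errors.sort_values(ascending=False).head(10).index])
```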

Standards & Rules

Data Analyst Standards (OSEMN)

Purpose

To transform raw data into actionable business insights using a rigorous, hypothesis-driven approach.

Core Philosophy

"No Hiding" AI Readability: All statistical outputs (describe, corr, p-value) and graphs must remain in the notebook output. This ensures tools like NotebookLM can contextually understand the analysis.

Quality Standards (Tier 1 Best Practices)

1. Data Integrity (Obtain & Scrub)

"Garbage In, Garbage Out"

  • Data Source Verification: Check file extension, size, and metadata.
  • Data Quality Check:
    • Missing Values: Identify mechanism (MCAR, MAR, MNAR) before imputing.
    • Logical Failures: Check for impossible values (e.g., age < 0, future dates), Data Leakage, and Overfitting indicators.
  • Data Types: Ensure numeric columns are not stored as strings, etc.
  • File Handling Standards:
    • Excel (.xlsx): Preserve existing formatting/formulas. Use openpyxl for editing, pandas for reading. Zero Hardcoding of calculated values.
    • CSV (.csv): Detect delimiter automatically (csv.Sniffer). Handle encoding errors (utf-8 vs cp949) explicitly.
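
A sketch of the CSV standard above: sniff the delimiter and fall back across encodings explicitly (utf-8 first, then cp949). The helper name read_csv_robust is ours, not part of the skill:

```python
import csv
import pandas as pd

def read_csv_robust(path: str) -> pd.DataFrame:
    for encoding in ("utf-8", "cp949"):
        try:
            with open(path, encoding=encoding, newline="") as f:
                dialect = csv.Sniffer().sniff(f.read(4096))
            return pd.read_csv(path, encoding=encoding, sep=dialect.delimiter)
        except (UnicodeDecodeError, csv.Error):
            continue
    raise ValueError(f"could not parse {path} with known encodings/delimiters")
```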

2. Hypothesis Driven EDA (Explore)

"Ask, Don't just Plot"

  • Univariate Analysis: Distribution of each key variable (Histogram/Boxplot). Check for Skewness/Kurtosis.
  • Bivariate Analysis: Correlation matrix, Scatter plots for relationships.
  • Statistical Validation:
    • Normality Test: Shapiro-Wilk or K-S test.
    • Significance: T-test/ANOVA for group differences.
  • Insight Logging: Record the implication of every finding immediately.

3. Rigorous Modeling (Model)

"Trust but Verify"

  • Baseline First: Compare complex models against a Dummy/Logistic Baseline.
  • Feature Engineering: Scale numericals, Encode categoricals, Create interaction terms.
  • Cross-Validation: Use Stratified K-Fold to prevent overfitting.
  • Metric Selection: Optimize for business KPI (not just Accuracy).
  • Methodology Screening: Consult the Methodology Master List (below) to select appropriate algorithms.
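
A sketch of "Baseline First" under the same Stratified K-Fold protocol; X and y are hypothetical features/labels:

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
baseline_f1 = cross_val_score(DummyClassifier(strategy="most_frequent"),
                              X, y, cv=cv, scoring="f1").mean()
model_f1 = cross_val_score(RandomForestClassifier(random_state=42),
                           X, y, cv=cv, scoring="f1").mean()
print(f"baseline F1: {baseline_f1:.3f} | model F1: {model_f1:.3f}")
```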

4. Interpretation (Interpret)

"Why did it predict that?"

  • Feature Importance: SHAP values or Permutation Importance.
  • Error Analysis: Manually inspect the "Top 10 Worst Errors".
  • Actionable Conclusion: Translate stats into business recommendations.

Checklist (Quality Gate)

Before finalizing:

  • Reproducibility: Can the notebook run from top to bottom without error?
  • Storytelling: Does the notebook flow like a narrative?
  • Visuals: Are all graphs labeled (Title, Axis, Legend)?

Appendix: Methodology Master List (Reference)

Scan these tables to select the most appropriate methodology for your data and goal.

1. Preprocessing & Data Cleaning

| Methodology | Usage / Purpose | Data Constraints |
| --- | --- | --- |
| Simple Imputation | Missing value imputation (simple replacement) | Mean/Median (numeric), Mode (categorical) |
| KNN Imputation | Missing value imputation (similarity-based) | Mainly numeric; useful when correlations exist |
| Iterative Imputation | Missing value imputation (model-based) | High variable correlation; assumes MAR |
| One-Hot Encoding | Categorical to numeric | Nominal data, low cardinality |
| Label Encoding | Categorical to numeric | Ordinal data |
| Target Encoding | Categorical to numeric | High-cardinality features; risk of overfitting |
| Standard Scaler | Scaling (standardization) | Sensitive to outliers; assumes Gaussian distribution |
| MinMax Scaler | Scaling (normalization) | Bounded data; distribution-agnostic |
| Robust Scaler | Scaling (robust to outliers) | Data with many outliers (uses Median/IQR) |
| SMOTE | Oversampling (imbalanced data) | Synthesizes minority-class samples (training set ONLY) |
| PCA | Dimensionality reduction, multicollinearity removal | Continuous variables; assumes linear relationships |
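
A sketch of the imputation rows above; `X_num` is a hypothetical all-numeric DataFrame with missing values:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

imputers = {
    "simple": SimpleImputer(strategy="median"),      # fast; ignores correlations
    "knn": KNNImputer(n_neighbors=5),                # exploits similar rows
    "iterative": IterativeImputer(random_state=42),  # model-based; assumes MAR
}
X_imputed = imputers["knn"].fit_transform(X_num)
```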

2. Machine Learning Models

| Methodology | Type | Usage / Purpose | Constraints / Notes |
| --- | --- | --- | --- |
| Linear Regression | Regression | Baseline for regression | Assumes a linear relationship |
| Logistic Regression | Classification | Baseline for classification | Assumes linear separability; large sparse data OK |
| SVM / SVR | Class/Reg | High accuracy in high-dimensional spaces | Computationally expensive (O(n^3)); scale-sensitive |
| K-Nearest Neighbors | Class/Reg | Instance-based learning, simple | Scale-sensitive; small data |
| Random Forest | Ensemble | Robust classification/regression | Handles mixed types; robust to outliers/missing values |
| XGBoost / LightGBM | Ensemble | High performance | Large datasets; handles missing values internally |
| CatBoost | Ensemble | Best for categorical features | Handles categories automatically; slower training |
| Isolation Forest | Anomaly Detection | Outlier/anomaly detection | High-dimensional data; efficient |
| K-Means | Clustering | Partitioning into K clusters | Spherical clusters; sensitive to outliers; scale-sensitive |
| DBSCAN | Clustering | Density-based clustering, detects outliers | Arbitrary shapes; scale-sensitive; choosing epsilon is hard |
| Hierarchical Clustering | Clustering | Dendrogram visualization | Computationally expensive for large data |
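
A sketch of the anomaly-detection row above; `X` is a hypothetical scaled feature matrix:

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, random_state=42)
labels = iso.fit_predict(X)  # -1 = anomaly, +1 = normal
print(f"Flagged {(labels == -1).sum()} of {len(X)} rows as anomalies")
```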

3. Deep Learning Models

| Methodology | Usage / Purpose | Data Constraints |
| --- | --- | --- |
| CNN | Image/pattern recognition | Grid-like data (images, etc.) |
| RNN / LSTM | Sequence/time-series prediction | Sequential data |
| Transformer | NLP, complex pattern matching | Long sequences; large-scale data |

4. Validation & Optimization

| Methodology | Type | Usage / Purpose | Notes |
| --- | --- | --- | --- |
| Stratified K-Fold | Validation | Cross-validation (generalization) | Essential for imbalanced class distributions |
| K-Fold CV | Validation | Cross-validation | Sufficient data; balanced classes |
| Time Series Split | Validation | Cross-validation (temporal) | No future-data leakage (essential for time series) |
| Grid Search | Tuning | Hyperparameter optimization | Small search space (exhaustive) |
| Bayesian Optimization | Tuning | Hyperparameter optimization | Large search space; high evaluation cost |
| Optuna | Tuning | Next-gen hyperparameter optimization | Efficient; define-by-run; pruning capabilities |
| L1 (Lasso) | Regularization | Sparse model, feature selection | When a sparse solution is needed |
| L2 (Ridge) | Regularization | Prevent overfitting, weight decay | When high multicollinearity exists |
| ElasticNet | Regularization | Combination of L1 and L2 | When both feature selection and regularization are needed |
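
A sketch of the Optuna row; X and y are hypothetical, and GradientBoostingClassifier stands in for XGBoost/LightGBM:

```python
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space only
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    return cross_val_score(model, X, y, cv=5, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```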

5. Interpretation

| Methodology | Usage / Purpose | Notes |
| --- | --- | --- |
| SHAP | Explain model predictions | TreeSHAP is fast for tree-based models; KernelSHAP is model-agnostic but slow |
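
A sketch of SHAP usage, assuming the shap package, a trained tree model `model`, and a feature DataFrame `X` (all hypothetical here):

```python
import shap

explainer = shap.TreeExplainer(model)   # fast TreeSHAP path for tree ensembles
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)       # global feature-impact overview
```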

Appendix: Evaluation Metrics Guide

Select metrics based on your problem type and business goal.

Classification Metrics

| Metric | Focus | When to use |
| --- | --- | --- |
| Accuracy | Overall correctness | Balanced datasets only; misleading for imbalanced data |
| Precision | False-positive reduction | When FPs are costly (e.g., spam filter) |
| Recall | False-negative reduction | When FNs are critical (e.g., cancer diagnosis, fraud) |
| F1 Score | Balance | When you need a balance between precision and recall |
| ROC-AUC | Ranking quality | When you need robust performance across thresholds |
| Log Loss | Probability confidence | When the predicted probability itself matters |
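
A sketch of the classification metrics; y_true, y_pred, and y_proba (positive-class probabilities) are hypothetical arrays:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

print("accuracy :", accuracy_score(y_true, y_pred))    # misleading if imbalanced
print("precision:", precision_score(y_true, y_pred))   # when FPs are costly
print("recall   :", recall_score(y_true, y_pred))      # when FNs are critical
print("f1       :", f1_score(y_true, y_pred))          # precision/recall balance
print("roc_auc  :", roc_auc_score(y_true, y_proba))    # threshold-free ranking
print("log_loss :", log_loss(y_true, y_proba))         # probability quality
```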

Regression Metrics

| Metric | Focus | When to use |
| --- | --- | --- |
| MSE | Large-error penalty | When outliers/large errors should be heavily penalized |
| RMSE | Interpretability | When you need the error in the same unit as the target |
| MAE | Robustness | When you want robustness against outliers |
| R2 Score | Explainability | To see how much variance the model explains |
| MAPE | Business interpretability | Error as a percentage (%); easy for stakeholders |
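
A sketch of the regression metrics; y_true and y_pred are hypothetical arrays:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, mean_absolute_percentage_error)

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)                                  # penalizes large errors
print("RMSE:", np.sqrt(mse))                         # same unit as the target
print("MAE :", mean_absolute_error(y_true, y_pred))  # robust to outliers
print("R2  :", r2_score(y_true, y_pred))             # variance explained
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))  # % for stakeholders
```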

Clustering Metrics (Unsupervised)

| Metric | Focus | When to use |
| --- | --- | --- |
| Silhouette Score | Cluster separation | To measure how similar an object is to its own cluster versus other clusters |
| Davies-Bouldin | Cluster compactness | Lower is better; good for comparing clustering algorithms |
| Elbow Method | Optimal K | To find the inflection point (optimal K) in K-Means |
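
A sketch of the clustering metrics: scan K and compare inertia (Elbow) against the Silhouette Score; `X` is a hypothetical scaled matrix:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"K={k}: inertia={km.inertia_:.0f}, "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```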
