Microdf
by PolicyEngine
Weighted pandas DataFrames for survey microdata analysis - inequality, poverty, and distributional calculations
Skill Details
Repository Files
1 file in this skill directory
name: microdf description: Weighted pandas DataFrames for survey microdata analysis - inequality, poverty, and distributional calculations
MicroDF
MicroDF provides weighted pandas DataFrames and Series for analyzing survey microdata, with built-in support for inequality and poverty calculations.
For Users 👥
What is MicroDF?
When you see poverty rates, Gini coefficients, or distributional charts in PolicyEngine, those are calculated using MicroDF.
MicroDF powers:
- Poverty rate calculations (SPM)
- Inequality metrics (Gini coefficient)
- Income distribution analysis
- Weighted statistics from survey data
Understanding the Metrics
Gini coefficient:
- Calculated using MicroDF from weighted income data
- Ranges from 0 (perfect equality) to 1 (perfect inequality)
- US typically around 0.48
Poverty rates:
- Calculated using MicroDF with weighted household data
- Compares income to poverty thresholds
- Accounts for household composition
Percentiles:
- MicroDF calculates weighted percentiles
- Shows income distribution (10th, 50th, 90th percentile)
For Analysts 📊
Installation
pip install microdf-python
Quick Start
import microdf as mdf
import pandas as pd
# Create sample data
df = pd.DataFrame({
'income': [10000, 20000, 30000, 40000, 50000],
'weights': [1, 2, 3, 2, 1]
})
# Create MicroDataFrame
mdf_df = mdf.MicroDataFrame(df, weights='weights')
# All operations are weight-aware
print(f"Weighted mean: ${mdf_df.income.mean():,.0f}")
print(f"Gini coefficient: {mdf_df.income.gini():.3f}")
Common Operations
Weighted statistics:
mdf_df.income.mean() # Weighted mean
mdf_df.income.median() # Weighted median
mdf_df.income.sum() # Weighted sum
mdf_df.income.std() # Weighted standard deviation
Inequality metrics:
mdf_df.income.gini() # Gini coefficient
mdf_df.income.top_x_pct_share(10) # Top 10% share
mdf_df.income.top_x_pct_share(1) # Top 1% share
Poverty analysis:
# Poverty rate (income < threshold)
poverty_rate = mdf_df.poverty_rate(
income_measure='income',
threshold=poverty_line
)
# Poverty gap (how far below threshold)
poverty_gap = mdf_df.poverty_gap(
income_measure='income',
threshold=poverty_line
)
# Deep poverty (income < 50% of threshold)
deep_poverty_rate = mdf_df.deep_poverty_rate(
income_measure='income',
threshold=poverty_line,
deep_poverty_line=0.5
)
Quantiles:
# Deciles
mdf_df.income.decile_values()
# Quintiles
mdf_df.income.quintile_values()
# Custom quantiles
mdf_df.income.quantile(0.25) # 25th percentile
MicroSeries
# Extract a Series with weights
income_series = mdf_df.income # This is a MicroSeries
# MicroSeries operations
income_series.mean()
income_series.gini()
income_series.percentile(50)
Working with PolicyEngine Results
import microdf as mdf
from policyengine_us import Simulation
# Run simulation with axes (multiple households)
situation_with_axes = {...} # See policyengine-us-skill
sim = Simulation(situation=situation_with_axes)
# Get results as arrays
incomes = sim.calculate("household_net_income", 2024)
weights = sim.calculate("household_weight", 2024)
# Create MicroDataFrame
df = pd.DataFrame({'income': incomes, 'weight': weights})
mdf_df = mdf.MicroDataFrame(df, weights='weight')
# Calculate metrics
gini = mdf_df.income.gini()
poverty_rate = mdf_df.poverty_rate('income', threshold=15000)
print(f"Gini: {gini:.3f}")
print(f"Poverty rate: {poverty_rate:.1%}")
For Contributors 💻
Repository
Location: PolicyEngine/microdf
Clone:
git clone https://github.com/PolicyEngine/microdf
cd microdf
Current Implementation
To see current API:
# Main classes
cat microdf/microframe.py # MicroDataFrame
cat microdf/microseries.py # MicroSeries
# Key modules
cat microdf/generic.py # Generic weighted operations
cat microdf/inequality.py # Gini, top shares
cat microdf/poverty.py # Poverty metrics
To see all methods:
# MicroDataFrame methods
grep "def " microdf/microframe.py
# MicroSeries methods
grep "def " microdf/microseries.py
Testing
To see test patterns:
ls tests/
cat tests/test_microframe.py
Run tests:
make test
# Or
pytest tests/ -v
Contributing
Before contributing:
- Check if method already exists
- Ensure it's weighted correctly
- Add tests
- Follow policyengine-standards-skill
Common contributions:
- New inequality metrics
- New poverty measures
- Performance optimizations
- Bug fixes
Advanced Patterns
Custom Aggregations
# Define custom weighted aggregation
def weighted_operation(series, weights):
return (series * weights).sum() / weights.sum()
# Apply to MicroSeries
result = weighted_operation(mdf_df.income, mdf_df.weights)
Groupby Operations
# Group by with weights
grouped = mdf_df.groupby('state')
state_means = grouped.income.mean() # Weighted means by state
Inequality Decomposition
To see decomposition methods:
grep -A 20 "def.*decomp" microdf/
Integration Examples
Example 1: PolicyEngine Blog Post Analysis
# Pattern from PolicyEngine blog posts
import microdf as mdf
# Get simulation results
baseline_income = baseline_sim.calculate("household_net_income", 2024)
reform_income = reform_sim.calculate("household_net_income", 2024)
weights = baseline_sim.calculate("household_weight", 2024)
# Create MicroDataFrame
df = pd.DataFrame({
'baseline_income': baseline_income,
'reform_income': reform_income,
'weight': weights
})
mdf_df = mdf.MicroDataFrame(df, weights='weight')
# Calculate impacts
baseline_gini = mdf_df.baseline_income.gini()
reform_gini = mdf_df.reform_income.gini()
print(f"Gini change: {reform_gini - baseline_gini:+.4f}")
Example 2: Poverty Analysis
# Calculate poverty under baseline and reform
from policyengine_us import Simulation
baseline_sim = Simulation(situation=situation)
reform_sim = Simulation(situation=situation, reform=reform)
# Get incomes
baseline_income = baseline_sim.calculate("spm_unit_net_income", 2024)
reform_income = reform_sim.calculate("spm_unit_net_income", 2024)
spm_threshold = baseline_sim.calculate("spm_unit_poverty_threshold", 2024)
weights = baseline_sim.calculate("spm_unit_weight", 2024)
# Calculate poverty rates
df_baseline = mdf.MicroDataFrame(
pd.DataFrame({'income': baseline_income, 'threshold': spm_threshold, 'weight': weights}),
weights='weight'
)
poverty_baseline = (df_baseline.income < df_baseline.threshold).mean() # Weighted
# Similar for reform
print(f"Poverty reduction: {(poverty_baseline - poverty_reform):.1%}")
Package Status
Maturity: Stable, production-ready API stability: Stable (rarely breaking changes) Performance: Optimized for large datasets
To see version:
pip show microdf-python
To see changelog:
cat CHANGELOG.md # In microdf repo
Related Skills
- policyengine-us-skill - Generating data for microdf analysis
- policyengine-analysis-skill - Using microdf in policy analysis
- policyengine-us-data-skill - Data sources for microdf
Resources
Repository: https://github.com/PolicyEngine/microdf PyPI: https://pypi.org/project/microdf-python/ Issues: https://github.com/PolicyEngine/microdf/issues
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
Xlsx
Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.
