Data Journalism

by NeverSight


Data journalism workflows for analysis, visualization, and storytelling. Use when analyzing datasets, creating charts and maps, cleaning messy data, calculating statistics, or building data-driven stories. Essential for reporters, newsrooms, and researchers working with quantitative information.

Skill Details


name: data-journalism

Data journalism methodology

Systematic approaches for finding, analyzing, and presenting data in journalism.

Data acquisition

Public data sources

## Federal data sources

### General
- Data.gov - Federal open data portal
- Census Bureau (census.gov) - Demographics, economic data
- BLS (bls.gov) - Employment, inflation, wages
- BEA (bea.gov) - GDP, economic accounts
- Federal Reserve (federalreserve.gov) - Financial data
- SEC EDGAR - Corporate filings

### Specific domains
- EPA (epa.gov/data) - Environmental data
- FDA (fda.gov/data) - Drug approvals, recalls, adverse events
- CDC WONDER - Health statistics
- NHTSA - Vehicle safety data
- DOT - Transportation statistics
- FEC - Campaign finance
- USASpending.gov - Federal contracts and grants

### State and local
- State open data portals (search: "[state] open data")
- Socrata-powered sites (many cities/states)
- OpenStreetMap, municipal GIS portals
- State comptroller/auditor reports

Data request strategies

## Getting data that isn't public

### FOIA for datasets
- Request databases, not just documents
- Ask for data dictionary/schema
- Request in native format (CSV, SQL dump)
- Specify field-level needs

### Building your own dataset
- Scraping public information
- Crowdsourcing from readers
- Systematic document review
- Surveys (with proper methodology)
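Scraping public information often starts with pulling tables out of HTML. A minimal sketch using only the standard library's `html.parser`; the markup here is a made-up stand-in, not a real source:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row:
            self.rows.append(self._row)
        elif tag in ('td', 'th'):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Hypothetical markup standing in for a fetched page
html = """
<table>
  <tr><th>Agency</th><th>Grants</th></tr>
  <tr><td>EPA</td><td>120</td></tr>
  <tr><td>DOT</td><td>85</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(html)
print(parser.rows)  # [['Agency', 'Grants'], ['EPA', '120'], ['DOT', '85']]
```

For large or messy pages a dedicated library (e.g. BeautifulSoup) is more robust, but the stdlib version avoids any install step.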

### Commercial data sources (for newsrooms)
- LexisNexis
- Refinitiv
- Bloomberg
- Industry-specific databases

Data cleaning and preparation

Common data problems

import pandas as pd
import numpy as np

# Load messy data
df = pd.read_csv('raw_data.csv')

# 1. INCONSISTENT FORMATTING
# Problem: Names in different formats
# "SMITH, JOHN" vs "John Smith" vs "smith john"

def standardize_name(name):
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().lower()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.title()

df['name_clean'] = df['name'].apply(standardize_name)


# 2. DATE INCONSISTENCIES
# Problem: Dates in multiple formats
# "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"

def parse_date(date_str):
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None

    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]

    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue

    # Fall back to pandas' flexible parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None

df['date_clean'] = df['date'].apply(parse_date)


# 3. MISSING VALUES
# Strategy depends on context

# Check missing value patterns
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

# Options:
# - Drop rows with critical missing values
df_clean = df.dropna(subset=['required_field'])

# - Fill with appropriate values
df['category'] = df['category'].fillna('Unknown')
df['amount'] = df['amount'].fillna(df['amount'].median())

# - Flag as missing (preserve for analysis)
df['amount_missing'] = df['amount'].isna()


# 4. DUPLICATES
# Find and handle duplicates

# Exact duplicates
print(f"Exact duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Fuzzy duplicates (similar but not identical)
# Use record linkage or manual review
from fuzzywuzzy import fuzz

def find_similar_names(names, threshold=85):
    """Find potentially duplicate names."""
    duplicates = []
    for i, name1 in enumerate(names):
        for j, name2 in enumerate(names[i+1:], i+1):
            score = fuzz.ratio(str(name1).lower(), str(name2).lower())
            if score >= threshold:
                duplicates.append((name1, name2, score))
    return duplicates
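fuzzywuzzy is a third-party package (now published as thefuzz). When installing it isn't an option, the standard library's difflib gives a comparable 0-100 similarity score:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score, analogous to fuzz.ratio."""
    return round(SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio() * 100)

print(similarity("John Smith", "Jon Smith"))    # high score: likely duplicate
print(similarity("John Smith", "Maria Lopez"))  # low score: distinct names
```

The scores are not identical to fuzz.ratio, so re-tune the threshold if you swap one in for the other.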


# 5. OUTLIERS
# Identify potential data entry errors

def flag_outliers(series, method='iqr', threshold=1.5):
    """Flag statistical outliers (threshold is an IQR multiplier; use ~3 for z-scores)."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold

df['amount_outlier'] = flag_outliers(df['amount'])
print(f"Outliers found: {df['amount_outlier'].sum()}")


# 6. DATA TYPE CORRECTIONS
# Ensure proper types for analysis

# Convert to numeric (handling errors)
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Convert to categorical (saves memory, enables ordering)
df['status'] = pd.Categorical(df['status'],
                              categories=['Pending', 'Active', 'Closed'],
                              ordered=True)

# Convert to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')

Data validation checklist

## Pre-analysis data validation

### Structural checks
- [ ] Row count matches expected
- [ ] Column count and names correct
- [ ] Data types appropriate
- [ ] No unexpected null columns

### Content checks
- [ ] Date ranges make sense
- [ ] Numeric values within expected bounds
- [ ] Categorical values match expected options
- [ ] Geographic data resolves correctly
- [ ] IDs are unique where expected

### Consistency checks
- [ ] Totals add up to expected values
- [ ] Cross-tabulations balance
- [ ] Related fields are consistent
- [ ] Time series is continuous

### Source verification
- [ ] Can trace back to original source
- [ ] Methodology documented
- [ ] Known limitations noted
- [ ] Update frequency understood
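Several of the structural and content checks above can be automated in pandas. A sketch, with hypothetical column names and bounds:

```python
import pandas as pd

def validate(df, expected_cols, id_col=None, bounds=None):
    """Run basic pre-analysis checks; return a list of problem descriptions."""
    problems = []
    # Structural: expected columns present, no entirely-null columns
    missing = set(expected_cols) - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    for col in df.columns:
        if df[col].isna().all():
            problems.append(f"column '{col}' is entirely null")
    # Content: unique IDs, numeric values within expected bounds
    if id_col and df[id_col].duplicated().any():
        problems.append(f"duplicate values in '{id_col}'")
    for col, (lo, hi) in (bounds or {}).items():
        bad = df[col].dropna().between(lo, hi).eq(False).sum()
        if bad:
            problems.append(f"{bad} value(s) in '{col}' outside [{lo}, {hi}]")
    return problems

# Hypothetical example: duplicate ID and a negative amount
df = pd.DataFrame({'id': [1, 2, 2], 'amount': [50, -10, 200]})
print(validate(df, ['id', 'amount'], id_col='id', bounds={'amount': (0, 1000)}))
```

Running the checks as code makes them repeatable when the source publishes an updated file.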

Statistical analysis for journalism

Basic statistics with context

# Essential statistics for any dataset
def describe_for_journalism(df, column):
    """Generate journalist-friendly statistics."""
    stats = {
        'count': len(df[column].dropna()),
        'missing': df[column].isna().sum(),
        'min': df[column].min(),
        'max': df[column].max(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
    }

    # Percentiles for context
    stats['25th_percentile'] = df[column].quantile(0.25)
    stats['75th_percentile'] = df[column].quantile(0.75)
    stats['90th_percentile'] = df[column].quantile(0.90)
    stats['99th_percentile'] = df[column].quantile(0.99)

    # Distribution shape
    stats['skewness'] = df[column].skew()

    return stats

# Example interpretation
# (expressions inside f-string braces cannot span lines before Python 3.12,
#  so compute the descriptive phrases first)
stats = describe_for_journalism(df, 'salary')
mean_vs_median = 'higher' if stats['mean'] > stats['median'] else 'lower'
skew_desc = ('right-skewed (pulled up by high earners)'
             if stats['skewness'] > 0 else 'left-skewed')
print(f"""
SALARY ANALYSIS
---------------
We analyzed {stats['count']:,} salary records.

The median salary is ${stats['median']:,.0f}, meaning half of workers
earn more and half earn less.

The average salary is ${stats['mean']:,.0f}, which is
{mean_vs_median} than the median, indicating the distribution is
{skew_desc}.

The top 10% of earners make at least ${stats['90th_percentile']:,.0f}.
The top 1% make at least ${stats['99th_percentile']:,.0f}.
""")

Comparisons and context

# Year-over-year change
def calculate_change(current, previous):
    """Calculate change with multiple metrics."""
    absolute = current - previous
    if previous != 0:
        percent = (current - previous) / previous * 100
    else:
        # Percent change from zero is undefined; use signed infinity as a flag
        percent = float('inf') if current > 0 else float('-inf') if current < 0 else 0.0

    return {
        'current': current,
        'previous': previous,
        'absolute_change': absolute,
        'percent_change': percent,
        'direction': 'increased' if absolute > 0 else 'decreased' if absolute < 0 else 'unchanged'
    }

# Per capita calculations (essential for fair comparisons)
def per_capita(value, population):
    """Calculate per capita rate."""
    return (value / population) * 100000  # Per 100,000 is standard

# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}

rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])

print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")
# City A actually has higher crime rate despite fewer total crimes!


# Inflation adjustment
def adjust_for_inflation(amount, from_year, to_year, cpi_data):
    """Adjust dollar amounts for inflation."""
    from_cpi = cpi_data[from_year]
    to_cpi = cpi_data[to_year]
    return amount * (to_cpi / from_cpi)

# Always adjust when comparing dollars across years!
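A usage example for the adjuster above (redefined here so the snippet is self-contained). The index values are made up purely for illustration; in practice, use actual BLS CPI-U annual averages:

```python
def adjust_for_inflation(amount, from_year, to_year, cpi_data):
    """Adjust dollar amounts for inflation."""
    return amount * (cpi_data[to_year] / cpi_data[from_year])

# Hypothetical index values -- substitute real BLS CPI-U figures
cpi = {2000: 100.0, 2024: 150.0}
print(adjust_for_inflation(1000, 2000, 2024, cpi))  # 1500.0
```

Reading: $1,000 in 2000 buys what $1,500 buys in 2024 under this (made-up) index.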

Correlation vs causation

## Reporting correlations responsibly

### What you CAN say
- "X and Y are correlated"
- "As X increases, Y tends to increase"
- "Areas with higher X also tend to have higher Y"
- "X is associated with Y"

### What you CANNOT say (without more evidence)
- "X causes Y"
- "X leads to Y"
- "Y happens because of X"

### Questions to ask before implying causation
1. Is there a plausible mechanism?
2. Does the timing make sense (cause before effect)?
3. Is there a dose-response relationship?
4. Has the finding been replicated?
5. Have confounding variables been controlled?
6. Are there alternative explanations?

### Red flags for spurious correlations
- Extremely high correlation (r > 0.95) with unrelated things
- No logical connection between variables
- Third variable could explain both
- Small sample size with high variance
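Before reporting an association, compute the correlation rather than eyeballing a chart. A sketch using the plain Pearson formula with stdlib only (in practice `scipy.stats.pearsonr` also returns a p-value):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear toy data: r is approximately 1.0
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))
```

Remember that r says nothing about causation; it only quantifies the linear association the checklist above asks you to scrutinize.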

Data visualization

Chart selection guide

## Choosing the right chart

### Comparison
- **Bar chart**: Compare categories
- **Grouped bar**: Compare categories across groups
- **Bullet chart**: Actual vs target

### Change over time
- **Line chart**: Trends over time
- **Area chart**: Cumulative totals over time
- **Slope chart**: Change between two points

### Distribution
- **Histogram**: Distribution of one variable
- **Box plot**: Compare distributions across groups
- **Violin plot**: Detailed distribution shape

### Relationship
- **Scatter plot**: Relationship between two variables
- **Bubble chart**: Three variables (x, y, size)
- **Connected scatter**: Change in relationship over time

### Composition
- **Pie chart**: Parts of a whole (use sparingly, max 5 slices)
- **Stacked bar**: Parts of whole across categories
- **Treemap**: Hierarchical composition

### Geographic
- **Choropleth**: Values by region (use normalized data!)
- **Dot map**: Individual locations
- **Proportional symbol**: Magnitude at locations

Visualization best practices

import matplotlib.pyplot as plt
import seaborn as sns

# Journalist-friendly chart defaults
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.titlesize': 16,
    'axes.labelsize': 12,
    'axes.spines.top': False,
    'axes.spines.right': False,
})

def create_bar_chart(data, title, source, xlabel='', ylabel=''):
    """Create a publication-ready bar chart."""
    fig, ax = plt.subplots()

    # Create bars
    bars = ax.bar(data.keys(), data.values(), color='#2c7bb6')

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:,.0f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    ha='center', va='bottom',
                    fontsize=10)

    # Labels and title
    ax.set_title(title, fontweight='bold', pad=20)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

    # Add source annotation
    fig.text(0.99, 0.01, f'Source: {source}',
             ha='right', va='bottom', fontsize=9, color='gray')

    plt.tight_layout()
    return fig

# Example
data = {'2020': 1200, '2021': 1450, '2022': 1380, '2023': 1620}
fig = create_bar_chart(data,
                       'Annual Widget Production',
                       'Department of Widgets, 2024',
                       ylabel='Units produced')
fig.savefig('chart.png', dpi=150, bbox_inches='tight')

Avoiding misleading visualizations

## Chart integrity checklist

### Axes
- [ ] Y-axis starts at zero (for bar charts)
- [ ] Axis labels are clear
- [ ] Scale is appropriate (not truncated to exaggerate)
- [ ] Both axes labeled with units

### Data representation
- [ ] All data points visible
- [ ] Colors are distinguishable (including colorblind)
- [ ] Proportions are accurate
- [ ] 3D effects not distorting perception

### Context
- [ ] Title describes what's shown, not conclusion
- [ ] Time period clearly stated
- [ ] Source cited
- [ ] Sample size/methodology noted if relevant
- [ ] Uncertainty shown where appropriate

### Honesty
- [ ] Cherry-picking dates avoided
- [ ] Outliers explained, not hidden
- [ ] Dual axes justified (usually avoid)
- [ ] Annotations don't mislead
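The axis items in the checklist are one-liners in matplotlib. A sketch with illustrative labels and data:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen (no display needed)
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(['A', 'B', 'C'], [40, 55, 48])

ax.set_ylim(bottom=0)                          # bar chart: y-axis starts at zero
ax.set_ylabel('Units sold (thousands)')        # axis labeled with units
ax.set_title('Widget sales by region, 2023')   # describes what's shown, not a conclusion
fig.text(0.99, 0.01, 'Source: hypothetical example',
         ha='right', va='bottom', fontsize=8, color='gray')
```

Making these settings explicit guards against a library default (or a later data change) silently truncating the axis.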

Story structure for data journalism

Data story framework

## The data story arc

### 1. The hook (nut graf)
- What's the key finding?
- Why should readers care?
- What's the human impact?

### 2. The evidence
- Show the data
- Explain the methodology
- Acknowledge limitations

### 3. The context
- How does this compare to past?
- How does this compare to elsewhere?
- What's the trend?

### 4. The human element
- Individual examples that illustrate the data
- Expert interpretation
- Affected voices

### 5. The implications
- What does this mean going forward?
- What questions remain?
- What actions could result?

### 6. The methodology box
- Where did data come from?
- How was it analyzed?
- What are the limitations?
- How can readers explore further?

Methodology documentation template

## How we did this analysis

### Data sources
[List all data sources with links and access dates]

### Time period
[Specify exactly what time period is covered]

### Definitions
[Define key terms and how you operationalized them]

### Analysis steps
1. [First step of analysis]
2. [Second step]
3. [Continue...]

### Limitations
- [Limitation 1]
- [Limitation 2]

### What we excluded and why
- [Excluded category]: [Reason]

### Verification
[How findings were verified/checked]

### Code and data availability
[Link to GitHub repo if sharing code/data]

### Contact
[How readers can reach you with questions]

Tools and resources

Essential tools

| Tool | Purpose | Cost |
| --- | --- | --- |
| Python + pandas | Data analysis | Free |
| R + tidyverse | Statistical analysis | Free |
| Excel/Sheets | Quick analysis | Free/Low |
| Datawrapper | Charts for web | Free tier |
| Flourish | Interactive viz | Free tier |
| QGIS | Mapping | Free |
| Tabula | PDF table extraction | Free |
| OpenRefine | Data cleaning | Free |

Learning resources

- NICAR (National Institute for Computer-Assisted Reporting, a program of Investigative Reporters & Editors)
- Knight Center for Journalism in the Americas
- Data Journalism Handbook (datajournalism.com)
- FlowingData (flowingdata.com)
- The Pudding (pudding.cool) - examples

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Use when working with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) creating new spreadsheets with formulas and formatting, (2) reading or analyzing data, (3) modifying existing spreadsheets while preserving formulas, (4) data analysis and visualization in spreadsheets, or (5) recalculating formulas.


Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.


Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis


Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.


Team Composition Analysis

This skill should be used when the user asks to "plan team structure", "determine hiring needs", "design org chart", "calculate compensation", "plan equity allocation", or requests organizational design and headcount planning for a startup.


Startup Financial Modeling

This skill should be used when the user asks to "create financial projections", "build a financial model", "forecast revenue", "calculate burn rate", "estimate runway", "model cash flow", or requests 3-5 year financial planning for a startup.


Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.


Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.


Startup Metrics Framework

This skill should be used when the user asks about "key startup metrics", "SaaS metrics", "CAC and LTV", "unit economics", "burn multiple", "rule of 40", "marketplace metrics", or requests guidance on tracking and optimizing business performance metrics.


Skill Information

Category: Creative
Last Updated: 1/28/2026