R Data Science
by benchflow-ai
R programming for data analysis, visualization, and statistical workflows. Use when working with R scripts (.R), Quarto documents (.qmd), RMarkdown (.Rmd), or R projects. Covers tidyverse workflows, ggplot2 visualizations, statistical analysis, epidemiological methods, and reproducible research practices.
Skill Details
Repository Files
9 files in this skill directory
name: r-data-science description: "R programming for data analysis, visualization, and statistical workflows. Use when working with R scripts (.R), Quarto documents (.qmd), RMarkdown (.Rmd), or R projects. Covers tidyverse workflows, ggplot2 visualizations, statistical analysis, epidemiological methods, and reproducible research practices."
R Data Science
Overview
Generate high-quality R code following tidyverse conventions and modern best practices. This skill covers data manipulation, visualization, statistical analysis, and reproducible research workflows commonly used in public health, epidemiology, and data science.
Core Principles
- Tidyverse-first: Use tidyverse packages (dplyr, tidyr, ggplot2, purrr, readr) as the default approach
- Pipe-forward: Use the native pipe
|>for chains (R 4.1+); fall back to%>%for older versions - Reproducibility: Structure all work for reproducibility using Quarto, renv, and clear documentation
- Defensive coding: Validate inputs, handle missing data explicitly, and fail informatively
Quick Reference: Common Patterns
Data Import
library(tidyverse)
# CSV (most common)
df <- read_csv("data/raw/dataset.csv")
# Excel
df <- readxl::read_excel("data/raw/dataset.xlsx", sheet = "Sheet1")
# Clean column names immediately
df <- df |> janitor::clean_names()
Data Wrangling Pipeline
analysis_data <- raw_data |>
# Clean and filter
filter(!is.na(key_variable)) |>
# Transform variables
mutate(
date = as.Date(date_string, format = "%Y-%m-%d"),
age_group = cut(age, breaks = c(0, 18, 45, 65, Inf),
labels = c("0-17", "18-44", "45-64", "65+"))
) |>
# Summarize
group_by(region, age_group) |>
summarize(
n = n(),
mean_value = mean(outcome, na.rm = TRUE),
.groups = "drop"
)
Basic ggplot2 Visualization
ggplot(df, aes(x = date, y = count, color = category)) +
geom_line(linewidth = 1) +
scale_color_brewer(palette = "Set2") +
labs(
title = "Trend Over Time",
subtitle = "By category",
x = "Date",
y = "Count",
color = "Category",
caption = "Source: Dataset Name"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold")
)
Tidyverse Style Guide Essentials
Naming Conventions
- snake_case for objects and functions:
case_counts,calculate_rate() - Verbs for functions:
filter_outliers(),compute_summary() - Nouns for data:
patient_data,surveillance_df - Avoid: dots in names (reserved for S3), single letters except in lambdas
Code Formatting
- Indentation: 2 spaces (never tabs)
- Line length: 80 characters maximum
- Operators: Spaces around
<-,=,+,|>, but not:,::,$ - Commas: Space after, never before
- Pipes: New line after each
|>
# Good
result <- data |>
filter(year >= 2020) |>
group_by(county) |>
summarize(total = sum(cases))
# Bad
result<-data|>filter(year>=2020)|>group_by(county)|>summarize(total=sum(cases))
Assignment
- Use
<-for assignment, never=or-> - Use
=only for function arguments
Comments
# Load and clean surveillance data ------------------------------------------
# Calculate age-adjusted rates
# Using direct standardization method per CDC guidelines
adjusted_rate <- calculate_adjusted_rate(df, standard_pop)
Package Ecosystem
Core Tidyverse (Always Load)
library(tidyverse) # Loads: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
Data Import/Export
| Task | Package | Key Functions |
|---|---|---|
| CSV/TSV | readr | read_csv(), write_csv() |
| Excel | readxl, writexl | read_excel(), write_xlsx() |
| SAS/SPSS/Stata | haven | read_sas(), read_spss(), read_stata() |
| JSON | jsonlite | read_json(), fromJSON() |
| Databases | DBI, dbplyr | dbConnect(), tbl() |
Data Manipulation
| Task | Package | Key Functions |
|---|---|---|
| Column cleaning | janitor | clean_names(), tabyl() |
| Date handling | lubridate | ymd(), mdy(), floor_date() |
| String operations | stringr | str_detect(), str_extract() |
| Missing data | naniar | vis_miss(), replace_with_na() |
Visualization
| Task | Package | Key Functions |
|---|---|---|
| Core plotting | ggplot2 | ggplot(), geom_*() |
| Extensions | ggrepel, patchwork | geom_text_repel(), + operator |
| Interactive | plotly | ggplotly() |
| Tables | gt, kableExtra | gt(), kable() |
Statistical Analysis
| Task | Package | Key Functions |
|---|---|---|
| Model summaries | broom | tidy(), glance(), augment() |
| Regression | stats, lme4 | lm(), glm(), lmer() |
| Survival | survival | Surv(), survfit(), coxph() |
| Survey data | survey | svydesign(), svymean() |
Epidemiology & Public Health
| Task | Package | Key Functions |
|---|---|---|
| Epi calculations | epiR | epi.2by2(), epi.conf() |
| Outbreak tools | incidence2, epicontacts | incidence(), make_epicontacts() |
| Disease mapping | SpatialEpi | expected(), EBlocal() |
| Surveillance | surveillance | sts(), farrington() |
| Rate calculations | epitools | riskratio(), oddsratio(), ageadjust.direct() |
Reproducibility Standards
Project Structure
project/
├── project.Rproj
├── renv.lock
├── CLAUDE.md # Claude Code configuration
├── README.md
├── data/
│ ├── raw/ # Never modify
│ └── processed/ # Analysis-ready
├── R/ # Custom functions
├── scripts/ # Pipeline scripts
├── analysis/ # Quarto documents
└── output/
├── figures/
└── tables/
Quarto Document Header
---
title: "Analysis Title"
author: "Your Name"
date: today
format:
html:
toc: true
code-fold: true
embed-resources: true
execute:
warning: false
message: false
---
Package Management with renv
# Initialize (once per project)
renv::init()
# Snapshot dependencies after installing packages
renv::snapshot()
# Restore environment (for collaborators)
renv::restore()
Workflow Documentation
Always include at the top of scripts:
# ============================================================================
# Title: Analysis of [Subject]
# Author: [Name]
# Date: [Date]
# Purpose: [One-sentence description]
# Input: data/processed/clean_data.csv
# Output: output/figures/trend_plot.png
# ============================================================================
Common Analysis Patterns
Descriptive Statistics Table
df |>
group_by(category) |>
summarize(
n = n(),
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
median = median(value, na.rm = TRUE),
q25 = quantile(value, 0.25, na.rm = TRUE),
q75 = quantile(value, 0.75, na.rm = TRUE)
) |>
gt::gt() |>
gt::fmt_number(columns = where(is.numeric), decimals = 2)
Regression with Tidy Output
model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial)
# Tidy coefficients
tidy_results <- broom::tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
select(term, estimate, conf.low, conf.high, p.value)
# Model diagnostics
glance_results <- broom::glance(model)
Epi Curve (Epidemic Curve)
library(incidence2)
# Create incidence object
inc <- incidence(
df,
date_index = "onset_date",
interval = "week",
groups = "outcome_category"
)
# Plot
plot(inc) +
labs(
title = "Epidemic Curve",
x = "Week of Onset",
y = "Number of Cases"
) +
theme_minimal()
Rate Calculation
# Age-adjusted rates using direct standardization
library(epitools)
# Stratum-specific counts and populations
result <- ageadjust.direct(
count = df$cases,
pop = df$population,
stdpop = standard_population$pop # e.g., US 2000 standard
)
Error Handling
Defensive Data Checks
# Validate data before analysis
stopifnot(
"Data frame is empty" = nrow(df) > 0,
"Missing required columns" = all(c("id", "date", "value") %in% names(df)),
"Duplicate IDs found" = !any(duplicated(df$id))
)
# Informative warnings for data quality issues
if (sum(is.na(df$key_var)) > 0) {
warning(sprintf("%d missing values in key_var (%.1f%%)",
sum(is.na(df$key_var)),
100 * mean(is.na(df$key_var))))
}
Safe File Operations
# Check file exists before reading
if (!file.exists(filepath)) {
stop(sprintf("File not found: %s", filepath))
}
# Create directories if needed
dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)
Performance Tips
For Large Datasets
# Use data.table for >1M rows
library(data.table)
dt <- fread("large_file.csv")
# Or use arrow for very large/parquet files
library(arrow)
df <- read_parquet("data.parquet")
# Lazy evaluation with duckdb
library(duckdb)
con <- dbConnect(duckdb())
df_lazy <- tbl(con, "data.csv")
Vectorization Over Loops
# Good: vectorized
df$rate <- df$cases / df$population * 100000
# Avoid: row-by-row loop
for (i in 1:nrow(df)) {
df$rate[i] <- df$cases[i] / df$population[i] * 100000
}
Additional Resources
For detailed patterns, consult:
- Tidyverse Style Guide: https://style.tidyverse.org/
- R for Data Science (2e): https://r4ds.hadley.nz/
- The Epidemiologist R Handbook: https://epirhandbook.com/
- Quarto Documentation: https://quarto.org/
Version History
- v1.0.0 (2025-12-04): Initial release for PubHealthAI community
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Team Composition Analysis
This skill should be used when the user asks to "plan team structure", "determine hiring needs", "design org chart", "calculate compensation", "plan equity allocation", or requests organizational design and headcount planning for a startup.
Startup Financial Modeling
This skill should be used when the user asks to "create financial projections", "build a financial model", "forecast revenue", "calculate burn rate", "estimate runway", "model cash flow", or requests 3-5 year financial planning for a startup.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Startup Metrics Framework
This skill should be used when the user asks about "key startup metrics", "SaaS metrics", "CAC and LTV", "unit economics", "burn multiple", "rule of 40", "marketplace metrics", or requests guidance on tracking and optimizing business performance metrics.
