Scfgsea
by pwwang
Performs fast Gene Set Enrichment Analysis (GSEA) on single-cell data using the fgsea R package. Identifies enriched biological pathways by ranking genes based on differential expression between cell groups. Generates enrichment scores, significance metrics, and publication-ready visualizations.
ScFGSEA Process Configuration
Purpose
Performs fast Gene Set Enrichment Analysis (GSEA) on single-cell data using the fgsea R package. Identifies enriched biological pathways by ranking genes based on differential expression between cell groups. Generates enrichment scores, significance metrics, and publication-ready visualizations.
When to Use
- After clustering: Functional interpretation of cluster differences
- Pathway analysis: Identify biological processes driving cell type differentiation
- Comparative analysis: Compare gene expression patterns between groups (e.g., disease vs control)
- Subgroup analysis: Run GSEA on metadata subsets (diagnosis, treatment, etc.)
- TCR integration: Analyze pathway enrichment in TCR-selected clones/clusters
Configuration Structure
Process Enablement
[ScFGSEA]
cache = true
Input Specification
[ScFGSEA.in]
srtobj = ["SeuratClustering"] # or "ScRepCombiningExpression"
Environment Variables
[ScFGSEA.envs]
# Core parameters
ncores = 1 # Parallel cores
assay = "RNA" # Assay to use
subset = "seurat_clusters %in% c('c1', 'c2')" # Subset cells
# Grouping parameters
group_by = "seurat_clusters" # Column to compare
ident_1 = "c1" # First group
ident_2 = "c2" # Second group (optional; if omitted, all other cells are used)
each = "seurat_clusters" # Split into multiple cases
# Gene set database
gmtfile = "KEGG_2021_Human" # Default
# Ranking method
method = "s2n" # signal-to-noise (default)
# fgsea parameters
minsize = 10 # Min gene set size
maxsize = 100 # Max gene set size
top = 20 # Top pathways to plot (< 1 for padj threshold)
eps = 0.0 # Boundary for p-value calculation in fgsea (0 = estimate very small p-values accurately)
# Visualization
[ScFGSEA.envs.alleach_plots.Heatmap]
plot_type = "heatmap"
group_by = "Diagnosis"
Gene Set Databases
MSigDB Collections
- H (Hallmark): 50 curated, non-redundant gene sets → "MSigDB_Hallmark_2020"
- C2 (Curated): 7,411 gene sets from pathway databases
  - CP:KEGG → "KEGG_2021_Human"
  - CP:REACTOME → "Reactome_Pathways_2024"
  - CP:BIOCARTA → "BioCarta_2016"
  - CP:WIKIPATHWAYS → "WikiPathways_2024_Human"
- C5 (GO): 18,807 Gene Ontology terms
  - BP → "GO_Biological_Process_2025"
  - CC → "GO_Cellular_Component_2025"
  - MF → "GO_Molecular_Function_2025"
- C7 (Immunologic): 2,497 immune-specific signatures (use custom GMT)
Custom GMT Files
gmtfile = "/path/to/custom.gmt"
Format (one gene set per line, all fields tab-separated): name<tab>description<tab>gene1<tab>gene2<tab>...
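To make the file layout concrete, here is a minimal Python sketch of a GMT reader. It assumes the standard GMT layout, in which every field on a line, including each gene, is separated by a tab. The function name `parse_gmt` and the gene set names are hypothetical; the pipeline itself reads GMT files in R, so this is only an illustration for checking a custom file before use.

```python
from typing import Dict, List


def parse_gmt(text: str) -> Dict[str, List[str]]:
    """Parse GMT-formatted text into {gene_set_name: [genes]}."""
    gene_sets: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"Expected name, description and at least one gene: {line!r}")
        name, _description, *genes = fields
        # Drop empty fields left behind by stray trailing tabs
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


# Hypothetical two-line GMT content, used only for illustration
example = (
    "HYPOTHETICAL_SET_A\tdemo set\tCD3E\tCD4\tIL7R\n"
    "HYPOTHETICAL_SET_B\tdemo set\tMS4A1\tCD79A\n"
)
sets = parse_gmt(example)
print(sets["HYPOTHETICAL_SET_A"])
```

A file whose genes are joined by commas would load as a single "gene" per set under this reader, which is a quick way to spot a malformed custom GMT.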
Ranking Methods
- "s2n" / "signal_to_noise": Signal-to-noise ratio (default)
- "abs_s2n" / "abs_signal_to_noise": Absolute signal-to-noise
- "t_test": Student's t-test
- "ratio_of_classes": Fold change (natural scale)
- "diff_of_classes": Difference of means (equivalent to a log fold change when the data are already log-scale)
- "log2_ratio_of_classes": Log2 fold change (for natural-scale data)
Configuration Examples
Minimal Configuration
[ScFGSEA]
[ScFGSEA.in]
srtobj = ["SeuratClustering"]
[ScFGSEA.envs]
group_by = "seurat_clusters"
ident_1 = "c1"
ident_2 = "c2"
Standard Hallmark Analysis
[ScFGSEA.envs]
gmtfile = "MSigDB_Hallmark_2020"
group_by = "Diagnosis"
ident_1 = "Disease"
ident_2 = "Control"
each = "seurat_clusters"
method = "s2n"
top = 20
KEGG Pathways with Custom Thresholds
[ScFGSEA.envs]
gmtfile = "KEGG_2021_Human"
group_by = "Treatment"
ident_1 = "Treated"
ident_2 = "Untreated"
minsize = 15
maxsize = 200
method = "log2_ratio_of_classes"
GO Biological Process
[ScFGSEA.envs]
gmtfile = "GO_Biological_Process_2025"
group_by = "Diagnosis"
ident_1 = "Colitis"
ident_2 = "Control"
minsize = 10
maxsize = 500
top = 0.05 # padj < 0.05
Immunologic Signatures (Custom GMT)
[ScFGSEA.envs]
gmtfile = "/data/gmt/MSigDB_C7_Immunologic_Signatures.gmt"
group_by = "tissue_type"
ident_1 = "Inflamed"
ident_2 = "Normal"
minsize = 5
maxsize = 150
Multiple Database Comparison
[ScFGSEA.envs.cases.Hallmark]
gmtfile = "MSigDB_Hallmark_2020"
ident_1 = "Disease"
ident_2 = "Control"
[ScFGSEA.envs.cases.KEGG]
gmtfile = "KEGG_2021_Human"
ident_1 = "Disease"
ident_2 = "Control"
TCR Clonotype Analysis
[ScFGSEA.in]
srtobj = ["ScRepCombiningExpression"]
[ScFGSEA.envs]
group_by = "cdr3_clonotype_cluster"
ident_1 = "expanded_clone"
ident_2 = "rest"
gmtfile = "MSigDB_Hallmark_2020"
subset = "CD4"
Common Patterns
Pattern 1: Standard Cluster Comparison
[ScFGSEA.envs]
gmtfile = "MSigDB_Hallmark_2020"
group_by = "seurat_clusters"
ident_1 = "c1"
ident_2 = "c2"
Pattern 2: Disease vs Control with Multiple Clusters
[ScFGSEA.envs]
group_by = "Diagnosis"
ident_1 = "Disease"
ident_2 = "Control"
each = "seurat_clusters"
gmtfile = "KEGG_2021_Human"
Pattern 3: Log2 Fold Change Ranking
[ScFGSEA.envs]
method = "log2_ratio_of_classes"
gmtfile = "MSigDB_Hallmark_2020"
Pattern 4: Stringent Pathway Size Filter
[ScFGSEA.envs]
minsize = 20
maxsize = 150
gmtfile = "Reactome_Pathways_2024"
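As a rough illustration of what the size filter does, the Python sketch below keeps only gene sets whose size falls within `minsize` and `maxsize` after restricting each set to genes present in the ranked list, which is how fgsea applies its size limits. The function `filter_gene_sets` and the toy gene names are hypothetical.

```python
from typing import Dict, Iterable, List


def filter_gene_sets(
    gene_sets: Dict[str, List[str]],
    ranked_genes: Iterable[str],
    minsize: int = 10,
    maxsize: int = 100,
) -> Dict[str, List[str]]:
    """Keep gene sets with minsize <= (genes found in the ranking) <= maxsize."""
    universe = set(ranked_genes)
    kept: Dict[str, List[str]] = {}
    for name, genes in gene_sets.items():
        # De-duplicate while preserving order, then drop genes missing from the ranking
        present = [g for g in dict.fromkeys(genes) if g in universe]
        if minsize <= len(present) <= maxsize:
            kept[name] = present
    return kept


# Toy inputs for illustration only
toy_sets = {
    "SMALL": ["A", "B"],
    "OK": ["A", "B", "C", "Z"],  # "Z" is absent from the ranking
    "BIG": ["A", "B", "C", "D", "E", "F"],
}
kept = filter_gene_sets(toy_sets, ["A", "B", "C", "D", "E", "F"], minsize=3, maxsize=5)
print(kept)
```

Because the size is counted after the intersection, a pathway that looks large in the GMT file can still be dropped when few of its genes are detected in the dataset.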
Pattern 5: P-Value Threshold for Plots
[ScFGSEA.envs]
top = 0.01 # padj < 0.01 only
gmtfile = "MSigDB_Hallmark_2020"
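The dual meaning of `top` can be sketched in Python as follows. This assumes pathways are ordered by adjusted p-value before selection, which the configuration comments do not state explicitly; `select_pathways` and the result records are hypothetical and the padj values are made up.

```python
from typing import Dict, List, Union


def select_pathways(results: List[Dict], top: Union[int, float] = 20) -> List[Dict]:
    """Pick pathways to plot: a count when top >= 1, a padj cutoff when 0 < top < 1."""
    if top <= 0:
        raise ValueError("top must be > 0")
    ordered = sorted(results, key=lambda r: r["padj"])  # assumed ordering
    if top < 1:
        return [r for r in ordered if r["padj"] < top]
    return ordered[: int(top)]


# Made-up results, for illustration only
demo = [
    {"pathway": "A", "padj": 0.2},
    {"pathway": "B", "padj": 0.001},
    {"pathway": "C", "padj": 0.03},
]
print([r["pathway"] for r in select_pathways(demo, top=2)])
print([r["pathway"] for r in select_pathways(demo, top=0.01)])
```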
Pattern 6: Custom Metabolic Pathways
[ScFGSEA.envs]
gmtfile = "/data/gmt/KEGG_Metabolism.gmt"
group_by = "Metabolic_State"
ident_1 = "High"
ident_2 = "Low"
Dependencies
- Upstream: SeuratClustering or ScRepCombiningExpression
- Downstream: CellTypeAnnotation, pathway visualization
Validation Rules
- gmtfile: Valid enrichit name or GMT path
- group_by: Valid metadata column
- ident_1/ident_2: Values must exist in group_by
- minsize: ≥ 1; maxsize: > minsize
- top: ≥ 1 (number of pathways to plot) or between 0 and 1 (padj threshold)
- method: Valid ranking method
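These rules can be checked before launching a run with a small pre-flight script. The Python sketch below is one possible shape for such a check. The function `validate_envs`, the fallback defaults (taken from the example values in the configuration section), and the representation of metadata as a dict of columns are all assumptions. It cannot confirm that a library name is known to enrichit, so it only checks that `gmtfile` is a non-empty string.

```python
from typing import Any, Dict, List, Sequence

RANKING_METHODS = {
    "s2n", "signal_to_noise", "abs_s2n", "abs_signal_to_noise",
    "t_test", "ratio_of_classes", "diff_of_classes", "log2_ratio_of_classes",
}


def validate_envs(envs: Dict[str, Any], metadata: Dict[str, Sequence[Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors: List[str] = []

    gmtfile = envs.get("gmtfile")
    if not isinstance(gmtfile, str) or not gmtfile:
        errors.append("gmtfile must be a non-empty library name or GMT path")

    group_by = envs.get("group_by")
    if group_by not in metadata:
        errors.append(f"group_by {group_by!r} is not a metadata column")
    else:
        levels = set(metadata[group_by])
        for key in ("ident_1", "ident_2"):
            value = envs.get(key)
            if value is None and key == "ident_2":
                continue  # ident_2 is optional: all other cells are used
            if value not in levels:
                errors.append(f"{key} {value!r} not found in column {group_by!r}")

    minsize = envs.get("minsize", 10)
    maxsize = envs.get("maxsize", 100)
    if minsize < 1:
        errors.append("minsize must be >= 1")
    if maxsize <= minsize:
        errors.append("maxsize must be > minsize")

    if envs.get("top", 20) <= 0:
        errors.append("top must be > 0")

    if envs.get("method", "s2n") not in RANKING_METHODS:
        errors.append(f"unknown ranking method {envs.get('method')!r}")
    return errors


# Hypothetical metadata and settings
meta = {"Diagnosis": ["Disease", "Control", "Disease"]}
ok = validate_envs(
    {"gmtfile": "MSigDB_Hallmark_2020", "group_by": "Diagnosis",
     "ident_1": "Disease", "ident_2": "Control"},
    meta,
)
print(ok)
```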
Troubleshooting
Too Few Pathways Enriched
[ScFGSEA.envs]
minsize = 5 # Smaller pathways
maxsize = 500 # Larger pathways
top = 0.1 # Looser threshold
gmtfile = "GO_Biological_Process_2025" # More gene sets
No Enrichment Results
Causes: Insufficient cells, gene name mismatch, restrictive thresholds
Solutions:
[ScFGSEA.envs]
minsize = 10
maxsize = 200
subset = "group_by_count > 10"
Long Computation Time
[ScFGSEA.envs]
minsize = 20
maxsize = 100
gmtfile = "MSigDB_Hallmark_2020"
ncores = 8
subset = "seurat_clusters %in% c('c1', 'c2')"
Gene Name Mismatch
Cause: Human (GENE) vs mouse (Gene) naming, or different ID types
Solutions:
- Download species-specific GMT from MSigDB
- Check rownames(seurat_object)
- Ensure consistent formatting (uppercase for human)
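A quick way to diagnose a naming mismatch is to measure how many GMT genes appear among the object's gene names, both exactly and ignoring case. The Python sketch below assumes the gene names have been exported from the Seurat object (for example, from `rownames(seurat_object)`) into a plain list. The function `gene_overlap` and the sample names are hypothetical. A high case-insensitive overlap with a low exact overlap suggests a species or capitalization mismatch; upper-casing is only a diagnostic here, not a substitute for proper ortholog mapping.

```python
from typing import Dict, Iterable, List


def gene_overlap(object_genes: Iterable[str], gene_sets: Dict[str, List[str]]) -> Dict[str, float]:
    """Fraction of GMT genes found in the object, exact and case-insensitive."""
    gmt_genes = {g for genes in gene_sets.values() for g in genes}
    if not gmt_genes:
        raise ValueError("gene_sets contains no genes")
    names = set(object_genes)
    exact = len(gmt_genes & names) / len(gmt_genes)
    upper_names = {g.upper() for g in names}
    case_insensitive = sum(g.upper() in upper_names for g in gmt_genes) / len(gmt_genes)
    return {"exact": exact, "case_insensitive": case_insensitive}


# Hypothetical example: human-style GMT symbols vs mouse-style object names
demo_sets = {"HYPOTHETICAL_SET": ["CD3E", "CD4", "IL7R", "CCR7"]}
mouse_style_names = ["Cd3e", "Cd4", "Il7r", "Actb"]
report = gene_overlap(mouse_style_names, demo_sets)
print(report)
```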
Best Practices
- Start with Hallmark for quick, interpretable results
- Use log2_ratio_of_classes for natural-scale data (use diff_of_classes when values are already log-scale)
- Adjust minsize/maxsize based on database and research question
- Use multiple databases for comprehensive coverage
- Verify gene names match between Seurat and GMT files
- Use the each parameter for multiple subgroup comparisons
- Set top < 1 for p-value-based filtering
- Validate cell counts before running GSEA
- Parallelize with ncores for large datasets
- Cache results when testing visualization parameters
External References
- fgsea: https://bioconductor.org/packages/release/bioc/html/fgsea.html
- MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/
- enrichit: https://pwwang.github.io/enrichit/reference/FetchGMT.html
- GSEA paper: Subramanian et al. 2005, PNAS
Related Processes
- ClusterMarkers: Differential expression (provides ranked genes)
- MarkersFinder: Flexible marker finding with GSEA
- PseudoBulkDEG: Bulk-like DE with GSEA
- ModuleScoreCalculator: Score pathway genes across cells