Single Cell Rna Qc
by anthropics
Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.
Skill Details
Repository Files
6 files in this skill directory
name: single-cell-rna-qc description: Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.
Single-Cell RNA-seq Quality Control
Automated QC workflow for single-cell RNA-seq data following scverse best practices.
When to Use This Skill
Use when users:
- Request quality control or QC on single-cell RNA-seq data
- Want to filter low-quality cells or assess data quality
- Need QC visualizations or metrics
- Ask to follow scverse/scanpy best practices
- Request MAD-based filtering or outlier detection
Supported input formats:
.h5adfiles (AnnData format from scanpy/Python workflows).h5files (10X Genomics Cell Ranger output)
Default recommendation: Use Approach 1 (complete pipeline) unless the user has specific custom requirements or explicitly requests non-standard filtering logic.
Approach 1: Complete QC Pipeline (Recommended for Standard Workflows)
For standard QC following scverse best practices, use the convenience script scripts/qc_analysis.py:
python3 scripts/qc_analysis.py input.h5ad
# or for 10X Genomics .h5 files:
python3 scripts/qc_analysis.py raw_feature_bc_matrix.h5
The script automatically detects the file format and loads it appropriately.
When to use this approach:
- Standard QC workflow with adjustable thresholds (all cells filtered the same way)
- Batch processing multiple datasets
- Quick exploratory analysis
- User wants the "just works" solution
Requirements: anndata, scanpy, scipy, matplotlib, seaborn, numpy
Parameters:
Customize filtering thresholds and gene patterns using command-line parameters:
--output-dir- Output directory--mad-counts,--mad-genes,--mad-mt- MAD thresholds for counts/genes/MT%--mt-threshold- Hard mitochondrial % cutoff--min-cells- Gene filtering threshold--mt-pattern,--ribo-pattern,--hb-pattern- Gene name patterns for different species
Use --help to see current default values.
Outputs:
All files are saved to <input_basename>_qc_results/ directory by default (or to the directory specified by --output-dir):
qc_metrics_before_filtering.png- Pre-filtering visualizationsqc_filtering_thresholds.png- MAD-based threshold overlaysqc_metrics_after_filtering.png- Post-filtering quality metrics<input_basename>_filtered.h5ad- Clean, filtered dataset ready for downstream analysis<input_basename>_with_qc.h5ad- Original data with QC annotations preserved
If copying outputs to /mnt/user-data/outputs/ for user access, copy individual files (not the entire directory) so users can preview them directly as Claude.ai artifacts.
Workflow Steps
The script performs the following steps:
- Calculate QC metrics - Count depth, gene detection, mitochondrial/ribosomal/hemoglobin content
- Apply MAD-based filtering - Permissive outlier detection using MAD thresholds for counts/genes/MT%
- Filter genes - Remove genes detected in few cells
- Generate visualizations - Comprehensive before/after plots with threshold overlays
Approach 2: Modular Building Blocks (For Custom Workflows)
For custom analysis workflows or non-standard requirements, use the modular utility functions from scripts/qc_core.py and scripts/qc_plotting.py:
# Run from scripts/ directory, or add scripts/ to sys.path if needed
import anndata as ad
from qc_core import calculate_qc_metrics, detect_outliers_mad, filter_cells
from qc_plotting import plot_qc_distributions # Only if visualization needed
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# ... custom analysis logic here
When to use this approach:
- Different workflow needed (skip steps, change order, apply different thresholds to subsets)
- Conditional logic (e.g., filter neurons differently than other cells)
- Partial execution (only metrics/visualization, no filtering)
- Integration with other analysis steps in a larger pipeline
- Custom filtering criteria beyond what command-line params support
Available utility functions:
From qc_core.py (core QC operations):
calculate_qc_metrics(adata, mt_pattern, ribo_pattern, hb_pattern, inplace=True)- Calculate QC metrics and annotate adatadetect_outliers_mad(adata, metric, n_mads, verbose=True)- MAD-based outlier detection, returns boolean maskapply_hard_threshold(adata, metric, threshold, operator='>', verbose=True)- Apply hard cutoffs, returns boolean maskfilter_cells(adata, mask, inplace=False)- Apply boolean mask to filter cellsfilter_genes(adata, min_cells=20, min_counts=None, inplace=True)- Filter genes by detectionprint_qc_summary(adata, label='')- Print summary statistics
From qc_plotting.py (visualization):
plot_qc_distributions(adata, output_path, title)- Generate comprehensive QC plotsplot_filtering_thresholds(adata, outlier_masks, thresholds, output_path)- Visualize filtering thresholdsplot_qc_after_filtering(adata, output_path)- Generate post-filtering plots
Example custom workflows:
Example 1: Only calculate metrics and visualize, don't filter yet
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
plot_qc_distributions(adata, 'qc_before.png', title='Initial QC')
print_qc_summary(adata, label='Before filtering')
Example 2: Apply only MT% filtering, keep other metrics permissive
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# Only filter high MT% cells
high_mt = apply_hard_threshold(adata, 'pct_counts_mt', 10, operator='>')
adata_filtered = filter_cells(adata, ~high_mt)
adata_filtered.write('filtered.h5ad')
Example 3: Different thresholds for different subsets
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# Apply type-specific QC (assumes cell_type metadata exists)
neurons = adata.obs['cell_type'] == 'neuron'
other_cells = ~neurons
# Neurons tolerate higher MT%, other cells use stricter threshold
neuron_qc = apply_hard_threshold(adata[neurons], 'pct_counts_mt', 15, operator='>')
other_qc = apply_hard_threshold(adata[other_cells], 'pct_counts_mt', 8, operator='>')
Best Practices
- Be permissive with filtering - Default thresholds intentionally retain most cells to avoid losing rare populations
- Inspect visualizations - Always review before/after plots to ensure filtering makes biological sense
- Consider dataset-specific factors - Some tissues naturally have higher mitochondrial content (e.g., neurons, cardiomyocytes)
- Check gene annotations - Mitochondrial gene prefixes vary by species (mt- for mouse, MT- for human)
- Iterate if needed - QC parameters may need adjustment based on the specific experiment or tissue type
Reference Materials
For detailed QC methodology, parameter rationale, and troubleshooting guidance, see references/scverse_qc_guidelines.md. This reference provides:
- Detailed explanations of each QC metric and why it matters
- Rationale for MAD-based thresholds and why they're better than fixed cutoffs
- Guidelines for interpreting QC visualizations (histograms, violin plots, scatter plots)
- Species-specific considerations for gene annotations
- When and how to adjust filtering parameters
- Advanced QC considerations (ambient RNA correction, doublet detection)
Load this reference when users need deeper understanding of the methodology or when troubleshooting QC issues.
Next Steps After QC
Typical downstream analysis steps:
- Ambient RNA correction (SoupX, CellBender)
- Doublet detection (scDblFinder)
- Normalization (log-normalize, scran)
- Feature selection and dimensionality reduction
- Clustering and cell type annotation
Related Skills
Xlsx
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Clickhouse Io
ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
Analyzing Financial Statements
This skill calculates key financial ratios and metrics from financial statement data for investment analysis
Data Storytelling
Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.
Kpi Dashboard Design
Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.
Dbt Transformation Patterns
Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
Sql Optimization Patterns
Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
Anndata
This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
Xlsx
Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.
