Sampleinfo
by pwwang
The SampleInfo process is the pipeline entry point that reads sample metadata files, performs statistical analyses, and generates visualization reports.
SampleInfo Process Configuration
Purpose
The SampleInfo process is the pipeline entry point that reads sample metadata files, performs statistical analyses, and generates visualization reports.
When to Use
- Always required as the first process, unless using LoadingRNAFromSeurat
- When you have sample metadata in CSV/TSV format
- When you need to generate statistical summaries and visualizations
- When you want to add or transform metadata columns before downstream analysis
Note: Mutually exclusive with LoadingRNAFromSeurat.
Configuration Structure
[SampleInfo]
cache = true
[SampleInfo.in]
infile = "path/to/sample_info.txt" # Required: yes (unless using LoadingRNAFromSeurat)
[SampleInfo.envs]
sep = "\t" # str - File separator
mutaters = {} # dict - Column transformations using R expressions
save_mutated = false # bool - Save mutated columns to output
exclude_cols = "TCRData,BCRData,RNAData" # Columns hidden in report
defaults = { plot_type = "bar", more_formats = [], save_code = false }
stats = {} # dict - Statistical plot definitions
Required input columns: Sample (unique ID), RNAData (directory path)
Optional columns: TCRData, BCRData, additional metadata
Data format: CSV/TSV with header, RNA data must be Read10X()-compatible
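Before launching the pipeline, it can help to check the metadata file against the requirements above. The following is a minimal Python sketch, not part of the pipeline itself. The helper name `check_sample_info` is hypothetical, and it only checks what this section states: a header row, the `Sample` and `RNAData` columns, and unique `Sample` values.

```python
import csv
import io

REQUIRED_COLUMNS = ("Sample", "RNAData")  # required columns named in this section


def check_sample_info(text, sep="\t"):
    """Return a list of problems found in sample-info text (empty if none).

    Hypothetical helper for illustration; not part of the pipeline.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    header = reader.fieldnames or []
    problems = [f"missing required column: {col}"
                for col in REQUIRED_COLUMNS if col not in header]
    if problems:
        return problems

    # Sample must be a unique ID, so report every repeated value.
    seen = set()
    for row in reader:
        sample = row["Sample"]
        if sample in seen:
            problems.append(f"duplicate Sample ID: {sample}")
        seen.add(sample)
    return problems


good = "Sample\tRNAData\tAge\nS1\tdata/S1\t64\nS2\tdata/S2\t41\n"
bad = "Sample,Age\nS1,64\n"
print(check_sample_info(good))          # []
print(check_sample_info(bad, sep=","))  # ['missing required column: RNAData']
```

The check does not verify that each `RNAData` directory is Read10X()-compatible; that still has to be confirmed separately.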
Environment Variables
sep (string): Field separator - "\t", ",", ";", or any character
mutaters (dict): R expressions for dplyr::mutate(). Keys are column names, values are R expressions.
- Example: mutaters = { "AgeGroup" = "ifelse(Age > 60, 'Senior', 'Adult')" }
- Special function paired() identifies paired samples: paired(., 'PatientID', 'Timepoint', c('T1', 'T2'))
save_mutated (bool): Save mutated columns to output file. Factor columns lose level ordering when saved as text.
exclude_cols (str/list): Comma-separated string or list of columns to exclude from report table.
defaults (dict): Default plot parameters inherited by all plots:
[SampleInfo.envs.defaults]
plot_type = "bar" # Plot type (see External References)
more_formats = [] # Additional formats: ["pdf", "svg"]
save_code = false # Save R code and data
# The keys below default to null (unset). TOML has no null literal, so omit a key to leave it unset.
# subset: dplyr::filter expression
# section: Report section name
# descr: Plot description
# width, height: Plot dimensions
res = 100 # Plot resolution
stats (dict): Plot definitions. Keys are case names (titles), values inherit from defaults.
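The inheritance described above can be pictured as a per-case merge. This Python sketch is an assumption about the behavior (a shallow merge in which keys set on a case override `defaults`), not the pipeline's actual code; `resolve_cases` is a hypothetical name.

```python
def resolve_cases(defaults, stats):
    """Merge defaults into each stats case; case-level keys win.

    Assumption: a shallow merge. The real pipeline may treat nested keys differently.
    """
    return {name: {**defaults, **case} for name, case in stats.items()}


defaults = {"plot_type": "bar", "more_formats": [], "save_code": False}
stats = {
    "N_Samples_per_Diagnosis": {"x": "Sample", "split_by": "Diagnosis"},
    "Age_distribution": {"plot_type": "histogram", "x": "Age"},
}

resolved = resolve_cases(defaults, stats)
print(resolved["N_Samples_per_Diagnosis"]["plot_type"])  # bar
print(resolved["Age_distribution"]["plot_type"])         # histogram
```

Under this reading, a case only needs to state what differs from `defaults`, which is why the first case above omits `plot_type`.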
External References
Plotthis Functions
| plot_type | Function | Description |
|---|---|---|
| pie | PieChart() | Pie chart |
| bar | BarPlot() | Bar plot |
| box | BoxPlot() | Box plot |
| violin | ViolinPlot() | Violin plot |
| histogram | Histogram() | Histogram |
| density | DensityPlot() | Density plot |
| scatter | ScatterPlot() | Scatter plot |
| line | LinePlot() | Line plot |
| ridge | RidgePlot() | Ridge plot |
| heatmap | Heatmap() | Heatmap |
Full reference: https://pwwang.github.io/plotthis/reference/
Common Plot Parameters
x = "column_name" # X-axis column
y = "column_name" # Y-axis column
split_by = "column_name" # Split into separate plots
facet_by = "column_name" # Facet within a plot
palette = "Paired" # Color palette
alpha = 1.0 # Transparency
title = "Plot Title" # Plot title
nrow = 2 # Layout rows
ncol = 3 # Layout columns
legend.position = "right" # Legend placement
dplyr::filter() for subset
subset = "Sample == 'A'"
subset = "Age > 60"
subset = "Diagnosis %in% c('Colitis', 'Control')"
subset = "Sex == 'F' & Age > 50"
Configuration Examples
Minimal Configuration
[SampleInfo.in]
infile = "samples.txt"
Basic Statistics
[SampleInfo.in]
infile = "sample_info.txt"
[SampleInfo.envs.stats."Samples_per_Diagnosis"]
plot_type = "bar"
x = "Sample"
split_by = "Diagnosis"
Advanced Configuration
[SampleInfo.in]
infile = "metadata/samples.tsv"
[SampleInfo.envs]
save_mutated = true
mutaters = { "AgeGroup" = "ifelse(Age > 60, 'Senior', 'Adult')" }
[SampleInfo.envs.stats."N_Samples_per_Diagnosis"]
x = "Sample"
split_by = "Diagnosis"
[SampleInfo.envs.stats."Age_distribution"]
plot_type = "histogram"
x = "Age"
Common Patterns
Paired Sample Identification
[SampleInfo.envs]
mutaters = { "PairID" = "paired(., 'PatientID', 'Timepoint', c('T1', 'T2'))" }
[SampleInfo.envs.stats."Paired_Samples"]
x = "PairID"
subset = "!is.na(PairID)"
Subset Analysis
[SampleInfo.envs.stats."Controls_Only"]
x = "Sample"
split_by = "Diagnosis"
subset = "Diagnosis == 'Control'"
Dependencies
- Upstream: None (entry point process)
- Downstream: All pipeline processes depend on SampleInfo output
  - SeuratPreparing: Reads sample metadata
  - ScRepLoading: Uses TCRData/BCRData columns
  - All downstream: Use metadata columns for analysis
Validation Rules
Common Errors
- Missing input file: Always specify infile under [SampleInfo.in]
- Invalid separator: Match separator to file format (e.g., sep = "," for CSV)
- Missing required columns: Ensure Sample and RNAData columns exist
- Factor level ordering: Don't use save_mutated for factor columns; use SeuratPreparing.envs.mutaters instead
Value Constraints
- sep: Single character string
- mutaters: Valid R expressions
- stats keys: Must be unique case names
- devpars.res: Positive integer (default: 100)
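Some of these constraints can be checked mechanically before a run. The sketch below is a hypothetical Python helper (`check_envs` is not part of the pipeline) that works on an already-parsed `SampleInfo.envs` dictionary. It covers only the `sep` constraint and the `plot_type` values listed in the Plotthis Functions table; since plotthis provides more functions than that table shows, a flagged value is a prompt to double-check, not proof of an error.

```python
# plot_type values listed in the Plotthis Functions table (a partial list)
KNOWN_PLOT_TYPES = {"pie", "bar", "box", "violin", "histogram",
                    "density", "scatter", "line", "ridge", "heatmap"}


def check_envs(envs):
    """Return a list of constraint violations for a SampleInfo.envs-like dict.

    Hypothetical helper; it only encodes constraints stated in this document.
    """
    problems = []
    sep = envs.get("sep", "\t")
    if not isinstance(sep, str) or len(sep) != 1:
        problems.append("sep must be a single-character string")

    # Cases without their own plot_type fall back to the defaults entry.
    default_type = envs.get("defaults", {}).get("plot_type", "bar")
    for name, case in envs.get("stats", {}).items():
        plot_type = case.get("plot_type", default_type)
        if plot_type not in KNOWN_PLOT_TYPES:
            problems.append(
                f"{name}: plot_type {plot_type!r} is not in the Plotthis Functions table")
    return problems


envs = {
    "sep": ",,",
    "stats": {"Age_distribution": {"plot_type": "Histogram", "x": "Age"}},
}
for problem in check_envs(envs):
    print(problem)
```

Duplicate `stats` keys do not need a separate check here, because a TOML parser rejects a table that is defined twice.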
Troubleshooting
- Issue: SampleInfo re-runs entire pipeline on parameter change
  Solution: Set cache = "force" at pipeline level and cache = false under [SampleInfo]
- Issue: Factor levels appear in wrong order
  Solution: Use SeuratPreparing.envs.mutaters for factor columns
- Issue: Plots don't show expected data
  Solution: Check that the column names in x, y, and split_by match the input file exactly
- Issue: Paired sample function returns NA values
  Solution: Use uniq = FALSE in paired() or adjust the idents parameter
- Issue: Mutations not saved for downstream use
  Solution: Set save_mutated = true. For Seurat metadata, use SeuratPreparing.envs.mutaters
- Issue: Plot type not recognized
  Solution: Ensure plot_type is lowercase and maps to a plotthis function
