QC Statistics
by shandley
QC and sequencing statistics workflows using biometal primitives. Trigger when performing quality control, calculating sequencing statistics, generating QC reports, assessing data quality, or evaluating run metrics. All primitives exist - 100% coverage.
name: qc-statistics
description: QC and sequencing statistics workflows using biometal primitives. Trigger when performing quality control, calculating sequencing statistics, generating QC reports, assessing data quality, or evaluating run metrics. All primitives exist - 100% coverage.
QC & Sequencing Statistics
Purpose: Quality control and statistics workflows with constant memory.
Primitives Status: ✅ 100% COMPLETE - All QC primitives implemented
Core Principle
All QC operations use streaming - constant ~3 MB memory regardless of file size.
- No loading: Process one record at a time
- No limits: QC on TB-scale datasets on a laptop
- Production-ready: Validated on 22M+ reads
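The one-record-at-a-time pattern looks like this in miniature. This is an illustrative sketch using a synthetic generator in place of a real biometal stream, so the names and numbers are assumptions for demonstration only:

```python
def synthetic_reads(n):
    """Yield (sequence, mean_quality) pairs one at a time - nothing accumulates."""
    for i in range(n):
        yield ("ACGT" * 25, 30 + (i % 10))

# Streaming accumulation: only scalar counters live in memory, never the reads
total_reads = 0
total_bases = 0
quality_sum = 0.0

for seq, mean_q in synthetic_reads(1_000_000):
    total_reads += 1
    total_bases += len(seq)
    quality_sum += mean_q

print(f"{total_reads:,} reads, {total_bases:,} bases, "
      f"mean quality {quality_sum / total_reads:.1f}")
```

The same three-counter shape carries through every workflow below; only the per-record primitives change.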
Quick Reference
| QC Type | Primitives Available |
|---|---|
| Pre-alignment | FASTQ parsing, quality filtering, base counting, GC content, complexity, trimming |
| Post-alignment | BAM parsing, pileup, coverage, MAPQ filtering, flag filtering, insert size |
| Both | Sequence operations, statistics, format validation |
Pre-Alignment QC (FASTQ)
Basic Statistics
```python
import biometal

stream = biometal.FastqStream.from_path("reads.fq.gz")

stats = {
    "total_reads": 0,
    "total_bases": 0,
    "quality_sum": 0,
    "gc_sum": 0,
    "length_sum": 0,
}

for record in stream:
    stats["total_reads"] += 1
    stats["total_bases"] += len(record.sequence)
    stats["quality_sum"] += biometal.mean_quality(record.quality_scores) * len(record.sequence)
    stats["gc_sum"] += biometal.gc_content(record.sequence) * len(record.sequence)
    stats["length_sum"] += len(record.sequence)

# Calculate means (quality and GC are length-weighted)
mean_quality = stats["quality_sum"] / stats["total_bases"]
mean_gc = stats["gc_sum"] / stats["total_bases"]
mean_length = stats["length_sum"] / stats["total_reads"]

print(f"Total reads: {stats['total_reads']:,}")
print(f"Total bases: {stats['total_bases']:,}")
print(f"Mean quality: {mean_quality:.1f}")
print(f"Mean GC: {mean_gc:.1%}")
print(f"Mean length: {mean_length:.1f} bp")
```
Quality Distribution
```python
import biometal

stream = biometal.FastqStream.from_path("reads.fq.gz")

# Quality bins: 0-9, 10-19, 20-29, 30-39, 40+
quality_bins = {
    "Q0-9": 0,
    "Q10-19": 0,
    "Q20-29": 0,
    "Q30-39": 0,
    "Q40+": 0,
}

for record in stream:
    mean_q = biometal.mean_quality(record.quality_scores)
    if mean_q < 10:
        quality_bins["Q0-9"] += 1
    elif mean_q < 20:
        quality_bins["Q10-19"] += 1
    elif mean_q < 30:
        quality_bins["Q20-29"] += 1
    elif mean_q < 40:
        quality_bins["Q30-39"] += 1
    else:
        quality_bins["Q40+"] += 1

# Print distribution
total = sum(quality_bins.values())
for bin_name, count in quality_bins.items():
    pct = count / total * 100
    print(f"{bin_name}: {count:,} ({pct:.1f}%)")
```
Base Composition
```python
import biometal

stream = biometal.FastqStream.from_path("reads.fq.gz")

# Accumulate base counts
total_counts = {"A": 0, "C": 0, "G": 0, "T": 0, "N": 0}

for record in stream:
    counts = biometal.count_bases(record.sequence)
    total_counts["A"] += counts.a
    total_counts["C"] += counts.c
    total_counts["G"] += counts.g
    total_counts["T"] += counts.t
    total_counts["N"] += counts.n

# Calculate percentages
total_bases = sum(total_counts.values())
for base, count in total_counts.items():
    pct = count / total_bases * 100
    print(f"{base}: {count:,} ({pct:.2f}%)")

# GC content
gc_content = (total_counts["G"] + total_counts["C"]) / total_bases
print(f"\nGC content: {gc_content:.1%}")
```
Per-Position Quality
```python
import biometal
import numpy as np

stream = biometal.FastqStream.from_path("reads.fq.gz")

# Accumulate running sums per position. Storing every quality score in
# per-position lists would grow with read count; sums keep memory constant.
max_length = 500  # Adjust based on your read length
quality_sums = np.zeros(max_length)
quality_counts = np.zeros(max_length, dtype=np.int64)

for record in stream:
    for pos, qual in enumerate(record.quality_scores):
        if pos < max_length:
            quality_sums[pos] += qual
            quality_counts[pos] += 1

# Mean quality per position (0 where no read reaches that position)
mean_per_position = np.divide(
    quality_sums, quality_counts,
    out=np.zeros(max_length), where=quality_counts > 0,
)

# Find positions with declining quality
print("Position\tMean Quality")
for pos, mean_q in enumerate(mean_per_position[:100]):  # First 100 bp
    if quality_counts[pos] > 0:
        print(f"{pos+1}\t{mean_q:.1f}")
```
See FASTQ_QC.md for complete pre-alignment workflows.
Post-Alignment QC (BAM)
Alignment Statistics
```python
import biometal

reader = biometal.BamReader.from_path("alignments.bam")

stats = {
    "total": 0,
    "mapped": 0,
    "paired": 0,
    "proper_pair": 0,
    "duplicate": 0,
    "secondary": 0,
    "supplementary": 0,
    "mapq_sum": 0,
}

for record in reader:
    stats["total"] += 1
    if record.is_mapped:
        stats["mapped"] += 1
        stats["mapq_sum"] += record.mapq
    if record.is_paired:
        stats["paired"] += 1
    if record.is_proper_pair:
        stats["proper_pair"] += 1
    if record.is_duplicate:
        stats["duplicate"] += 1
    if record.is_secondary:
        stats["secondary"] += 1
    if record.is_supplementary:
        stats["supplementary"] += 1

# Calculate rates
mapping_rate = stats["mapped"] / stats["total"]
proper_pair_rate = stats["proper_pair"] / stats["paired"] if stats["paired"] > 0 else 0
duplicate_rate = stats["duplicate"] / stats["total"]
mean_mapq = stats["mapq_sum"] / stats["mapped"] if stats["mapped"] > 0 else 0

print(f"Total reads: {stats['total']:,}")
print(f"Mapped: {stats['mapped']:,} ({mapping_rate:.1%})")
print(f"Properly paired: {stats['proper_pair']:,} ({proper_pair_rate:.1%})")
print(f"Duplicates: {stats['duplicate']:,} ({duplicate_rate:.1%})")
print(f"Mean MAPQ: {mean_mapq:.1f}")
```
Coverage Statistics
```python
import biometal
import numpy as np

reader = biometal.BamReader.from_path("alignments.bam")

# Configure pileup
config = biometal.PileupConfig.new()
config = config.with_min_mapq(20)
config = config.with_skip_duplicates(True)

# Calculate coverage over region
coverage = biometal.pileup_depth(reader, "chr1", 1000000, 2000000, config)

# Coverage statistics
depths = [interval.depth for interval in coverage]

mean_depth = np.mean(depths)
median_depth = np.median(depths)
std_depth = np.std(depths)

print(f"Mean coverage: {mean_depth:.1f}×")
print(f"Median coverage: {median_depth:.1f}×")
print(f"Std deviation: {std_depth:.1f}×")

# Coverage distribution
bins = [0, 10, 20, 30, 50, 100]
for low, high in zip(bins, bins[1:]):
    count = sum(1 for d in depths if low <= d < high)
    pct = count / len(depths) * 100
    print(f"{low}-{high}×: {count:,} positions ({pct:.1f}%)")

count = sum(1 for d in depths if d >= bins[-1])
pct = count / len(depths) * 100
print(f"≥{bins[-1]}×: {count:,} positions ({pct:.1f}%)")
```
Insert Size Distribution
```python
import biometal
import numpy as np

reader = biometal.BamReader.from_path("alignments.bam")

insert_sizes = []
for record in reader:
    if record.is_proper_pair and record.template_length > 0:
        insert_sizes.append(abs(record.template_length))

mean_insert = np.mean(insert_sizes)
median_insert = np.median(insert_sizes)
std_insert = np.std(insert_sizes)

print(f"Mean insert size: {mean_insert:.0f} bp")
print(f"Median insert size: {median_insert:.0f} bp")
print(f"Std deviation: {std_insert:.0f} bp")

# Distribution
percentiles = [10, 25, 50, 75, 90, 95, 99]
for p in percentiles:
    value = np.percentile(insert_sizes, p)
    print(f"P{p}: {value:.0f} bp")
```
See ALIGNMENT_QC.md for complete post-alignment workflows.
Combined Workflows
FastQC-Style Report
```python
import biometal

def generate_fastqc_report(fastq_path):
    """Generate comprehensive FASTQ QC report."""
    stream = biometal.FastqStream.from_path(fastq_path)

    # Running scalars only - no per-read lists, so memory stays constant
    stats = {
        "total_reads": 0,
        "total_bases": 0,
        "quality_sum": 0,
        "gc_sum": 0,
        "min_length": None,
        "max_length": 0,
        "low_quality": 0,
        "high_n_content": 0,
    }

    for record in stream:
        stats["total_reads"] += 1
        length = len(record.sequence)
        stats["total_bases"] += length
        stats["min_length"] = length if stats["min_length"] is None else min(stats["min_length"], length)
        stats["max_length"] = max(stats["max_length"], length)

        mean_q = biometal.mean_quality(record.quality_scores)
        stats["quality_sum"] += mean_q
        stats["gc_sum"] += biometal.gc_content(record.sequence)

        # Flag low quality
        if mean_q < 20:
            stats["low_quality"] += 1

        # Flag high N content
        counts = biometal.count_bases(record.sequence)
        n_pct = counts.n / length if length > 0 else 0
        if n_pct > 0.05:  # > 5% N
            stats["high_n_content"] += 1

    # Generate report
    print("=" * 60)
    print("FASTQ Quality Control Report")
    print("=" * 60)
    print("\nBasic Statistics:")
    print(f"  Total reads: {stats['total_reads']:,}")
    print(f"  Total bases: {stats['total_bases']:,}")
    print(f"  Mean quality: {stats['quality_sum'] / stats['total_reads']:.1f}")
    print(f"  Mean GC: {stats['gc_sum'] / stats['total_reads']:.1%}")
    print(f"  Mean length: {stats['total_bases'] / stats['total_reads']:.1f} bp")
    print(f"  Length range: {stats['min_length']}-{stats['max_length']} bp")

    print("\nQuality Flags:")
    low_q_pct = stats["low_quality"] / stats["total_reads"] * 100
    high_n_pct = stats["high_n_content"] / stats["total_reads"] * 100
    print(f"  Low quality (Q<20): {stats['low_quality']:,} ({low_q_pct:.1f}%)")
    print(f"  High N content (>5%): {stats['high_n_content']:,} ({high_n_pct:.1f}%)")

    print("\nQC Decision: ", end="")
    if low_q_pct > 20 or high_n_pct > 10:
        print("⚠️ WARN - Quality issues detected")
    else:
        print("✅ PASS")
    print("=" * 60)

generate_fastqc_report("reads.fq.gz")
```
Flagstat-Style Report
```python
import biometal

def generate_flagstat_report(bam_path):
    """Generate samtools flagstat-style report using the flagstat primitive."""
    # Use the direct flagstat primitive (recommended)
    stats = biometal.flagstat(bam_path)

    # samtools-compatible output
    print(stats.to_flagstat_string())

    # Or access individual metrics programmatically
    print("\nProgrammatic access:")
    print(f"  Total reads: {stats.total():,}")
    print(f"  Mapped: {stats.mapped():,} ({stats.mapping_rate():.1%})")
    print(f"  Duplicates: {stats.duplicates():,} ({stats.duplicate_rate():.1%})")
    print(f"  Properly paired: {stats.properly_paired():,} ({stats.properly_paired_rate():.1%})")
    print(f"  Mean MAPQ: {stats.mean_mapq():.1f}")

generate_flagstat_report("alignments.bam")
```
Region-Specific Flagstat
import biometal
def region_flagstat(bam_path, chrom, start, end):
"""Calculate flagstat for a specific genomic region."""
stats = biometal.flagstat_region(
bam_path,
f"{bam_path}.bai",
chrom=chrom,
start=start,
end=end
)
print(f"Region {chrom}:{start:,}-{end:,}")
print(f" Reads: {stats.total():,}")
print(f" Mapped: {stats.mapped():,} ({stats.mapping_rate():.1%})")
print(f" Mean MAPQ: {stats.mean_mapq():.1f}")
region_flagstat("alignments.bam", "chr1", 1000000, 2000000)
Memory Guarantees
All QC workflows maintain constant ~3 MB memory:
| Dataset Size | Memory Used | Validated |
|---|---|---|
| 1 GB FASTQ | 3.16 MB | ✅ |
| 10 GB FASTQ | 3.16 MB | ✅ |
| 100 GB BAM | ~3 MB | ✅ |
| 1 TB dataset | ~3 MB | Projected |
Why constant: Streaming architecture processes one record at a time.
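You can sanity-check the constant-memory claim with the standard library's tracemalloc. The sketch below substitutes a synthetic generator for a real biometal stream, so the absolute numbers illustrate the streaming pattern rather than biometal's exact ~3 MB footprint:

```python
import tracemalloc

def synthetic_reads(n):
    """Yield one ~150 bp read at a time; nothing is retained between yields."""
    for _ in range(n):
        yield "ACGT" * 38

tracemalloc.start()

total_bases = 0
peak_after_warmup = None
for i, seq in enumerate(synthetic_reads(200_000)):
    total_bases += len(seq)
    if i == 1_000:
        # Snapshot the peak early in the stream...
        peak_after_warmup = tracemalloc.get_traced_memory()[1]

# ...and compare with the peak after processing 200x more records
final_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print(f"Peak after 1K reads:   {peak_after_warmup:,} bytes")
print(f"Peak after 200K reads: {final_peak:,} bytes")
# The two peaks are nearly identical: memory does not grow with record count
```

If you accidentally accumulate per-read data (see Troubleshooting below), the second peak grows linearly instead.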
Performance Expectations
Validated January 2026 on Apple M1 Max:
| Operation | Throughput | Notes |
|---|---|---|
| FASTQ parsing (gzipped) | 250K reads/s | Decompression-bound |
| FASTQ parsing (uncompressed) | 1M reads/s | CPU-bound |
| BAM parsing | 118 MiB/s | With BGZF decompression |
| Base counting | 14.9 GiB/s | ARM NEON acceleration |
| GC content | 17.5 GiB/s | ARM NEON acceleration |
| Quality mean | ~10 GiB/s | Auto-vectorized |
Scaling: QC on 100 GB FASTQ takes ~7 minutes (gzipped), constant 3 MB memory.
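The scaling note can be sanity-checked arithmetically. The helper below is an illustrative sketch, and the 100M-read count is an assumed example, using the gzipped-FASTQ throughput from the table:

```python
def estimate_runtime_seconds(n_reads, reads_per_second):
    """Estimate wall-clock time for one streaming pass at a given throughput."""
    return n_reads / reads_per_second

# Hypothetical 100M-read gzipped FASTQ at the decompression-bound rate above
seconds = estimate_runtime_seconds(100_000_000, 250_000)
print(f"~{seconds / 60:.0f} minutes")
```

Because streaming makes a single pass, runtime scales linearly with read count while memory stays flat.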
Common QC Patterns
Pattern 1: Pre-Filter Before Analysis
```python
import biometal

# Filter low-quality reads before downstream analysis
stream = biometal.FastqStream.from_path("reads.fq.gz")

passed = 0
for record in stream:
    mean_q = biometal.mean_quality(record.quality_scores)
    if mean_q >= 30 and len(record.sequence) >= 100:
        # Pass to downstream analysis
        passed += 1

print(f"Passed QC: {passed:,} reads")
```
Pattern 2: Target Enrichment QC
import biometal
# Check on-target rate
index = biometal.BaiIndex.from_path("alignments.bam.bai")
reader = biometal.BamReader.from_path_with_index("alignments.bam", index)
# Load target regions from BED
targets = list(biometal.Bed6Stream.from_path("targets.bed"))
on_target = 0
total_mapped = 0
for target in targets:
for record in reader.query(target.chrom, target.start, target.end):
if record.is_mapped:
on_target += 1
# Get total mapped
reader_full = biometal.BamReader.from_path("alignments.bam")
total_mapped = sum(1 for r in reader_full if r.is_mapped)
enrichment = on_target / total_mapped if total_mapped > 0 else 0
print(f"On-target rate: {enrichment:.1%}")
Pattern 3: Contamination Check
import biometal
# Check for adapter contamination
stream = biometal.FastqStream.from_path("reads.fq.gz")
adapter = "AGATCGGAAGAG" # Illumina adapter
contaminated = 0
for record in stream:
if adapter in record.sequence:
contaminated += 1
contamination_rate = contaminated / stream.total * 100
print(f"Adapter contamination: {contamination_rate:.2f}%")
QC Decision Criteria
FASTQ QC Pass/Fail
PASS if:
- Mean quality ≥ Q30
- < 10% reads below Q20
- GC content 35-65% (species-dependent)
- < 5% reads with high N content (> 5% N)
WARN if:
- Mean quality Q20-Q30
- 10-20% reads below Q20
- GC content outside expected range
- 5-10% reads with high N content
FAIL if:
- Mean quality < Q20
- More than 20% of reads below Q20
- More than 10% of reads with high N content
BAM QC Pass/Fail
PASS if:
- Mapping rate ≥ 90%
- Duplicate rate < 20%
- Mean MAPQ ≥ 30
- Proper pair rate ≥ 90% (for paired-end)
WARN if:
- Mapping rate 70-90%
- Duplicate rate 20-40%
- Mean MAPQ 20-30
- Proper pair rate 70-90%
FAIL if:
- Mapping rate < 70%
- Duplicate rate > 40%
- Mean MAPQ < 20
- Proper pair rate < 70%
Note: Criteria are guidelines - adjust based on experiment type and species.
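The FASTQ rubric above can be encoded directly. The function below is a sketch of that logic; the function name and argument layout are illustrative, not a biometal API, and boundary handling should be adjusted to your own conventions:

```python
def fastq_qc_verdict(mean_quality, pct_below_q20, gc_pct, pct_high_n):
    """Apply the FASTQ PASS/WARN/FAIL criteria. Percentage arguments are 0-100."""
    # FAIL conditions take precedence
    if mean_quality < 20 or pct_below_q20 > 20 or pct_high_n > 10:
        return "FAIL"
    # Any single WARN condition downgrades the run
    if (mean_quality < 30 or pct_below_q20 > 10
            or not 35 <= gc_pct <= 65 or pct_high_n > 5):
        return "WARN"
    return "PASS"

print(fastq_qc_verdict(mean_quality=34, pct_below_q20=3, gc_pct=48, pct_high_n=1))   # PASS
print(fastq_qc_verdict(mean_quality=25, pct_below_q20=15, gc_pct=48, pct_high_n=6))  # WARN
print(fastq_qc_verdict(mean_quality=18, pct_below_q20=35, gc_pct=48, pct_high_n=12)) # FAIL
```

The same pattern extends naturally to the BAM criteria (mapping rate, duplicate rate, MAPQ, proper-pair rate).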
Troubleshooting
"Memory usage higher than expected"
Cause: Accumulating data in memory (e.g., storing all quality scores)
Solution:
```python
# Bad - stores everything
all_qualities = []
for record in stream:
    all_qualities.extend(record.quality_scores)  # Growing list

# Good - accumulates statistics only
quality_sum = 0
count = 0
for record in stream:
    quality_sum += biometal.mean_quality(record.quality_scores)
    count += 1
mean_quality = quality_sum / count
```
"QC is slow on compressed files"
Expected: Gzipped files are decompression-bound (~250K reads/s)
Solutions:
- Use uncompressed files if disk space allows (4× faster)
- Accept that decompression is the bottleneck (biometal is already optimized)
- Compressed throughput is still faster than most tools
"Statistics don't match samtools flagstat"
Cause: Different filtering defaults
Solution: Use same filters as samtools:
```python
config = biometal.PileupConfig.new()
config = config.with_min_mapq(0)             # samtools default
config = config.with_skip_duplicates(False)  # include duplicates
config = config.with_skip_secondary(True)    # skip secondary
```
See Also
- FASTQ_QC.md - Pre-alignment QC patterns
- ALIGNMENT_QC.md - Post-alignment QC patterns
- EXAMPLES.md - Complete QC workflows
Remember: All QC operations stream with constant memory. No need to worry about file size - 1 GB or 1 TB, memory usage is the same ~3 MB.
