QC Statistics

by shandley


QC and sequencing statistics workflows using biometal primitives. Trigger when performing quality control, calculating sequencing statistics, generating QC reports, assessing data quality, or evaluating run metrics. All primitives exist - 100% coverage.


name: qc-statistics
description: QC and sequencing statistics workflows using biometal primitives. Trigger when performing quality control, calculating sequencing statistics, generating QC reports, assessing data quality, or evaluating run metrics. All primitives exist - 100% coverage.

QC & Sequencing Statistics

Purpose: Quality control and statistics workflows with constant memory.

Primitives Status: ✅ 100% COMPLETE - All QC primitives implemented


Core Principle

All QC operations use streaming - constant ~3 MB memory regardless of file size.

  • No loading: process one record at a time
  • No limits: run QC on TB-scale datasets from a laptop
  • Production-ready: validated on 22M+ reads


Quick Reference

QC Type        | Primitives Available
Pre-alignment  | FASTQ parsing, quality filtering, base counting, GC content, complexity, trimming
Post-alignment | BAM parsing, pileup, coverage, MAPQ filtering, flag filtering, insert size
Both           | Sequence operations, statistics, format validation

Pre-Alignment QC (FASTQ)

Basic Statistics

import biometal

stream = biometal.FastqStream.from_path("reads.fq.gz")

stats = {
    "total_reads": 0,
    "total_bases": 0,
    "quality_sum": 0,
    "gc_sum": 0,
    "length_sum": 0
}

for record in stream:
    stats["total_reads"] += 1
    stats["total_bases"] += len(record.sequence)
    stats["quality_sum"] += biometal.mean_quality(record.quality_scores) * len(record.sequence)
    stats["gc_sum"] += biometal.gc_content(record.sequence) * len(record.sequence)
    stats["length_sum"] += len(record.sequence)

# Calculate means
mean_quality = stats["quality_sum"] / stats["total_bases"]
mean_gc = stats["gc_sum"] / stats["total_bases"]
mean_length = stats["length_sum"] / stats["total_reads"]

print(f"Total reads: {stats['total_reads']:,}")
print(f"Total bases: {stats['total_bases']:,}")
print(f"Mean quality: {mean_quality:.1f}")
print(f"Mean GC: {mean_gc:.1%}")
print(f"Mean length: {mean_length:.1f} bp")

Quality Distribution

import biometal

stream = biometal.FastqStream.from_path("reads.fq.gz")

# Quality bins: 0-9, 10-19, 20-29, 30-39, 40+
quality_bins = {
    "Q0-9": 0,
    "Q10-19": 0,
    "Q20-29": 0,
    "Q30-39": 0,
    "Q40+": 0
}

for record in stream:
    mean_q = biometal.mean_quality(record.quality_scores)

    if mean_q < 10:
        quality_bins["Q0-9"] += 1
    elif mean_q < 20:
        quality_bins["Q10-19"] += 1
    elif mean_q < 30:
        quality_bins["Q20-29"] += 1
    elif mean_q < 40:
        quality_bins["Q30-39"] += 1
    else:
        quality_bins["Q40+"] += 1

# Print distribution
total = sum(quality_bins.values())
for bin_name, count in quality_bins.items():
    pct = count / total * 100
    print(f"{bin_name}: {count:,} ({pct:.1f}%)")

Base Composition

import biometal

stream = biometal.FastqStream.from_path("reads.fq.gz")

# Accumulate base counts
total_counts = {"A": 0, "C": 0, "G": 0, "T": 0, "N": 0}

for record in stream:
    counts = biometal.count_bases(record.sequence)
    total_counts["A"] += counts.a
    total_counts["C"] += counts.c
    total_counts["G"] += counts.g
    total_counts["T"] += counts.t
    total_counts["N"] += counts.n

# Calculate percentages
total_bases = sum(total_counts.values())
for base, count in total_counts.items():
    pct = count / total_bases * 100
    print(f"{base}: {count:,} ({pct:.2f}%)")

# GC content
gc_content = (total_counts["G"] + total_counts["C"]) / total_bases
print(f"\nGC content: {gc_content:.1%}")

Per-Position Quality

import biometal
import numpy as np

stream = biometal.FastqStream.from_path("reads.fq.gz")

# Track running quality sums and counts by position
# (fixed-size arrays - storing every score per position would grow without bound)
max_length = 500  # Adjust based on your read length
quality_sums = np.zeros(max_length)
quality_counts = np.zeros(max_length, dtype=np.int64)

for record in stream:
    for pos, qual in enumerate(record.quality_scores):
        if pos < max_length:
            quality_sums[pos] += qual
            quality_counts[pos] += 1

# Calculate mean quality per position (0 where no read reaches that position)
mean_per_position = np.divide(
    quality_sums,
    quality_counts,
    out=np.zeros(max_length),
    where=quality_counts > 0,
)

# Report the first 100 bp
print("Position\tMean Quality")
for pos, mean_q in enumerate(mean_per_position[:100]):
    if quality_counts[pos] > 0:
        print(f"{pos+1}\t{mean_q:.1f}")

See FASTQ_QC.md for complete pre-alignment workflows.


Post-Alignment QC (BAM)

Alignment Statistics

import biometal

reader = biometal.BamReader.from_path("alignments.bam")

stats = {
    "total": 0,
    "mapped": 0,
    "paired": 0,
    "proper_pair": 0,
    "duplicate": 0,
    "secondary": 0,
    "supplementary": 0,
    "mapq_sum": 0
}

for record in reader:
    stats["total"] += 1

    if record.is_mapped:
        stats["mapped"] += 1
        stats["mapq_sum"] += record.mapq

    if record.is_paired:
        stats["paired"] += 1

    if record.is_proper_pair:
        stats["proper_pair"] += 1

    if record.is_duplicate:
        stats["duplicate"] += 1

    if record.is_secondary:
        stats["secondary"] += 1

    if record.is_supplementary:
        stats["supplementary"] += 1

# Calculate rates
mapping_rate = stats["mapped"] / stats["total"]
proper_pair_rate = stats["proper_pair"] / stats["paired"] if stats["paired"] > 0 else 0
duplicate_rate = stats["duplicate"] / stats["total"]
mean_mapq = stats["mapq_sum"] / stats["mapped"] if stats["mapped"] > 0 else 0

print(f"Total reads: {stats['total']:,}")
print(f"Mapped: {stats['mapped']:,} ({mapping_rate:.1%})")
print(f"Properly paired: {stats['proper_pair']:,} ({proper_pair_rate:.1%})")
print(f"Duplicates: {stats['duplicate']:,} ({duplicate_rate:.1%})")
print(f"Mean MAPQ: {mean_mapq:.1f}")

Coverage Statistics

import biometal

reader = biometal.BamReader.from_path("alignments.bam")

# Configure pileup
config = biometal.PileupConfig.new()
config = config.with_min_mapq(20)
config = config.with_skip_duplicates(True)

# Calculate coverage over region
coverage = biometal.pileup_depth(reader, "chr1", 1000000, 2000000, config)

# Coverage statistics (the depth list is bounded by the region size)
import numpy as np

depths = [interval.depth for interval in coverage]

mean_depth = np.mean(depths)
median_depth = np.median(depths)
std_depth = np.std(depths)

print(f"Mean coverage: {mean_depth:.1f}×")
print(f"Median coverage: {median_depth:.1f}×")
print(f"Std deviation: {std_depth:.1f}×")

# Coverage distribution
bins = [0, 10, 20, 30, 50, 100]
for i in range(len(bins)):
    if i < len(bins) - 1:
        low, high = bins[i], bins[i+1]
        count = sum(1 for d in depths if low <= d < high)
        pct = count / len(depths) * 100
        print(f"{low}-{high}×: {count:,} positions ({pct:.1f}%)")
    else:
        count = sum(1 for d in depths if d >= bins[i])
        pct = count / len(depths) * 100
        print(f"≥{bins[i]}×: {count:,} positions ({pct:.1f}%)")

Insert Size Distribution

import biometal

reader = biometal.BamReader.from_path("alignments.bam")

insert_sizes = []

for record in reader:
    if record.is_proper_pair and record.template_length > 0:
        insert_sizes.append(abs(record.template_length))

import numpy as np
mean_insert = np.mean(insert_sizes)
median_insert = np.median(insert_sizes)
std_insert = np.std(insert_sizes)

print(f"Mean insert size: {mean_insert:.0f} bp")
print(f"Median insert size: {median_insert:.0f} bp")
print(f"Std deviation: {std_insert:.0f} bp")

# Distribution
percentiles = [10, 25, 50, 75, 90, 95, 99]
for p in percentiles:
    value = np.percentile(insert_sizes, p)
    print(f"P{p}: {value:.0f} bp")

See ALIGNMENT_QC.md for complete post-alignment workflows.


Combined Workflows

FastQC-Style Report

import biometal

def generate_fastqc_report(fastq_path):
    """Generate comprehensive FASTQ QC report."""
    stream = biometal.FastqStream.from_path(fastq_path)

    stats = {
        "total_reads": 0,
        "total_bases": 0,
        "quality_sum": 0,
        "gc_sum": 0,
        "min_length": None,
        "max_length": 0,
        "low_quality": 0,
        "high_n_content": 0
    }

    for record in stream:
        stats["total_reads"] += 1
        length = len(record.sequence)
        stats["total_bases"] += length
        if stats["min_length"] is None or length < stats["min_length"]:
            stats["min_length"] = length
        if length > stats["max_length"]:
            stats["max_length"] = length

        mean_q = biometal.mean_quality(record.quality_scores)
        stats["quality_sum"] += mean_q

        gc = biometal.gc_content(record.sequence)
        stats["gc_sum"] += gc

        # Flag low quality
        if mean_q < 20:
            stats["low_quality"] += 1

        # Flag high N content
        counts = biometal.count_bases(record.sequence)
        n_pct = counts.n / length if length > 0 else 0
        if n_pct > 0.05:  # > 5% N
            stats["high_n_content"] += 1

    # Generate report (running totals only - storing per-read lists would break
    # the constant-memory guarantee)
    print("=" * 60)
    print("FASTQ Quality Control Report")
    print("=" * 60)
    print(f"\nBasic Statistics:")
    print(f"  Total reads: {stats['total_reads']:,}")
    print(f"  Total bases: {stats['total_bases']:,}")
    print(f"  Mean quality: {stats['quality_sum'] / stats['total_reads']:.1f}")
    print(f"  Mean GC: {stats['gc_sum'] / stats['total_reads']:.1%}")
    print(f"  Mean length: {stats['total_bases'] / stats['total_reads']:.1f} bp")
    print(f"  Length range: {stats['min_length']}-{stats['max_length']} bp")

    print(f"\nQuality Flags:")
    low_q_pct = stats['low_quality'] / stats['total_reads'] * 100
    high_n_pct = stats['high_n_content'] / stats['total_reads'] * 100
    print(f"  Low quality (Q<20): {stats['low_quality']:,} ({low_q_pct:.1f}%)")
    print(f"  High N content (>5%): {stats['high_n_content']:,} ({high_n_pct:.1f}%)")

    print(f"\nQC Decision: ", end="")
    if low_q_pct > 20 or high_n_pct > 10:
        print("⚠️  WARN - Quality issues detected")
    else:
        print("✅ PASS")
    print("=" * 60)

generate_fastqc_report("reads.fq.gz")

Flagstat-Style Report

import biometal

def generate_flagstat_report(bam_path):
    """Generate samtools flagstat-style report using flagstat primitive."""
    # Use direct flagstat primitive (recommended)
    stats = biometal.flagstat(bam_path)

    # Samtools-compatible output
    print(stats.to_flagstat_string())

    # Or access individual metrics programmatically
    print(f"\nProgrammatic access:")
    print(f"  Total reads: {stats.total():,}")
    print(f"  Mapped: {stats.mapped():,} ({stats.mapping_rate():.1%})")
    print(f"  Duplicates: {stats.duplicates():,} ({stats.duplicate_rate():.1%})")
    print(f"  Properly paired: {stats.properly_paired():,} ({stats.properly_paired_rate():.1%})")
    print(f"  Mean MAPQ: {stats.mean_mapq():.1f}")

generate_flagstat_report("alignments.bam")

Region-Specific Flagstat

import biometal

def region_flagstat(bam_path, chrom, start, end):
    """Calculate flagstat for a specific genomic region."""
    stats = biometal.flagstat_region(
        bam_path,
        f"{bam_path}.bai",
        chrom=chrom,
        start=start,
        end=end
    )

    print(f"Region {chrom}:{start:,}-{end:,}")
    print(f"  Reads: {stats.total():,}")
    print(f"  Mapped: {stats.mapped():,} ({stats.mapping_rate():.1%})")
    print(f"  Mean MAPQ: {stats.mean_mapq():.1f}")

region_flagstat("alignments.bam", "chr1", 1000000, 2000000)

Memory Guarantees

All QC workflows maintain constant ~3 MB memory:

Dataset Size | Memory Used | Validated
1 GB FASTQ   | 3.16 MB     |
10 GB FASTQ  | 3.16 MB     |
100 GB BAM   | ~3 MB       |
1 TB dataset | ~3 MB       | Projected

Why constant: Streaming architecture processes one record at a time.
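
The guarantee can be sketched in plain Python: a generator yields one record at a time, and the QC loop keeps only running totals. The `Record` type and `fastq_stream` below are illustrative stand-ins, not the biometal API or its internals.

```python
from collections import namedtuple

Record = namedtuple("Record", ["sequence", "quality_scores"])

def fastq_stream(n_reads):
    """Stand-in for a streaming parser: yields one record at a time."""
    for _ in range(n_reads):
        yield Record(sequence="ACGT" * 25, quality_scores=[30] * 100)

total_reads = 0
total_bases = 0

for record in fastq_stream(100_000):
    total_reads += 1                     # O(1) state per record
    total_bases += len(record.sequence)  # record is dropped after this iteration

print(f"{total_reads:,} reads, {total_bases:,} bases")
```

However many records the generator produces, the loop ever holds one record plus two integers - that is the whole memory footprint of the accumulation.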


Performance Expectations

Validated January 2026 on Apple M1 Max:

Operation                    | Throughput   | Notes
FASTQ parsing (gzipped)      | 250K reads/s | Decompression-bound
FASTQ parsing (uncompressed) | 1M reads/s   | CPU-bound
BAM parsing                  | 118 MiB/s    | With BGZF decompression
Base counting                | 14.9 GiB/s   | ARM NEON acceleration
GC content                   | 17.5 GiB/s   | ARM NEON acceleration
Quality mean                 | ~10 GiB/s    | Auto-vectorized

Scaling: QC on 100 GB FASTQ takes ~7 minutes (gzipped), constant 3 MB memory.
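
For a rough runtime estimate on your own data, divide record count by parsing throughput. The helper and the figures below are back-of-envelope placeholders: actual bytes per record and reads per second vary with read length, compression ratio, and hardware.

```python
def estimate_qc_seconds(file_bytes, bytes_per_read, reads_per_second):
    """Back-of-envelope QC runtime: records in the file / parsing throughput.

    bytes_per_read is the on-disk size of one FASTQ record (all four lines);
    both it and reads_per_second are dataset- and machine-dependent guesses.
    """
    n_reads = file_bytes / bytes_per_read
    return n_reads / reads_per_second

# Example with placeholder values: 10 GB file, ~250 B/record, 250K reads/s
seconds = estimate_qc_seconds(10e9, 250, 250_000)
print(f"~{seconds / 60:.0f} minutes")  # → ~3 minutes
```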


Common QC Patterns

Pattern 1: Pre-Filter Before Analysis

import biometal

# Filter low-quality reads before downstream analysis
stream = biometal.FastqStream.from_path("reads.fq.gz")
passed = 0

for record in stream:
    mean_q = biometal.mean_quality(record.quality_scores)
    if mean_q >= 30 and len(record.sequence) >= 100:
        # Pass to downstream analysis
        passed += 1

print(f"Passed QC: {passed:,} reads")

Pattern 2: Target Enrichment QC

import biometal

# Check on-target rate
index = biometal.BaiIndex.from_path("alignments.bam.bai")
reader = biometal.BamReader.from_path_with_index("alignments.bam", index)

# Load target regions from BED
targets = list(biometal.Bed6Stream.from_path("targets.bed"))

on_target = 0
total_mapped = 0

for target in targets:
    for record in reader.query(target.chrom, target.start, target.end):
        if record.is_mapped:
            # Note: a read spanning multiple targets is counted once per target
            on_target += 1

# Get total mapped
reader_full = biometal.BamReader.from_path("alignments.bam")
total_mapped = sum(1 for r in reader_full if r.is_mapped)

enrichment = on_target / total_mapped if total_mapped > 0 else 0
print(f"On-target rate: {enrichment:.1%}")

Pattern 3: Contamination Check

import biometal

# Check for adapter contamination
stream = biometal.FastqStream.from_path("reads.fq.gz")
adapter = "AGATCGGAAGAG"  # Illumina adapter

contaminated = 0
total = 0

for record in stream:
    total += 1
    if adapter in record.sequence:
        contaminated += 1

contamination_rate = contaminated / total * 100 if total > 0 else 0
print(f"Adapter contamination: {contamination_rate:.2f}%")

QC Decision Criteria

FASTQ QC Pass/Fail

PASS if:

  • Mean quality ≥ Q30
  • < 10% reads below Q20
  • GC content 35-65% (species-dependent)
  • < 5% reads with high N content (> 5% N)

WARN if:

  • Mean quality Q20-Q30
  • 10-20% reads below Q20
  • GC content outside expected range
  • 5-10% reads with high N content

FAIL if:

  • Mean quality < Q20
  • > 20% reads below Q20
  • > 10% reads with high N content
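
The FASTQ criteria above can be encoded as a small helper. This is a sketch of the thresholds listed here, not a biometal primitive; percentages are on a 0-100 scale, and the GC range is species-dependent, so override it per experiment.

```python
def fastq_qc_verdict(mean_quality, pct_below_q20, pct_high_n,
                     gc_pct=None, gc_range=(35, 65)):
    """Apply the FASTQ pass/warn/fail criteria above.

    gc_pct=None skips the GC check (e.g. unknown expected range).
    """
    gc_ok = gc_pct is None or gc_range[0] <= gc_pct <= gc_range[1]

    if mean_quality < 20 or pct_below_q20 > 20 or pct_high_n > 10:
        return "FAIL"
    if mean_quality < 30 or pct_below_q20 >= 10 or pct_high_n >= 5 or not gc_ok:
        return "WARN"
    return "PASS"

print(fastq_qc_verdict(mean_quality=35, pct_below_q20=2, pct_high_n=1))  # PASS
```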

BAM QC Pass/Fail

PASS if:

  • Mapping rate ≥ 90%
  • Duplicate rate < 20%
  • Mean MAPQ ≥ 30
  • Proper pair rate ≥ 90% (for paired-end)

WARN if:

  • Mapping rate 70-90%
  • Duplicate rate 20-40%
  • Mean MAPQ 20-30
  • Proper pair rate 70-90%

FAIL if:

  • Mapping rate < 70%
  • Duplicate rate > 40%
  • Mean MAPQ < 20
  • Proper pair rate < 70%
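
The BAM criteria translate the same way. Again a sketch of the thresholds listed above, not a library function; rates are fractions (0-1), and proper-pair rate only applies to paired-end data.

```python
def bam_qc_verdict(mapping_rate, duplicate_rate, mean_mapq, proper_pair_rate=None):
    """Apply the BAM pass/warn/fail criteria above.

    Pass proper_pair_rate=None for single-end data to skip that check.
    """
    pp = proper_pair_rate
    if (mapping_rate < 0.70 or duplicate_rate > 0.40 or mean_mapq < 20
            or (pp is not None and pp < 0.70)):
        return "FAIL"
    if (mapping_rate < 0.90 or duplicate_rate >= 0.20 or mean_mapq < 30
            or (pp is not None and pp < 0.90)):
        return "WARN"
    return "PASS"

print(bam_qc_verdict(0.97, 0.08, 45, proper_pair_rate=0.95))  # PASS
```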

Note: Criteria are guidelines - adjust based on experiment type and species.


Troubleshooting

"Memory usage higher than expected"

Cause: Accumulating data in memory (e.g., storing all quality scores)

Solution:

# Bad - stores everything
all_qualities = []
for record in stream:
    all_qualities.extend(record.quality_scores)  # Growing list

# Good - accumulates statistics only
quality_sum = 0
count = 0
for record in stream:
    quality_sum += biometal.mean_quality(record.quality_scores)
    count += 1
mean_quality = quality_sum / count
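
When you also need spread (e.g. the coverage standard deviation above) without storing every value, Welford's online algorithm keeps mean and variance in O(1) memory. This is a generic sketch, not a biometal primitive.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Population standard deviation
        return (self.m2 / self.n) ** 0.5 if self.n > 0 else 0.0

stats = RunningStats()
for depth in [28, 31, 30, 35, 26]:  # e.g. per-position coverage depths
    stats.add(depth)

print(f"mean={stats.mean:.1f} std={stats.std():.2f}")
```

Feed it one value per record (or per pileup position) inside the streaming loop and the memory footprint stays three floats, whatever the dataset size.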

"QC is slow on compressed files"

Expected: Gzipped files are decompression-bound (~250K reads/s)

Solutions:

  • Use uncompressed files if disk space allows (4× faster)
  • Accept that decompression is the bottleneck (biometal is already optimized)
  • Compressed throughput is still faster than most tools

"Statistics don't match samtools flagstat"

Cause: Different filtering defaults

Solution: Use same filters as samtools:

config = biometal.PileupConfig.new()
config = config.with_min_mapq(0)  # samtools default
config = config.with_skip_duplicates(False)  # include duplicates
config = config.with_skip_secondary(True)  # skip secondary

See Also

  • FASTQ_QC.md - complete pre-alignment workflows
  • ALIGNMENT_QC.md - complete post-alignment workflows

Remember: All QC operations stream with constant memory. No need to worry about file size - 1 GB or 1 TB, memory usage is the same ~3 MB.

Skill Information

Category: Data
Last Updated: 1/31/2026