Dask

by eyadsibai



---
name: dask
description: Use when "Dask", "parallel computing", "distributed computing", "larger than memory", or asking about "parallel pandas", "parallel numpy", "out-of-core", "multi-file processing", "cluster computing", "lazy evaluation dataframe"
version: 1.0.0
---

Dask Parallel and Distributed Computing

Scale pandas/NumPy workflows beyond memory and across clusters.

When to Use

  • Datasets exceed available RAM
  • Need to parallelize pandas or NumPy operations
  • Processing multiple files efficiently (CSVs, Parquet)
  • Building custom parallel workflows
  • Distributing workloads across multiple cores/machines

Dask Collections

| Collection | Like | Use Case |
|---|---|---|
| DataFrame | pandas | Tabular data, CSV/Parquet |
| Array | NumPy | Numerical arrays, matrices |
| Bag | list | Unstructured data, JSON logs |
| Delayed | Custom code | Arbitrary Python functions |

Key concept: All collections are lazy—computation happens only when you call .compute().


Lazy Evaluation

| Function | Behavior | Use |
|---|---|---|
| `dd.read_csv()` | Lazy load | Large CSVs |
| `dd.read_parquet()` | Lazy load | Large Parquet |
| Operations | Build graph | Chain transforms |
| `.compute()` | Execute | Get final result |

Key concept: Dask builds a task graph of operations, optimizes it, then executes in parallel. Call .compute() once at the end, not after every operation.


Schedulers

| Scheduler | Best For | How to Select |
|---|---|---|
| threaded | NumPy/pandas (releases the GIL) | Default |
| processes | Pure-Python code (GIL-bound) | `scheduler='processes'` |
| synchronous | Debugging | `scheduler='synchronous'` |
| distributed | Monitoring, scaling, clusters | `Client()` |
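The scheduler can be overridden per `.compute()` call, which is an easy way to compare them on a small Bag:

```python
import dask.bag as db

bag = db.from_sequence(range(8), npartitions=4)
squared = bag.map(lambda x: x * x)

print(squared.compute(scheduler="synchronous"))  # single-threaded, easy to debug
print(squared.compute(scheduler="threads"))      # shared-memory thread pool
# scheduler="processes" works the same way but serializes data between
# worker processes, so it pays off mainly for GIL-bound pure-Python work.
```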

Distributed Scheduler

| Feature | Benefit |
|---|---|
| Dashboard | Real-time progress monitoring |
| Cluster scaling | Add/remove workers |
| Fault tolerance | Retry failed tasks |
| Worker resources | Memory management |
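A minimal local sketch of the distributed scheduler; `processes=False` keeps the cluster in one Python process, which is handy for demos, and the worker counts and array shape are arbitrary:

```python
import dask.array as da
from dask.distributed import Client

# Client() with no arguments starts a local cluster; the same API
# connects to a remote scheduler, e.g. Client("tcp://scheduler:8786").
client = Client(processes=False, n_workers=1, threads_per_worker=2)
print(client.dashboard_link)  # open in a browser for the live task stream

x = da.random.random((1_000, 1_000), chunks=(250, 250))
print(x.mean().compute())  # roughly 0.5, executed on the cluster's workers

client.close()
```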

Chunking Concepts

DataFrame Partitions

| Concept | Description |
|---|---|
| Partition | Subset of rows (like a small pandas DataFrame) |
| `npartitions` | Number of partitions |
| `divisions` | Index boundaries between partitions |

Array Chunks

| Concept | Description |
|---|---|
| Chunk | Subset of array (n-dimensional block) |
| `chunks` | Tuple of chunk sizes per dimension |
| Optimal size | ~100 MB per chunk |

Key concept: Chunk size is critical. Too small = scheduling overhead. Too large = memory issues. Target ~100 MB.
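A quick sketch of the chunk bookkeeping with `dask.array`; the shapes are chosen so each float64 block is ~50 MB, near the recommended size:

```python
import dask.array as da

# 10_000 x 10_000 float64 is ~800 MB total; (2500, 2500) blocks are
# 2500 * 2500 * 8 bytes = ~50 MB each. Creation is lazy and cheap.
x = da.ones((10_000, 10_000), chunks=(2500, 2500))
print(x.chunks)     # ((2500, 2500, 2500, 2500), (2500, 2500, 2500, 2500))
print(x.numblocks)  # (4, 4) -- 16 blocks in total

# rechunk() rebuilds the task graph with different block sizes.
y = x.rechunk((5_000, 5_000))
print(y.numblocks)  # (2, 2)
```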


DataFrame Operations

Supported (parallel)

| Category | Operations |
|---|---|
| Selection | filter, `loc`, column selection |
| Aggregation | `groupby`, `sum`, `mean`, `count` |
| Transforms | `apply` (row-wise), `map_partitions` |
| Joins | `merge`, `join` (shuffles data) |
| I/O | `read_csv`, `read_parquet`, `to_parquet` |

Avoid or Use Carefully

| Operation | Issue | Alternative |
|---|---|---|
| `iterrows` | Kills parallelism | `map_partitions` |
| `apply(axis=1)` | Slow per-row Python calls | `map_partitions` |
| Repeated `compute()` | Recomputes shared work | Single `compute()` at end |
| `sort_values` | Expensive shuffle | Avoid if possible |

Common Patterns

ETL Pipeline

  1. `dd.read_*` (lazy load)
  2. Chain filters and transforms
  3. A single `.compute()` or `.to_parquet()`

Multi-File Processing

| Pattern | Description |
|---|---|
| Glob patterns | `dd.read_csv('data/*.csv')` |
| Partition per file | Natural parallelism |
| Partitioned output | `to_parquet('output/')` |

Custom Operations

| Method | Use Case |
|---|---|
| `map_partitions` | Apply a function to each DataFrame partition |
| `map_blocks` | Apply a function to each array block |
| `delayed` | Wrap arbitrary Python functions |
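A small `dask.delayed` sketch; the `load`/`clean`/`total` functions are made-up stand-ins for arbitrary Python steps:

```python
from dask import delayed

@delayed
def load(i):
    return list(range(i))

@delayed
def clean(xs):
    return [x * 2 for x in xs]

@delayed
def total(parts):
    return sum(sum(p) for p in parts)

# Building the graph is lazy; the three load/clean branches are
# independent, so Dask can run them in parallel.
parts = [clean(load(i)) for i in (3, 4, 5)]
result = total(parts)
print(result.compute())  # 38
```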

Best Practices

| Practice | Why |
|---|---|
| Don't load with pandas first | Let Dask handle loading lazily |
| Single `compute()` at the end | Avoids redundant computation |
| Use Parquet | Columnar and much faster than CSV |
| Match partitions to files | One partition per file gives natural parallelism |
| Check task graph size | Keep `len(ddf.__dask_graph__())` under ~100k tasks |
| Use the distributed scheduler locally | Its dashboard shows live progress |

Common Pitfalls

| Pitfall | Solution |
|---|---|
| Loading with pandas first | Use `dd.read_*` directly |
| `compute()` in loops | Collect results, single `compute()` |
| Too many partitions | Repartition to ~100 MB each |
| Memory errors | Reduce chunk size or add workers |
| Slow shuffles | Avoid sorts/joins when possible |

vs Alternatives

| Tool | Best For | Trade-off |
|---|---|---|
| Dask | Scaling pandas/NumPy, clusters | Setup complexity |
| Polars | Fast in-memory analytics | Must fit in RAM |
| Vaex | Out-of-core single machine | Limited operations |
| Spark | Enterprise, SQL-heavy workloads | Infrastructure |

Resources

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Use when working with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc.) for: (1) creating new spreadsheets with formulas and formatting, (2) reading or analyzing data, (3) modifying existing spreadsheets while preserving formulas, (4) data analysis and visualization in spreadsheets, or (5) recalculating formulas.

data

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

data, cli

Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis

data

Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

data

Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

design, data

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

testing, document, tool

Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

design, data

Anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.

art, tool, data

Skill Information

Category: Data
Version: 1.0.0
Last Updated: 1/15/2026