Polars

by eyadsibai

name: polars
description: Use when "Polars", "fast dataframe", "lazy evaluation", "Arrow backend", or asking about "pandas alternative", "parallel dataframe", "large CSV processing", "ETL pipeline", "expression API"
version: 1.0.0

Polars Fast DataFrame Library

Lightning-fast DataFrame library with lazy evaluation and parallel execution.

When to Use

  • Pandas is too slow for your dataset
  • Working with 1-100GB datasets that fit in RAM
  • Need lazy evaluation for query optimization
  • Building ETL pipelines
  • Want parallel execution without extra config

Lazy vs Eager Evaluation

| Mode | Function | Executes | Use Case |
| --- | --- | --- | --- |
| Eager | read_csv() | Immediately | Small data, exploration |
| Lazy | scan_csv() | On .collect() | Large data, pipelines |

Key concept: Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).


Core Operations

Data Selection

| Operation | Purpose |
| --- | --- |
| select() | Choose columns |
| filter() | Choose rows by condition |
| with_columns() | Add/modify columns |
| drop() | Remove columns |
| head(n) / tail(n) | First/last n rows |

Aggregation

| Operation | Purpose |
| --- | --- |
| group_by().agg() | Group and aggregate |
| pivot() | Reshape wide |
| unpivot() (melt() before Polars 1.0) | Reshape long |
| unique() | Distinct values |

Joins

| Join Type | Description |
| --- | --- |
| inner | Matching rows only |
| left | All left + matching right |
| full (named outer in older releases) | All rows from both |
| cross | Cartesian product |
| semi | Left rows with match |
| anti | Left rows without match |

Expression API

Key concept: Polars uses expressions (pl.col()) instead of indexing. Expressions are lazily evaluated and optimized.

Common Expressions

| Expression | Purpose |
| --- | --- |
| pl.col("name") | Reference column |
| pl.lit(value) | Literal value |
| pl.all() | All columns |
| pl.exclude(...) | All except |

Expression Methods

| Category | Methods |
| --- | --- |
| Aggregation | .sum(), .mean(), .min(), .max(), .count() |
| String | .str.contains(), .str.replace(), .str.to_lowercase() |
| DateTime | .dt.year(), .dt.month(), .dt.day() |
| Conditional | .when().then().otherwise() |
| Window | .over(), .rolling_mean(), .shift() |

Pandas Migration

| Pandas | Polars |
| --- | --- |
| df['col'] | df.select('col') |
| df[df['col'] > 5] | df.filter(pl.col('col') > 5) |
| df['new'] = df['col'] * 2 | df.with_columns((pl.col('col') * 2).alias('new')) |
| df.groupby('col').mean() | df.group_by('col').agg(pl.all().mean()) |
| df.apply(func, axis=1) | df.map_rows(func) (avoid if possible) |

Key concept: Polars prefers explicit operations over implicit indexing. Use .alias() to name computed columns.


File I/O

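| Format | Read | Write | Notes |
| --- | --- | --- | --- |
| CSV | read_csv() / scan_csv() | write_csv() | Human readable |
| Parquet | read_parquet() / scan_parquet() | write_parquet() | Fast, compressed |
| JSON | read_json() | write_json() | Standard JSON |
| NDJSON | read_ndjson() / scan_ndjson() | write_ndjson() | Newline-delimited |
| IPC/Arrow | read_ipc() / scan_ipc() | write_ipc() | Zero-copy |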

Key concept: Use Parquet for performance. Use scan_* for large files to enable lazy optimization.


Performance Tips

| Tip | Why |
| --- | --- |
| Use lazy mode | Query optimization |
| Use Parquet | Column-oriented, compressed |
| Select columns early | Projection pushdown |
| Filter early | Predicate pushdown |
| Avoid Python UDFs | Run in the Python interpreter; break parallelism and optimization |
| Use expressions | Vectorized operations |
| Set dtypes on read | Avoid inference overhead |

vs Alternatives

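| Tool | Best For | Limitations |
| --- | --- | --- |
| Polars | 1-100GB, speed critical | In-memory by default; larger-than-RAM data requires the streaming engine in lazy mode |
| Pandas | Small data, ecosystem | Slow, memory hungry |
| Dask | Larger than RAM | More complex API |
| Spark | Cluster computing | Infrastructure overhead |
| DuckDB | SQL interface | Different API style |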

Skill Information

Category: Technical
Version: 1.0.0
Last Updated: 1/15/2026