Polars
by eyadsibai
name: polars
description: Use when "Polars", "fast dataframe", "lazy evaluation", "Arrow backend", or asking about "pandas alternative", "parallel dataframe", "large CSV processing", "ETL pipeline", "expression API"
version: 1.0.0
Polars Fast DataFrame Library
Lightning-fast DataFrame library with lazy evaluation and parallel execution.
When to Use
- Pandas is too slow for your dataset
- Working with 1-100GB datasets that fit in RAM
- Need lazy evaluation for query optimization
- Building ETL pipelines
- Want parallel execution without extra config
Lazy vs Eager Evaluation
| Mode | Function | Executes | Use Case |
|---|---|---|---|
| Eager | read_csv() | Immediately | Small data, exploration |
| Lazy | scan_csv() | On .collect() | Large data, pipelines |
Key concept: Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).
Core Operations
Data Selection
| Operation | Purpose |
|---|---|
| select() | Choose columns |
| filter() | Choose rows by condition |
| with_columns() | Add/modify columns |
| drop() | Remove columns |
| head(n) / tail(n) | First/last n rows |
Aggregation
| Operation | Purpose |
|---|---|
| group_by().agg() | Group and aggregate |
| pivot() | Reshape wide |
| unpivot() (melt() before Polars 1.0) | Reshape long |
| unique() | Distinct values |
Joins
| Join Type | Description |
|---|---|
| inner | Matching rows only |
| left | All left + matching right |
| full | All rows from both (named outer in older releases) |
| cross | Cartesian product |
| semi | Left rows with match |
| anti | Left rows without match |
Expression API
Key concept: Polars uses expressions (pl.col()) instead of indexing. Expressions are lazily evaluated and optimized.
Common Expressions
| Expression | Purpose |
|---|---|
| pl.col("name") | Reference column |
| pl.lit(value) | Literal value |
| pl.all() | All columns |
| pl.exclude(...) | All except |
Expression Methods
| Category | Methods |
|---|---|
| Aggregation | .sum(), .mean(), .min(), .max(), .count() |
| String | .str.contains(), .str.replace(), .str.to_lowercase() |
| DateTime | .dt.year(), .dt.month(), .dt.day() |
| Conditional | pl.when().then().otherwise() |
| Window | .over(), .rolling_mean(), .shift() |
Pandas Migration
| Pandas | Polars |
|---|---|
| df['col'] | df.select('col') |
| df[df['col'] > 5] | df.filter(pl.col('col') > 5) |
| df['new'] = df['col'] * 2 | df.with_columns((pl.col('col') * 2).alias('new')) |
| df.groupby('col').mean() | df.group_by('col').agg(pl.all().mean()) |
| df.apply(func) | df.map_rows(func) (avoid if possible) |
Key concept: Polars prefers explicit operations over implicit indexing. Use .alias() to name computed columns.
File I/O
| Format | Read | Write | Notes |
|---|---|---|---|
| CSV | read_csv() / scan_csv() | write_csv() | Human readable |
| Parquet | read_parquet() / scan_parquet() | write_parquet() | Fast, compressed |
| JSON | read_json() | write_json() | Standard JSON |
| NDJSON | read_ndjson() / scan_ndjson() | write_ndjson() | Newline-delimited |
| IPC/Arrow | read_ipc() / scan_ipc() | write_ipc() | Zero-copy |
Key concept: Use Parquet for performance. Use scan_* for large files to enable lazy optimization.
Performance Tips
| Tip | Why |
|---|---|
| Use lazy mode | Query optimization |
| Use Parquet | Column-oriented, compressed |
| Select columns early | Projection pushdown |
| Filter early | Predicate pushdown |
| Avoid Python UDFs | Breaks parallelism |
| Use expressions | Vectorized operations |
| Set dtypes on read | Avoid inference overhead |
vs Alternatives
| Tool | Best For | Limitations |
|---|---|---|
| Polars | 1-100GB, speed critical | Must fit in RAM |
| Pandas | Small data, ecosystem | Slow, memory hungry |
| Dask | Larger than RAM | More complex API |
| Spark | Cluster computing | Infrastructure overhead |
| DuckDB | SQL interface | Different API style |
Resources
- Docs: https://pola.rs/
- User Guide: https://docs.pola.rs/user-guide/
- Cookbook: https://docs.pola.rs/user-guide/misc/cookbook/
