---
name: python-polars
description: This skill should be used when the user asks to "work with polars", "create a dataframe", "use lazy evaluation", "migrate from pandas", "optimize data pipelines", "read parquet files", "group by operations", or needs guidance on Polars DataFrame operations, expression API, performance optimization, or data transformation workflows.
---
# Python Polars
Polars is a lightning-fast DataFrame library for Python built on Apache Arrow. It provides an expression-based API, lazy evaluation framework, and automatic parallelization for high-performance data processing.
## Quick start

### Installation

```bash
uv pip install polars
```
### Basic operations

```python
import polars as pl

# Create DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "SF"]
})

# Select columns
df.select("name", "age")

# Filter rows
df.filter(pl.col("age") > 25)

# Add computed columns
df.with_columns(
    age_plus_10=pl.col("age") + 10
)
```
## Core concepts

### Expressions

Expressions are composable units describing data transformations. Use `pl.col("column_name")` to reference columns and chain methods for complex operations:

```python
df.select(
    pl.col("name"),
    (pl.col("age") * 12).alias("age_in_months")
)
```

Expressions execute within contexts: `select()`, `with_columns()`, `filter()`, and `group_by().agg()`.
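To illustrate how the context changes what an expression produces, here is a minimal sketch (the sample data is hypothetical):

```python
import polars as pl

df = pl.DataFrame({"city": ["NY", "NY", "LA"], "age": [25, 30, 35]})

mean_age = pl.col("age").mean()  # an expression, not a value

df.select(mean_age)                          # one-row result
df.with_columns(mean_age.alias("avg_age"))   # broadcast to every row
df.group_by("city").agg(mean_age)            # one value per group
```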
### Lazy vs eager evaluation

**Eager (`DataFrame`)**: Operations execute immediately.

```python
df = pl.read_csv("file.csv")  # Reads immediately
result = df.filter(pl.col("age") > 25)
```

**Lazy (`LazyFrame`)**: Operations build an optimized query plan.

```python
lf = pl.scan_csv("file.csv")  # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect()  # Executes optimized query
```
Use lazy mode for large datasets, complex pipelines, and when performance is critical. Benefits include automatic query optimization, predicate pushdown, projection pushdown, and parallel execution.
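To see the optimizer at work, `LazyFrame.explain()` prints the query plan; a short sketch (the file name is hypothetical):

```python
import polars as pl

lf = (
    pl.scan_csv("file.csv")  # hypothetical file
    .filter(pl.col("age") > 25)
    .select("name", "age")
)

# Prints the optimized plan, showing predicate and projection pushdown
print(lf.explain())
```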
For detailed concepts including data types, type casting, null handling, and parallelization, see references/core-concepts.md.
## Common operations

### Select and with_columns

```python
# Select specific columns
df.select("name", "age")

# Select with expressions
df.select(
    pl.col("name"),
    (pl.col("age") * 2).alias("double_age")
)

# Add new columns (preserves existing)
df.with_columns(
    age_doubled=pl.col("age") * 2,
    name_upper=pl.col("name").str.to_uppercase()
)
```
### Filter

```python
# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (AND)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY"
)

# OR conditions
df.filter(
    (pl.col("age") > 25) | (pl.col("city") == "LA")
)
```
### Group by and aggregations

```python
df.group_by("city").agg(
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count")
)
```
### Window functions

Apply aggregations while preserving row count:

```python
df.with_columns(
    avg_age_by_city=pl.col("age").mean().over("city"),
    rank_in_city=pl.col("salary").rank().over("city")
)
```
For comprehensive operations including sorting, conditionals, string/date operations, and list handling, see references/operations.md.
## Data I/O

### CSV

```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")

# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```
### Parquet (recommended for performance)

```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")

# Lazy with predicate pushdown
lf = pl.scan_parquet("file.parquet")
```
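With `scan_parquet`, filters and column selections are pushed down into the reader, so only the needed columns are decoded and row groups that cannot match can be skipped. A sketch with hypothetical names:

```python
import polars as pl

# Only "name" and "age" are read; row groups whose statistics
# rule out age > 25 can be skipped entirely.
result = (
    pl.scan_parquet("file.parquet")  # hypothetical file
    .filter(pl.col("age") > 25)
    .select("name", "age")
    .collect()
)
```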
For comprehensive I/O including JSON, Excel, databases, cloud storage, and streaming, see references/io-guide.md.
## Transformations

### Joins

```python
# Inner join
df1.join(df2, on="id", how="inner")

# Left join
df1.join(df2, on="id", how="left")

# Different column names
df1.join(df2, left_on="user_id", right_on="id")
```

### Concatenation

```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")

# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")
```
### Pivot and unpivot

```python
# Pivot (wide format); older Polars versions used columns= instead of on=
df.pivot(on="product", index="date", values="sales")

# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])
```
For detailed transformation patterns including asof joins, exploding, and transposing, see references/transformations.md.
## Best practices

### Performance optimization

- **Use lazy evaluation for large datasets:**

  ```python
  lf = pl.scan_csv("large.csv")  # Not read_csv
  result = lf.filter(...).select(...).collect()
  ```

- **Avoid Python functions in hot paths** - stay within the expression API for parallelization:

  ```python
  # Good: Native expression (parallelized)
  df.with_columns(result=pl.col("value") * 2)

  # Avoid: Python function (sequential)
  df.with_columns(result=pl.col("value").map_elements(lambda x: x * 2))
  ```

- **Select only needed columns early:**

  ```python
  lf.select("col1", "col2").filter(...)  # Good
  lf.filter(...).select("col1", "col2")  # Less optimal
  ```

- **Use streaming for very large data:**

  ```python
  lf.collect(streaming=True)  # newer Polars releases: lf.collect(engine="streaming")
  ```

- **Use appropriate data types** - Categorical for low-cardinality strings, appropriate integer sizes (see the sketch below).
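A minimal sketch of the data-type advice above, with hypothetical data:

```python
import polars as pl

df = pl.DataFrame({"city": ["NY", "LA", "NY"], "age": [25, 30, 35]})

df = df.with_columns(
    pl.col("city").cast(pl.Categorical),  # low-cardinality strings
    pl.col("age").cast(pl.Int16),         # values fit comfortably in 16 bits
)
```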
### Conditional operations

```python
pl.when(condition).then(value).otherwise(other_value)
```
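A concrete example of the pattern (sample data hypothetical):

```python
import polars as pl

df = pl.DataFrame({"age": [25, 30, 35]})

df.with_columns(
    age_group=pl.when(pl.col("age") >= 30)
    .then(pl.lit("30+"))
    .otherwise(pl.lit("under 30"))
)
```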
### Null handling

```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```
For comprehensive best practices including anti-patterns, memory management, testing, and code organization, see references/best-practices.md.
## Pandas migration

Polars offers significant performance improvements over pandas with a cleaner API. Key differences:

- **No index**: Polars uses integer positions only
- **Strict typing**: No silent type conversions
- **Lazy evaluation**: Available via `LazyFrame`
- **Parallel by default**: Operations are parallelized automatically

### Common operation mappings
| Operation | pandas | Polars |
|---|---|---|
| Select | `df["col"]` | `df.select("col")` |
| Filter | `df[df["col"] > 10]` | `df.filter(pl.col("col") > 10)` |
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
| Group by | `df.groupby("col").agg(...)` | `df.group_by("col").agg(...)` |
| Window | `df.groupby("col").transform(...)` | `df.with_columns(expr.over("col"))` |
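Putting a few of these mappings together, a small side-by-side sketch (the data is hypothetical):

```python
import pandas as pd
import polars as pl

# pandas
pdf = pd.DataFrame({"city": ["NY", "LA", "NY"], "age": [25, 30, 35]})
out_pd = (
    pdf[pdf["age"] > 20]
    .groupby("city", as_index=False)["age"]
    .mean()
)

# The equivalent Polars pipeline
df = pl.DataFrame({"city": ["NY", "LA", "NY"], "age": [25, 30, 35]})
out_pl = (
    df.filter(pl.col("age") > 20)
    .group_by("city")
    .agg(pl.col("age").mean())
)
```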
For comprehensive migration guide including operation mappings, migration patterns, and anti-patterns to avoid, see references/pandas-migration.md.
## References

This skill includes comprehensive reference documentation:

- `references/core-concepts.md` - Expressions, data types, lazy evaluation, parallelization
- `references/operations.md` - Selection, filtering, grouping, window functions, string/date operations
- `references/best-practices.md` - Performance optimization, anti-patterns, memory management
- `references/io-guide.md` - CSV, Parquet, JSON, Excel, databases, cloud storage
- `references/transformations.md` - Joins, concatenation, pivots, reshaping operations
- `references/pandas-migration.md` - Migration guide from pandas to Polars
Load these references as needed for detailed information on specific topics.