Pandas Best Practices

by Mindrally

data

Best practices for Pandas data manipulation, analysis, and DataFrame operations in Python

Skill Details

Repository Files

1 file in this skill directory


name: pandas-best-practices description: Best practices for Pandas data manipulation, analysis, and DataFrame operations in Python

Pandas Best Practices

Expert guidelines for Pandas development, focusing on data manipulation, analysis, and efficient DataFrame operations.

Code Style and Structure

  • Write concise, technical responses with accurate Python examples
  • Prioritize reproducibility in data analysis workflows
  • Use functional programming; avoid unnecessary classes
  • Prefer vectorized operations over explicit loops
  • Use descriptive variable names reflecting data content
  • Follow PEP 8 style guidelines

DataFrame Creation and I/O

  • Use pd.read_csv(), pd.read_excel(), pd.read_json() with appropriate parameters
  • Specify dtype parameter to ensure correct data types on load
  • Use parse_dates for automatic datetime parsing
  • Set index_col when the data has a natural index column
  • Use chunksize for reading large files incrementally

Data Selection

  • Use .loc[] for label-based indexing
  • Use .iloc[] for integer position-based indexing
  • Avoid chained indexing (e.g., df['col'][0]) - use .loc or .iloc instead
  • Use boolean indexing for conditional selection: df[df['col'] > value]
  • Use .query() method for complex filtering conditions

Method Chaining

  • Prefer method chaining for data transformations when possible
  • Use .pipe() for applying custom functions in a chain
  • Chain operations like .assign(), .query(), .groupby(), .agg()
  • Keep chains readable by breaking across multiple lines

Data Cleaning and Validation

Missing Data

  • Check for missing data with .isna() and .info()
  • Handle missing data appropriately: .fillna(), .dropna(), or imputation
  • Use pd.NA for nullable integer and boolean types
  • Document decisions about missing data handling

Data Quality Checks

  • Implement data quality checks at the beginning of analysis
  • Validate data types with .dtypes and convert as needed
  • Check for duplicates with .duplicated() and handle appropriately
  • Use .describe() for quick statistical overview

Type Conversion

  • Use .astype() for explicit type conversion
  • Use pd.to_datetime() for date parsing
  • Use pd.to_numeric() with errors='coerce' for safe numeric conversion
  • Utilize categorical data types for low-cardinality string columns

Grouping and Aggregation

GroupBy Operations

  • Use .groupby() for efficient aggregation operations
  • Specify aggregation functions with .agg() for multiple operations
  • Use named aggregation for clearer output column names
  • Consider .transform() for broadcasting results back to original shape

Pivot Tables and Reshaping

  • Use .pivot_table() for multi-dimensional aggregation
  • Use .melt() to convert wide to long format
  • Use .pivot() to convert long to wide format
  • Use .stack() and .unstack() for hierarchical index manipulation

Performance Optimization

Memory Efficiency

  • Use categorical data types for low-cardinality strings
  • Downcast numeric types when appropriate
  • Use pd.eval() and .eval() for large expression evaluation

Computation Speed

  • Use vectorized operations instead of .apply() with row-wise functions
  • Prefer built-in aggregation functions over custom ones
  • Use .values or .to_numpy() for NumPy operations when faster

Avoiding Common Pitfalls

  • Avoid iterating with .iterrows() - use vectorized operations
  • Don't modify DataFrames while iterating
  • Be aware of SettingWithCopyWarning - use .copy() when needed
  • Avoid growing DataFrames row by row - collect in list and create once

Time Series Operations

  • Use DatetimeIndex for time series data
  • Leverage .resample() for time-based aggregation
  • Use .shift() and .diff() for lag operations
  • Use .rolling() and .expanding() for window calculations

Merging and Joining

  • Use .merge() for SQL-style joins
  • Specify how parameter: 'inner', 'outer', 'left', 'right'
  • Use validate parameter to check join cardinality
  • Use .concat() for stacking DataFrames

Key Conventions

  • Import as import pandas as pd
  • Use snake_case for column names when possible
  • Document data sources and transformations
  • Keep notebooks reproducible with clear cell execution order

Related Skills

Xlsx

Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas

data

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

datacli

Clickhouse Io

ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.

datacli

Analyzing Financial Statements

This skill calculates key financial ratios and metrics from financial statement data for investment analysis

data

Data Storytelling

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

data

Kpi Dashboard Design

Design effective KPI dashboards with metrics selection, visualization best practices, and real-time monitoring patterns. Use when building business dashboards, selecting metrics, or designing data visualization layouts.

designdata

Dbt Transformation Patterns

Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.

testingdocumenttool

Sql Optimization Patterns

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

designdata

Anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.

arttooldata

Xlsx

Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.

tooldata

Skill Information

Category:Data
Last Updated:1/23/2026