Xarray For Multidimensional Data

by robertpelloni


Work with labeled multidimensional arrays for scientific data analysis using Xarray. Use when handling climate data, satellite imagery, oceanographic data, or any multidimensional datasets with coordinates and metadata. Ideal for NetCDF/HDF5 files, time series analysis, and large datasets requiring lazy loading with Dask.


Master Xarray, the powerful library for working with labeled multidimensional arrays in scientific Python. Learn how to efficiently handle complex datasets with multiple dimensions, coordinates, and metadata - from climate data and satellite imagery to experimental measurements and simulations.

Official Documentation: https://docs.xarray.dev/

GitHub: https://github.com/pydata/xarray

Quick Reference Card

Installation & Setup

# Using pixi (recommended for scientific projects)
pixi add xarray netcdf4 dask

# Using pip
pip install xarray[complete]

# Optional dependencies for specific formats
pixi add zarr h5netcdf scipy bottleneck

# Geospatial extensions (for raster data, CRS handling, reprojection)
pixi add rioxarray xesmf

# DataTree is built into Xarray (no separate installation needed)

Essential Xarray Concepts

import xarray as xr
import numpy as np
import pandas as pd

# DataArray: Single labeled array
temperature = xr.DataArray(
    data=np.random.randn(3, 4),
    dims=["time", "location"],
    coords={
        "time": pd.date_range("2024-01-01", periods=3),
        "location": ["A", "B", "C", "D"]
    },
    name="temperature"
)

# Dataset: Collection of DataArrays
ds = xr.Dataset({
    "temperature": temperature,
    "pressure": (["time", "location"], np.random.randn(3, 4))
})

Essential Operations

# Selection by label
ds.sel(time="2024-01-01")
ds.sel(location="A")

# Selection by index
ds.isel(time=0)

# Slicing
ds.sel(time=slice("2024-01-01", "2024-01-02"))

# Aggregation
ds.mean(dim="time")
ds.sum(dim="location")

# Computation
ds["temperature"] + 273.15  # Celsius to Kelvin
ds.groupby("time.month").mean()

# I/O operations
ds.to_netcdf("data.nc")
ds = xr.open_dataset("data.nc")

Quick Decision Tree

Working with multidimensional scientific data?
├─ YES → Use Xarray for labeled dimensions
└─ NO → NumPy/Pandas sufficient

Need to track coordinates and metadata?
├─ YES → Xarray keeps everything aligned
└─ NO → Plain NumPy arrays work

Working with geospatial raster data?
├─ YES → Use rioxarray for CRS-aware operations
└─ NO → Standard Xarray sufficient

Data has natural hierarchical structure?
├─ YES → Use DataTree for organization
└─ NO → Dataset/DataArray sufficient

Data too large for memory?
├─ YES → Use Xarray with Dask backend
└─ NO → Standard Xarray is fine

Need to save/load scientific data formats?
├─ NetCDF/HDF5 → Xarray native support
├─ Zarr → Use Xarray with zarr backend
└─ CSV/Excel → Pandas then convert to Xarray

Working with time series data?
├─ Multi-dimensional → Xarray
└─ Tabular → Pandas

Need to align data from different sources?
├─ YES → Xarray handles alignment automatically
└─ NO → Manual alignment with NumPy
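To make the alignment point concrete, a minimal sketch: arithmetic between two arrays matches coordinate labels, and the result covers the intersection of those labels.

```python
import xarray as xr

a = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray([10.0, 20.0, 30.0], dims="x", coords={"x": [1, 2, 3]})

# Arithmetic aligns on the shared labels x = [1, 2]
result = a + b
# result.values -> [12.0, 23.0]
```

No manual reindexing was needed; Xarray matched the `x` labels and dropped the non-overlapping positions.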

When to Use This Skill

Use Xarray when working with:

  • Climate and weather data with dimensions like time, latitude, longitude, and altitude
  • Satellite and remote sensing imagery with spatial and temporal dimensions
  • Oceanographic data with depth, time, and spatial coordinates
  • Experimental measurements with multiple parameters and conditions
  • Simulation outputs with complex dimensional structures
  • Time series data that varies across multiple spatial locations
  • Any data where keeping track of dimensions and coordinates is critical
  • Large datasets that benefit from lazy loading and Dask integration

Core Concepts

1. DataArray: Labeled Multidimensional Arrays

A DataArray is Xarray's fundamental data structure - think of it as a NumPy array with labels and metadata.

Anatomy of a DataArray:

import xarray as xr
import numpy as np
import pandas as pd

# Create a DataArray
temperature = xr.DataArray(
    data=np.array([[15.2, 16.1, 14.8],
                   [16.5, 17.2, 15.9],
                   [17.1, 18.0, 16.5]]),
    dims=["time", "location"],
    coords={
        "time": pd.date_range("2024-01-01", periods=3),
        "location": ["Station_A", "Station_B", "Station_C"],
        "lat": ("location", [40.7, 34.0, 41.8]),
        "lon": ("location", [-74.0, -118.2, -87.6])
    },
    attrs={
        "units": "Celsius",
        "description": "Daily average temperature"
    }
)

Key components:

  • data: The actual NumPy array
  • dims: Dimension names (like column names in Pandas)
  • coords: Coordinate labels for each dimension
  • attrs: Metadata dictionary

2. Dataset: Collection of DataArrays

A Dataset is like a dict of DataArrays that share dimensions - similar to a Pandas DataFrame but for N-dimensional data.

Example:

# Create a Dataset
ds = xr.Dataset({
    "temperature": (["time", "location"], np.random.randn(3, 4)),
    "humidity": (["time", "location"], np.random.rand(3, 4) * 100),
    "pressure": (["time", "location"], 1013 + np.random.randn(3, 4) * 10)
},
coords={
    "time": pd.date_range("2024-01-01", periods=3),
    "location": ["A", "B", "C", "D"]
})

3. Coordinates: Dimension Labels

Coordinates provide meaningful labels for array dimensions and enable label-based indexing.

Types of coordinates:

Dimension coordinates (1D, same name as dimension):

time_coord = pd.date_range("2024-01-01", periods=365)

Non-dimension coordinates (auxiliary information):

# Latitude/longitude for each station
coords = {
    "time": time_coord,
    "station": ["A", "B", "C"],
    "lat": ("station", [40.7, 34.0, 41.8]),
    "lon": ("station", [-74.0, -118.2, -87.6])
}
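Non-dimension coordinates cannot be passed to .sel() directly; one way to select by them is to promote the coordinate with swap_dims. A minimal sketch with hypothetical station data:

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=["time", "station"],
    coords={
        "time": pd.date_range("2024-01-01", periods=2),
        "station": ["A", "B", "C"],
        "lat": ("station", [40.7, 34.0, 41.8]),  # non-dimension coordinate
    },
)

# Promote "lat" to a dimension coordinate, then select by latitude
by_lat = da.swap_dims({"station": "lat"})
subset = by_lat.sel(lat=34.0)   # all times at the station with lat 34.0
```

The original "station" labels remain attached as a non-dimension coordinate after the swap, so no information is lost.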

4. Indexing and Selection

Xarray provides powerful label-based and position-based indexing.

Label-based selection (.sel):

# Select by coordinate value
ds.sel(time="2024-01-15")
ds.sel(location="Station_A")

# Nearest neighbor selection
ds.sel(time="2024-01-15", method="nearest")

# Range selection
ds.sel(time=slice("2024-01-01", "2024-01-31"))

Position-based selection (.isel):

# Select by integer position
ds.isel(time=0)
ds.isel(location=[0, 2])

Boolean indexing (.where):

# Keep only values meeting condition
ds.where(ds["temperature"] > 15, drop=True)
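A minimal sketch of the difference between masking and dropping with .where():

```python
import numpy as np
import xarray as xr

da = xr.DataArray([10.0, 16.0, 20.0], dims="t")

masked = da.where(da > 15)              # same shape, NaN where condition is False
dropped = da.where(da > 15, drop=True)  # shrinks along "t" to the matching values
```

Masking preserves alignment with other arrays; dropping is convenient when you only need the surviving values.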

5. DataTree: Hierarchical Data Organization

DataTree is Xarray's class for organizing hierarchical (tree-structured) data. Think of it as a filesystem for datasets, where each node can contain a dataset and child nodes.

When to use DataTree:

  • Data at different resolutions or scales (multi-resolution imagery)
  • Measurements from multiple sensor types on the same system
  • Multi-model ensemble outputs with varying configurations
  • Experimental data with multiple trials or parameter sweeps
  • Heterogeneous data combining different domains or data types

Creating a DataTree:

import xarray as xr

# From a dictionary of datasets
dt = xr.DataTree.from_dict({
    "/": xr.Dataset(attrs={"description": "Root metadata"}),
    "/observations": xr.Dataset({"temp": (["time"], [15.2, 16.1, 14.8])}),
    "/observations/station_a": xr.Dataset(attrs={"location": "New York"}),
    "/observations/station_b": xr.Dataset(attrs={"location": "Los Angeles"}),
    "/model_outputs": xr.Dataset({"predicted_temp": (["time"], [15.0, 16.0, 15.0])})
})

# Access nodes using filesystem-like paths
print(dt["/observations/station_a"])
print(dt["observations"]["station_a"])  # Equivalent

Key DataTree operations:

# Navigate the tree
dt.parent  # Get parent node
dt.children  # Get child nodes dict
dt.subtree  # Iterate over all descendant nodes
dt.leaves  # Get all leaf nodes

# Apply operations across all datasets
dt.mean(dim="time")  # Apply to all nodes

# Map custom functions
dt.map_over_datasets(lambda ds: ds + 273.15)

# Filter nodes
dt.match("*/station_*")  # Pattern matching
dt.filter(lambda node: "temp" in node.ds.data_vars)  # Content-based filtering

# Coordinate inheritance (child nodes inherit parent coordinates)
# Define coordinates once at parent level, accessible in all children

Combining DataTrees:

# Arithmetic operations on isomorphic trees
dt1 + dt2  # Add corresponding datasets at each node

# Check structure compatibility
dt1.isomorphic(dt2)  # Returns True if same structure

6. Ecosystem Extensions

Xarray has a rich ecosystem of extensions for domain-specific workflows. For geospatial data analysis, prioritize rioxarray over vanilla Xarray.

Key geospatial extensions:

rioxarray - Geospatial raster operations:

import rioxarray

# Open raster with CRS (Coordinate Reference System) awareness
ds = rioxarray.open_rasterio("satellite_image.tif")

# Reproject to different CRS
ds_reprojected = ds.rio.reproject("EPSG:4326")

# Clip to bounding box
ds_clipped = ds.rio.clip_box(minx=-120, miny=35, maxx=-115, maxy=40)

# Write with CRS metadata
ds.rio.to_raster("output.tif")

Other useful extensions:

  • xESMF: Universal regridder for geospatial data (grid transformations)
  • Geocube: Convert vector data (GeoDataFrames) to raster (Xarray)
  • xarray-spatial: Numba-accelerated raster analytics (NDVI, terrain analysis)
  • Salem: Geolocation-based subsetting and masking

When to use which:

  • General geospatial rasters → rioxarray
  • Regridding between different grids → xESMF
  • Vector to raster conversion → Geocube
  • Fast raster computations → xarray-spatial
  • Geolocation operations → Salem

Patterns

See references/PATTERNS.md for detailed patterns including:

  • Creating DataArrays and Datasets
  • Reading and writing data
  • Selection and indexing
  • Computation and aggregation
  • Combining datasets
  • Dask integration for large data
  • Interpolation and regridding
  • Custom functions with apply_ufunc
  • Working with DataTree (hierarchical data)
  • Geospatial operations with rioxarray

Real-World Examples

See references/EXAMPLES.md for complete examples including:

  • Climate data analysis
  • Satellite data processing
  • Oceanographic data analysis
  • Multi-model ensemble analysis
  • Time series decomposition
  • Hierarchical climate model data with DataTree
  • Geospatial satellite data processing with rioxarray

Common Issues and Solutions

See references/COMMON_ISSUES.md for solutions to:

  • Memory errors with large datasets
  • Misaligned coordinates
  • Slow operations on chunked data
  • Coordinate precision issues
  • Dimension order confusion
  • Broadcasting errors
  • Encoding issues when saving
  • Time coordinate parsing issues

Best Practices Checklist

Data Organization

  • Use meaningful dimension and coordinate names
  • Include units and descriptions in attrs
  • Use standard dimension names (time, lat, lon, etc.) when applicable
  • Keep coordinates sorted for better performance
  • Use appropriate data types (float32 vs float64)

Performance

  • Chunk large datasets appropriately for your operations
  • Use lazy loading with open_dataset(chunks=...)
  • Avoid loading entire dataset into memory unnecessarily
  • Use vectorized operations instead of loops
  • Consider using float32 instead of float64 for large datasets

File I/O

  • Use NetCDF4 for general scientific data
  • Use Zarr for cloud storage and parallel writes
  • Include metadata (attrs) when saving
  • Use compression for large datasets
  • Document coordinate reference systems for geospatial data

Code Quality

  • Use .sel() for label-based indexing (more readable)
  • Chain operations for clarity
  • Use meaningful variable names
  • Add type hints for function parameters
  • Document expected dimensions in docstrings

Computation

  • Use built-in methods (.mean(), .sum()) over manual loops
  • Leverage groupby for categorical aggregations
  • Use .compute() explicitly with Dask
  • Monitor memory usage with large datasets
  • Use .persist() to cache intermediate results
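The groupby and resample items in the checklist above can be sketched with a small synthetic monthly series:

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    np.arange(12.0),
    dims="time",
    coords={"time": pd.date_range("2024-01-01", periods=12, freq="MS")},
)

by_month = da.groupby("time.month").mean()   # categorical: one value per calendar month
quarterly = da.resample(time="QS").mean()    # temporal: mean over quarter-start bins
```

groupby aggregates by a categorical key regardless of order, while resample bins along the time axis itself.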


Summary

Xarray is the go-to library for working with labeled multidimensional arrays in scientific Python. It combines the power of NumPy arrays with the convenience of Pandas labels, making it ideal for climate data, satellite imagery, experimental measurements, and any data with multiple dimensions.

Key takeaways:

  • Use DataArrays for single variables, Datasets for collections
  • Label-based indexing (.sel) is more readable than position-based
  • Leverage automatic alignment for operations between datasets
  • Use chunking and Dask for datasets larger than memory
  • NetCDF and Zarr are the preferred formats for scientific data
  • GroupBy and resample enable powerful temporal aggregations
  • Xarray integrates seamlessly with NumPy, Pandas, and Dask

Next steps:

  • Start with small datasets to learn the API
  • Use .sel() and .isel() for intuitive data selection
  • Explore groupby operations for categorical analysis
  • Learn chunking strategies for your specific use case
  • Integrate with domain-specific tools (Cartopy, Dask, etc.)

Xarray transforms complex multidimensional data analysis into intuitive, readable code while maintaining high performance and scalability.


Skill Information

Category: Technical
Last Updated: 1/24/2026